Read document as stream #924

daveaglick · 2021-01-01T17:17:59Z

New Feature Proposal

Description

Right now we can take a document (or node) and write it somewhere using .ToHtml(). What I'd like to do is avoid copying the document into an intermediate object altogether and read directly from the document structure instead, i.e. something like .ReadAsStream(). Then I'll be able to pass around my IHtmlDocument once I've gone to the trouble of parsing it and avoid the allocations and extra work of serializing it back to a string.

Maybe I'm missing how this can be done out of the box? Or perhaps you've got some pointers if I take this on?

FlorianRappl · 2021-01-01T18:40:08Z

You can pass a TextWriter instance to ToHtml. A TextWriter may hold an underlying stream. Is that what you've been looking for?

daveaglick · 2021-01-01T18:55:58Z

Not quite - that will crawl the node tree, writing to the TextWriter (and underlying Stream) all at once. What I'm thinking about is reading as a stream on the fly.

The way I think it might work is a special stream that uses a buffer and descends the nodes, filling the buffer and reading from that as it goes. If you only want to read the first n bytes for example, the entire tree doesn't have to be descended and serialized.

Does that make any sense? It's entirely possible I'm talking myself in circles :)
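
The "special stream" described above can be illustrated with a small, language-neutral sketch. It is written in Python rather than C#, and nothing in it is AngleSharp API: `Element`, `serialize`, `LazyMarkupStream`, and the `visited` list are hypothetical names invented for the illustration, and text is not escaped. The point is only that a generator-driven tree walk advances when the reader asks for more bytes, so reading the first n bytes does not descend the whole tree.

```python
import io
from dataclasses import dataclass, field
from typing import Iterator, List, Optional, Union


@dataclass
class Element:
    """Toy stand-in for a DOM element (hypothetical, not an AngleSharp type)."""
    tag: str
    children: List[Union["Element", str]] = field(default_factory=list)


visited: List[str] = []  # records which elements the walk has entered so far


def serialize(node: Union[Element, str]) -> Iterator[bytes]:
    """Yield the markup for `node` piece by piece; nothing runs until pulled."""
    if isinstance(node, str):
        yield node.encode("utf-8")  # toy: no escaping of text content
        return
    visited.append(node.tag)
    yield f"<{node.tag}>".encode("utf-8")
    for child in node.children:
        yield from serialize(child)
    yield f"</{node.tag}>".encode("utf-8")


class LazyMarkupStream(io.RawIOBase):
    """Read-only stream that advances the tree walk only when bytes are needed."""

    def __init__(self, chunks: Iterator[bytes]) -> None:
        super().__init__()
        self._chunks = chunks
        self._pending = b""  # bytes already produced but not yet read

    def readable(self) -> bool:
        return True

    def readinto(self, buffer) -> int:
        written = 0
        while written < len(buffer):
            if not self._pending:
                chunk: Optional[bytes] = next(self._chunks, None)
                if chunk is None:
                    break  # the walk is finished: end of stream
                self._pending = chunk
                continue
            count = min(len(buffer) - written, len(self._pending))
            buffer[written:written + count] = self._pending[:count]
            self._pending = self._pending[count:]
            written += count
        return written


doc = Element("html", [
    Element("head", [Element("title", ["Hi"])]),
    Element("body", [Element("p", ["Hello"]), Element("footer", ["Bye"])]),
])

stream = LazyMarkupStream(serialize(doc))
first = stream.read(12)
print(first)    # b'<html><head>'
print(visited)  # ['html', 'head']
```

After the 12-byte read only `html` and `head` have been entered; `body` and its children are untouched until a later read asks for more. A real implementation would still need to decide what happens if the tree is mutated while a stream is partially read, which this sketch ignores.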

FlorianRappl · 2021-01-01T18:59:01Z

Hm I am not sure. If you want to consume HTML you can just pass in a Stream. The HtmlParser and all related code (e.g., BrowsingContext's OpenAsync) all support reading from a Stream.

Maybe I don't see what you want or what you're after. Either way, you'd always need to descend the tree. If you want to stop at a specific point, I'd suggest writing your own IMarkupFormatter for this. There you can stop at any time.

Edit: Right now I have a hard time understanding whether you want to parse from a stream ("read as stream") or output to a stream (which would rather be "write as a stream" or "stream out"). The last two have nothing directly to do with the DOM: the DOM is an object model, and serialization happens via ToHtml, which is in the hands of the respective IMarkupFormatter instance.
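
One possible reading of the "own formatter that can stop any time" suggestion is sketched below. This is a Python illustration with toy types, not AngleSharp's IMarkupFormatter interface: `Element`, `TruncatingFormatter`, and `to_markup` are hypothetical names, and the walker here is the sketch's own, so whether AngleSharp's built-in serializer can be cut short the same way is an assumption the thread does not settle.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Element:
    """Toy stand-in for a DOM element (hypothetical, not an AngleSharp type)."""
    tag: str
    children: List[Union["Element", str]] = field(default_factory=list)


class TruncatingFormatter:
    """Hypothetical formatter that stops producing markup after `limit` characters."""

    def __init__(self, limit: int) -> None:
        self.remaining = max(limit, 0)

    @property
    def exhausted(self) -> bool:
        return self.remaining <= 0

    def _take(self, markup: str) -> str:
        # Hand back only as much markup as the remaining budget allows.
        piece = markup[: self.remaining]
        self.remaining -= len(piece)
        return piece

    def open_tag(self, element: Element) -> str:
        return self._take(f"<{element.tag}>")

    def close_tag(self, element: Element) -> str:
        return self._take(f"</{element.tag}>")

    def text(self, value: str) -> str:
        return self._take(value)  # toy: no escaping of text content


def to_markup(node: Union[Element, str], formatter: TruncatingFormatter,
              out: List[str]) -> None:
    """Append formatted pieces to `out`, abandoning the walk once the budget is spent."""
    if formatter.exhausted:
        return
    if isinstance(node, str):
        out.append(formatter.text(node))
        return
    out.append(formatter.open_tag(node))
    for child in node.children:
        if formatter.exhausted:
            return  # stop descending: nothing more would be emitted
        to_markup(child, formatter, out)
    out.append(formatter.close_tag(node))


doc = Element("html", [
    Element("body", [Element("p", ["Hello"]), Element("p", ["World"])]),
])

pieces: List[str] = []
to_markup(doc, TruncatingFormatter(limit=15), pieces)
truncated = "".join(pieces)
print(truncated)  # <html><body><p>
```

The truncated output is not well-formed markup, so this approach fits cases like previews or sniffing the start of a document rather than producing content that will be parsed again.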

daveaglick · 2021-01-01T19:16:11Z

Maybe it would help if I explained the specific use case. In Statiq each document has a "content provider" that can be used to read the content of that document by anything that needs to process it. For example, a document based on a file from disk will have a content provider that reads the file as a stream. A document created from a string will have a StringReader-based content provider and so on.

In the course of processing documents, some operations result in constructing an AngleSharp IHtmlDocument instance and then mutating it. Right now I serialize that document back out to either disk or a MemoryStream so that the content can be provided later. I’ve noticed some memory pressure due to this cycle - I have some content (file, string, etc.), create an AngleSharp node tree and do some stuff, serialize the result, and then later read that serialized HTML content back as a stream. I’d like to cut out the middle part where I have to serialize the HTML to an intermediate object (string, MemoryStream, file, etc.) - I’ve already allocated everything I need to read it within the node tree, and the extra holding object for the fully serialized content seems unnecessary. Instead I want to wrap the IHtmlDocument as a content provider itself and provide a readable stream that descends the nodes directly whenever something else needs to “read” it.

Did that make any more sense? I’m planning on looking at this myself, so I’m more wondering whether you’ve got any tips than making an actual feature request.

daveaglick added the enhancement label Jan 1, 2021

FlorianRappl added this to the 0.17.0 milestone Dec 1, 2021

FlorianRappl removed this from the 0.17.0 milestone May 31, 2022

FlorianRappl added the help-wanted, up-for-grabs, and more-infos-needed labels May 31, 2022
