Read document as stream #924

Open
daveaglick opened this issue Jan 1, 2021 · 4 comments

Comments

@daveaglick
Contributor

New Feature Proposal

Description

Right now we can take a document (or node) and write it somewhere using .ToHtml(). What I'd like to do is avoid copying the document to something altogether and pull from the document structure instead. I.e. something like .ReadAsStream(). Then I'll be able to pass around my IHtmlDocument once I've gone to the trouble of parsing it and avoid the allocations and extra work of serializing it back to a string.

Maybe I'm missing how this can be done out of the box? Or perhaps you've got some pointers if I take this on?

@FlorianRappl
Contributor

You can pass the ToHtml a TextWriter instance. A TextWriter may hold an underlying stream. Is that what you've been looking for?
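For reference, a minimal sketch of that suggestion, assuming the AngleSharp package is referenced. The sample markup and the `MemoryStream` target are illustrative choices, not something from the thread. Note that the whole tree is serialized into the stream in a single call:

```csharp
using System;
using System.IO;
using System.Text;
using AngleSharp;              // IMarkupFormattable.ToHtml
using AngleSharp.Html;         // HtmlMarkupFormatter
using AngleSharp.Html.Parser;  // HtmlParser

// Parse some sample markup (illustrative input only).
var document = new HtmlParser().ParseDocument("<!DOCTYPE html><p>Hello</p>");

// The TextWriter wraps an underlying stream; here a MemoryStream stands in
// for whatever target stream is actually wanted (file, network, ...).
using var stream = new MemoryStream();
using (var writer = new StreamWriter(stream, new UTF8Encoding(false), 1024, leaveOpen: true))
{
    // Walks the entire node tree and writes everything in one go.
    document.ToHtml(writer, HtmlMarkupFormatter.Instance);
}

// Rewind so the serialized markup can be read back from the stream.
stream.Position = 0;
using var reader = new StreamReader(stream, Encoding.UTF8);
var html = reader.ReadToEnd();
Console.WriteLine(html);
```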

@daveaglick
Contributor Author

Not quite - that will crawl the node tree, writing to the TextWriter (and underlying Stream) all at once. What I'm thinking about is reading as a stream on the fly.

The way I think it might work is a special stream that uses a buffer and descends the nodes, filling the buffer and reading from that as it goes. If you only want to read the first n bytes for example, the entire tree doesn't have to be descended and serialized.
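A rough sketch of that buffering idea, using only the .NET base class library. `LazyFragmentStream` is a hypothetical name, not an AngleSharp type. It assumes the markup arrives as a lazy sequence of string fragments (for example from an iterator that walks the node tree and asks a formatter for each piece; that walker is not shown here and is not an existing API). It also assumes UTF-8 output and that no fragment ends in the middle of a surrogate pair:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

// Hypothetical helper: a read-only stream that pulls fragments on demand.
public sealed class LazyFragmentStream : Stream
{
    private readonly IEnumerator<string> _fragments;
    private readonly Encoding _encoding = new UTF8Encoding(false);
    private byte[] _buffer = Array.Empty<byte>();
    private int _offset;

    public LazyFragmentStream(IEnumerable<string> fragments)
    {
        // Nothing is enumerated yet; the source only advances inside Read.
        _fragments = fragments.GetEnumerator();
    }

    public override int Read(byte[] buffer, int offset, int count)
    {
        var written = 0;
        while (written < count)
        {
            if (_offset == _buffer.Length)
            {
                // Internal buffer drained: pull exactly one more fragment.
                if (!_fragments.MoveNext()) break;
                _buffer = _encoding.GetBytes(_fragments.Current);
                _offset = 0;
                continue;
            }

            // Copy as much of the current fragment as the caller asked for.
            var n = Math.Min(count - written, _buffer.Length - _offset);
            Array.Copy(_buffer, _offset, buffer, offset + written, n);
            _offset += n;
            written += n;
        }
        return written; // 0 signals end of stream
    }

    public override bool CanRead => true;
    public override bool CanSeek => false;
    public override bool CanWrite => false;
    public override long Length => throw new NotSupportedException();
    public override long Position
    {
        get => throw new NotSupportedException();
        set => throw new NotSupportedException();
    }
    public override void Flush() { }
    public override long Seek(long offset, SeekOrigin origin) => throw new NotSupportedException();
    public override void SetLength(long value) => throw new NotSupportedException();
    public override void Write(byte[] buffer, int offset, int count) => throw new NotSupportedException();

    protected override void Dispose(bool disposing)
    {
        if (disposing) _fragments.Dispose();
        base.Dispose(disposing);
    }
}
```

Because the fragment source is only advanced inside `Read`, a caller that reads the first n bytes and then disposes the stream never forces the rest of the sequence to be produced. Seeking is deliberately unsupported, since going backwards would require either buffering everything or restarting the walk.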

Does that make any sense? It's entirely possible I'm talking myself in circles :)

@FlorianRappl
Contributor

FlorianRappl commented Jan 1, 2021

Hm I am not sure. If you want to consume HTML you can just pass in a Stream. The HtmlParser and all related code (e.g., BrowsingContext's OpenAsync) all support reading from a Stream.
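As an illustration of the consuming direction, a minimal sketch assuming the AngleSharp package is referenced. The in-memory stream and its markup are stand-ins for a real file or network stream:

```csharp
using System;
using System.IO;
using System.Text;
using AngleSharp.Html.Parser;

// Any readable Stream works as input; a MemoryStream is used here for brevity.
using var input = new MemoryStream(
    Encoding.UTF8.GetBytes("<!DOCTYPE html><title>Test</title><p>Hello</p>"));

// The parser reads directly from the stream and builds the DOM.
var document = new HtmlParser().ParseDocument(input);
Console.WriteLine(document.Title);
```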

Maybe I don't see what you want or what you're after, but you'd always need to descend the tree. If you want to stop at a specific point, I'd suggest writing your own IMarkupFormatter for this. There you can stop at any time.

Edit: Right now I have a hard time understanding whether you want to parse it from a stream ("read as stream") or output to a stream (that would rather be "write as a stream" or "stream out"). The last two have nothing directly to do with the DOM, as the DOM is an object model; serialization happens via ToHtml, which is in the hands of the respective IMarkupFormatter instance.

@daveaglick
Contributor Author

Maybe it would help if I explained the specific use case. In Statiq each document has a "content provider" that can be used to read the content of that document by anything that needs to process it. For example, a document based on a file from disk will have a content provider that reads the file as a stream. A document created from a string will have a StringReader-based content provider and so on.

In the course of processing documents, some operations result in constructing an AngleSharp IHtmlDocument instance and then mutating it. Right now I serialize that document back out to either disk or a MemoryStream so that the content can be provided later. I’ve noticed some memory pressure due to this cycle - I have some content (file, string, etc.), create an AngleSharp node tree and do some stuff, serialize to a string, then later read that serialized HTML content back as a stream. I’d like to cut out the middle part where I have to serialize the HTML to an intermediate object (string, MemoryStream, file, etc.) - I’ve already allocated everything I need to read it within the node tree, and the extra holding object for the fully serialized content seems unnecessary. Instead I want to wrap the IHtmlDocument as a content provider itself and provide a readable stream that descends the nodes directly whenever something else needs to “read” it.

Did that make any more sense? I’m planning on looking at this myself, so more wondering if you’ve got any tips rather than an actual feature request.
