Read document as stream #924
Comments
You can pass the
Not quite - that will crawl the node tree, writing to the TextWriter (and underlying Stream) all at once. What I'm thinking about is reading as a stream on the fly. The way I think it might work is a special stream that uses a buffer and descends the nodes, filling the buffer and reading from that as it goes. If you only want to read the first n bytes for example, the entire tree doesn't have to be descended and serialized. Does that make any sense? It's entirely possible I'm talking myself in circles :)
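To make the idea above concrete, here is a minimal sketch of a pull-based stream that descends a node tree only as far as the reader asks. It deliberately does not use AngleSharp: `Node`, `Markup.Fragments`, and `FragmentStream` are hypothetical names invented for illustration, and the serialization is simplified (no attributes, no escaping). With a real DOM, the fragments would have to come from whatever the library's formatter produces for each node.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

// Hypothetical stand-in for a DOM node; NOT an AngleSharp type.
sealed class Node
{
    public string Tag;    // null for text nodes
    public string Text;   // used only when Tag is null
    public List<Node> Children = new List<Node>();
}

static class Markup
{
    // Lazily yields markup fragments in document order (depth-first).
    // A subtree is not visited until the consumer asks for more fragments.
    public static IEnumerable<string> Fragments(Node node)
    {
        if (node.Tag == null)
        {
            yield return node.Text;
            yield break;
        }
        yield return "<" + node.Tag + ">";
        foreach (var child in node.Children)
            foreach (var fragment in Fragments(child))
                yield return fragment;
        yield return "</" + node.Tag + ">";
    }
}

// Read-only, forward-only stream that encodes one fragment at a time,
// so only the current fragment is ever held as bytes.
sealed class FragmentStream : Stream
{
    private readonly IEnumerator<string> _fragments;
    private byte[] _current = Array.Empty<byte>();
    private int _offset;

    public FragmentStream(IEnumerable<string> fragments)
    {
        _fragments = fragments.GetEnumerator();
    }

    public override int Read(byte[] buffer, int offset, int count)
    {
        int written = 0;
        while (written < count)
        {
            if (_offset == _current.Length)
            {
                if (!_fragments.MoveNext()) break;   // tree exhausted
                _current = Encoding.UTF8.GetBytes(_fragments.Current);
                _offset = 0;
                continue;
            }
            int n = Math.Min(count - written, _current.Length - _offset);
            Buffer.BlockCopy(_current, _offset, buffer, offset + written, n);
            _offset += n;
            written += n;
        }
        return written;   // 0 signals end of stream
    }

    public override bool CanRead => true;
    public override bool CanSeek => false;
    public override bool CanWrite => false;
    public override long Length => throw new NotSupportedException();
    public override long Position
    {
        get => throw new NotSupportedException();
        set => throw new NotSupportedException();
    }
    public override void Flush() { }
    public override long Seek(long offset, SeekOrigin origin) => throw new NotSupportedException();
    public override void SetLength(long value) => throw new NotSupportedException();
    public override void Write(byte[] buffer, int offset, int count) => throw new NotSupportedException();

    protected override void Dispose(bool disposing)
    {
        if (disposing) _fragments.Dispose();
        base.Dispose(disposing);
    }
}

static class Demo
{
    static void Main()
    {
        var doc = new Node { Tag = "p" };
        doc.Children.Add(new Node { Text = "hello" });
        using var stream = new FragmentStream(Markup.Fragments(doc));
        using var reader = new StreamReader(stream, Encoding.UTF8);
        Console.WriteLine(reader.ReadToEnd());   // prints <p>hello</p>
    }
}
```

Because the iterator is lazy, a caller that reads only the first few bytes never walks the rest of the tree. The trade-off in this sketch is that the stream is forward-only and has no known length, and mutating the tree while it is being read would be unsafe.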
Hm, I am not sure. If you want to consume HTML you can just pass in a Maybe I don't see what you want or what you are after. But you'd always need to descend the tree. If you want to stop at a specific point in time then I'd suggest having your own

Edit: Right now I have a hard time understanding whether you want to parse it from a stream ("read as stream") or output to a stream (that would rather be "write as a stream" or "stream out"). The last two have nothing directly to do with the DOM, as the DOM is an object model; a serialization happens via
Maybe it would help if I explained the specific use case. In Statiq each document has a "content provider" that anything needing to process the document can use to read its content. For example, a document based on a file from disk will have a content provider that reads the file as a stream. A document created from a string will have a StringReader-based content provider, and so on.

In the course of processing documents, some operations construct an AngleSharp IHtmlDocument instance and then mutate it. Right now I serialize that document back out to either disk or a MemoryStream so that the content can be provided later. I’ve noticed some memory pressure due to this cycle: I have some content (file, string, etc.), create an AngleSharp node tree and do some stuff, serialize it to a string, then later read that serialized HTML content back as a stream.

I’d like to cut out the middle part where I have to serialize the HTML to an intermediate object (string, MemoryStream, file, etc.). I’ve already allocated everything I need to read it within the node tree, and the extra holding object for the fully serialized content seems unnecessary. Instead I want to wrap the IHtmlDocument as a content provider itself and provide a readable stream that descends the nodes directly whenever something else needs to “read” it.

Did that make any more sense? I’m planning on looking at this myself, so I'm more wondering if you’ve got any tips than making an actual feature request.
New Feature Proposal
Description
Right now we can take a document (or node) and write it somewhere using `.ToHtml()`. What I'd like to do is avoid copying the document to something altogether and pull from the document structure instead, i.e. something like `.ReadAsStream()`. Then I'll be able to pass around my `IHtmlDocument` once I've gone to the trouble of parsing it and avoid the allocations and extra work of serializing it back to a string.

Maybe I'm missing how this can be done out of the box? Or perhaps you've got some pointers if I take this on?