Fast parsing and selector query #97

Open
VoidMonk opened this issue Jun 17, 2015 · 5 comments

Comments

@VoidMonk

Hi,

I'm trying the following code, but in some cases the parsing, the selector query, or both are slower than with CsQuery:

// Configuration with the default loader enabled.
var config = new Configuration().WithDefaultLoader();
AngleSharp.Dom.Html.IHtmlDocument document;
AngleSharp.Dom.IHtmlCollection<AngleSharp.Dom.IElement> elements;

// Parse the HTML string into a document.
document = new AngleSharp.Parser.Html.HtmlParser(html, config).Parse();
// Run the CSS selector against the parsed document.
elements = document.QuerySelectorAll(selectorPath);

What configuration or other optimizations would result in the fastest parsing and selector query in AngleSharp? If it matters, I don't want to parse the CSS or JS, just the HTML.

@FlorianRappl
Contributor

How are you doing your measurements? AngleSharp is larger and the code inside takes longer to JIT, so using NGEN or some other technique will certainly reduce first-time overhead.

Also, it could be that the selector query takes longer in AngleSharp v0.8.5, since the CSS parser got (a lot) slower (maybe even by an order of magnitude). It was expected to be slower, but not that severely, so optimizations will take place here. This may also affect the creation of the ISelector instances, since the CSS parser (mostly the tokenizer) is used for that.

What CsQuery currently does (and what AngleSharp will do in the future) is to use some sophisticated hashing to make queries even faster. So it could also just be that the case you are looking at (most probably a [very] large page) is just a perfect example where this hashing is beneficial.
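
To illustrate the general idea of index-backed queries, here is a minimal sketch. It is not CsQuery's or AngleSharp's implementation: `Node` and `TagIndex` are hypothetical types invented for this example, and only tag-name lookups are covered. The index is built once in a single tree walk, after which a tag query is a dictionary lookup instead of a full traversal.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical, simplified element type (not an AngleSharp or CsQuery class).
public sealed class Node
{
    public string Tag { get; }
    public List<Node> Children { get; } = new List<Node>();

    public Node(string tag, params Node[] children)
    {
        Tag = tag;
        Children.AddRange(children);
    }
}

// Hypothetical index: maps a tag name to all nodes with that tag, in document order.
public sealed class TagIndex
{
    private readonly Dictionary<string, List<Node>> _byTag =
        new Dictionary<string, List<Node>>(StringComparer.OrdinalIgnoreCase);

    public TagIndex(Node root)
    {
        Add(root);
    }

    // Pre-order walk, so each bucket keeps its nodes in document order.
    private void Add(Node node)
    {
        List<Node> bucket;
        if (!_byTag.TryGetValue(node.Tag, out bucket))
        {
            bucket = new List<Node>();
            _byTag[node.Tag] = bucket;
        }
        bucket.Add(node);

        foreach (var child in node.Children)
            Add(child);
    }

    // Dictionary lookup instead of a tree traversal; empty list if the tag is absent.
    public IReadOnlyList<Node> ByTag(string tag)
    {
        List<Node> bucket;
        return _byTag.TryGetValue(tag, out bucket)
            ? (IReadOnlyList<Node>)bucket
            : new Node[0];
    }
}
```

The trade-off is the one discussed in this thread: building the index costs extra time and memory during parsing, which tends to pay off only when the page is large or queried many times.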

Hope this helps a bit.

@FlorianRappl FlorianRappl added this to the v0.9 milestone Jun 17, 2015
@VoidMonk
Author

Thanks for sharing your thoughts.

I used System.Diagnostics.Stopwatch (high-resolution timer) to measure the processing time and GC.GetTotalMemory to measure the memory usage. The test code downloaded the webpages with System.Net.WebClient rather than the library's built-in methods, and the download was not part of the benchmark. AngleSharp (HtmlParser and QuerySelectorAll) took longer and used more memory than CsQuery (with Simple Index) on the following test webpages (a mix of small and large):

http://www.amazon.com (370 KB)
http://www.reddit.com (108 KB)
http://www.w3.org/TR/html5/single-page.html (5810 KB)
http://en.wikipedia.org/wiki/South_African_labour_law (576 KB)
http://www.time.com (108 KB)

Separate tests were performed using two arbitrary selector queries ('a[href]' and 'div > p > a') for each webpage.

I think AngleSharp is promising and it's good to see it being continuously improved. Although CsQuery is no longer being maintained (ref: jamietre/CsQuery#173), it seems to perform better. I'm looking forward to further improvements to AngleSharp's CSS parser and will then run our tests again to evaluate its use in our product.

@FlorianRappl
Contributor

So your test is basically a combination of parsing and querying? I think this is an interesting and practical scenario. Will be taken into consideration and used for further improvements.

@FlorianRappl
Contributor

Hm, I still think the code you are measuring has not been JIT-compiled before the measurement starts. Also, do you use warm-up iterations or multiple runs to exclude or minimize the effect of outliers? I am asking because I set up a combined test and get different results overall.

What is definitely true, however, is that the very large sites (especially the single page version of the HTML5 spec) perform better with CsQuery. Right now it seems that for such a large page the parser in CsQuery is performing much better (so the selectors may not even play such a critical role here). I will definitely investigate this and try to come up with an improved version.
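
The warm-up and multiple-run approach mentioned above could look roughly like the following. This is only a sketch, not the harness used by either participant: `Bench` and `Median` are hypothetical names, and the default iteration counts are arbitrary. The untimed warm-up calls give the JIT a chance to compile the code paths, and reporting the median keeps a single slow run from dominating the result.

```csharp
using System;
using System.Diagnostics;

// Hypothetical helper; name and defaults are assumptions made for this example.
public static class Bench
{
    // Runs `action` untimed `warmup` times, then times it `runs` times and
    // returns the median (the upper middle value when `runs` is even).
    public static TimeSpan Median(Action action, int warmup = 3, int runs = 11)
    {
        if (action == null) throw new ArgumentNullException(nameof(action));
        if (warmup < 0) throw new ArgumentOutOfRangeException(nameof(warmup));
        if (runs < 1) throw new ArgumentOutOfRangeException(nameof(runs));

        // Warm-up: results are discarded, only the JIT and cache effects matter.
        for (var i = 0; i < warmup; i++)
            action();

        var ticks = new long[runs];
        for (var i = 0; i < runs; i++)
        {
            var sw = Stopwatch.StartNew();
            action();
            sw.Stop();
            ticks[i] = sw.Elapsed.Ticks;
        }

        Array.Sort(ticks);
        return TimeSpan.FromTicks(ticks[runs / 2]);
    }
}
```

With the code from the original post, a call might be `Bench.Median(() => new AngleSharp.Parser.Html.HtmlParser(html, config).Parse())`, run once per library so both are measured under the same conditions.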

@VoidMonk
Author

Yes, my test code is a combination of parsing and querying, but I'm measuring and comparing the time taken for each separately (not as a single task). Most of the bigger time differences occur during parsing. Any time differences in querying are comparatively small, often negligible.
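
For reference, timing the two phases separately with Stopwatch and GC.GetTotalMemory could be sketched as below. This is not the actual test code from this thread: `PhaseBench` and `PhaseResult` are hypothetical names, and the parser and query are passed in as delegates so the same routine can wrap either library. The memory figure is a rough estimate of what the parsed document retains, not an exact measurement.

```csharp
using System;
using System.Diagnostics;

// Hypothetical result type for this example.
public sealed class PhaseResult
{
    public TimeSpan ParseTime { get; set; }
    public TimeSpan QueryTime { get; set; }
    public long ParseBytes { get; set; }   // approximate; may be skewed by other allocations
    public int MatchCount { get; set; }
}

// Hypothetical helper that times parsing and querying as two separate phases.
public static class PhaseBench
{
    public static PhaseResult Run<TDoc>(Func<TDoc> parse, Func<TDoc, int> query)
    {
        // Force a full collection so earlier garbage does not skew the baseline.
        long before = GC.GetTotalMemory(true);

        var sw = Stopwatch.StartNew();
        TDoc doc = parse();
        sw.Stop();
        TimeSpan parseTime = sw.Elapsed;

        // Collect again: what remains is roughly what the document keeps alive.
        long after = GC.GetTotalMemory(true);

        sw.Restart();
        int matches = query(doc);
        sw.Stop();

        return new PhaseResult
        {
            ParseTime = parseTime,
            QueryTime = sw.Elapsed,
            ParseBytes = after - before,
            MatchCount = matches
        };
    }
}
```

Using the calls from the original post, the delegates might be `() => new AngleSharp.Parser.Html.HtmlParser(html, config).Parse()` for parsing and `d => d.QuerySelectorAll(selectorPath).Length` for querying.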

2 participants