Fast parsing and selector query #97

VoidMonk · 2015-06-17T02:32:14Z

Hi,

I'm trying the following code, but in some cases the parsing, the selector query, or both are slower than with CsQuery:

var config = new Configuration().WithDefaultLoader();
AngleSharp.Dom.Html.IHtmlDocument document;
AngleSharp.Dom.IHtmlCollection<AngleSharp.Dom.IElement> elements;

// Parse the HTML, then run the selector query (html and selectorPath are defined elsewhere).
// Either step (or both) is slower than with CsQuery.
document = new AngleSharp.Parser.Html.HtmlParser(html, config).Parse();
elements = document.QuerySelectorAll(selectorPath);

What configuration or other optimizations would result in the fastest parsing and selector query in AngleSharp? If it matters, I don't want to parse the CSS or JS, just the HTML.

FlorianRappl · 2015-06-17T05:42:53Z

How are you doing your measurements? AngleSharp is larger and the code inside takes longer to JIT, so using NGEN or some other technique will certainly reduce first-time overhead.

Also, it could be that the selector query takes longer in AngleSharp v0.8.5, since the CSS parser got (a lot) slower (maybe even by an order of magnitude). It was expected to be slower, but not that severely, so optimizations will take place here. This may also affect the creation of the ISelector instances, since the CSS parser (mostly the tokenizer) is used for that.

What CsQuery currently does (and what AngleSharp will do in the future) is to use some sophisticated hashing to make queries even faster. So it could also just be that the case you are looking at (most probably a [very] large page) is just a perfect example where this hashing is beneficial.

Hope this helps a bit.

VoidMonk · 2015-06-18T02:52:45Z

Thanks for sharing your thoughts.

I used System.Diagnostics.Stopwatch (a high-resolution timer) to measure the processing time and GC.GetTotalMemory to measure the memory usage. The test code downloaded the webpages with System.Net.WebClient rather than the library's built-in methods, and the download was not part of the benchmark. AngleSharp (HtmlParser and QuerySelectorAll) took longer and used more memory than CsQuery (with Simple Index) on the following test webpages (a mix of small and large):

http://www.amazon.com (370 KB)
http://www.reddit.com (108 KB)
http://www.w3.org/TR/html5/single-page.html (5810 KB)
http://en.wikipedia.org/wiki/South_African_labour_law (576 KB)
http://www.time.com (108 KB)

Separate tests were performed using two arbitrary selector queries ('a[href]' and 'div > p > a') for each webpage.
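For readers who want to reproduce the memory side of such a measurement, one possible way to take it with GC.GetTotalMemory is sketched below. This is only an illustration under stated assumptions, not the code behind the numbers reported here: `MemoryProbe` and `RetainedBytes` are hypothetical names, and the reading is approximate because other allocations in the process can shift it.

```csharp
using System;

public static class MemoryProbe
{
    // Hypothetical helper: returns the approximate number of managed bytes
    // still held after `build` has produced its result (e.g. a parsed document).
    public static long RetainedBytes<T>(Func<T> build) where T : class
    {
        if (build == null) throw new ArgumentNullException(nameof(build));

        // Force a full collection so earlier garbage does not skew the baseline.
        long before = GC.GetTotalMemory(forceFullCollection: true);

        T result = build();

        // Collect again so only memory that is still reachable is counted.
        long after = GC.GetTotalMemory(forceFullCollection: true);

        // Keep the result reachable until after the second reading.
        GC.KeepAlive(result);

        return after - before;
    }
}
```

With the snippet from the first post, the delegate passed in would be the parse call, for example `() => new AngleSharp.Parser.Html.HtmlParser(html, config).Parse()`.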

I think AngleSharp is promising and it's good to see it being continuously improved. CsQuery is no longer being maintained (ref: jamietre/CsQuery#173), but it seems to perform better. I'm looking forward to further improvements to AngleSharp's CSS parser, after which we will run our tests again to evaluate its use in our product.

FlorianRappl · 2015-06-18T06:33:54Z

So your test is basically a combination of parsing and querying? I think this is an interesting and practical scenario. Will be taken into consideration and used for further improvements.

FlorianRappl · 2015-06-18T06:58:25Z

Hm, I still think the code you are using is not JIT-preprocessed. Also, do you use warm-up iterations or multiple runs to exclude / minimize the effect of outliers? I am just asking, since I set up a combination test and I get different results overall.
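A harness with warm-up iterations and multiple timed runs could look roughly like the following. It is a minimal sketch, not the benchmark code used in the AngleSharp repository: `Bench` and `MedianMilliseconds` are hypothetical names, the default counts are arbitrary, and the median is just one reasonable way to dampen outliers.

```csharp
using System;
using System.Diagnostics;

public static class Bench
{
    // Hypothetical helper: runs `work` untimed a few times so JIT compilation
    // happens outside the measurement, then returns the median wall-clock
    // time in milliseconds over `runs` timed runs.
    public static double MedianMilliseconds(Action work, int warmups = 3, int runs = 10)
    {
        if (work == null) throw new ArgumentNullException(nameof(work));
        if (warmups < 0) throw new ArgumentOutOfRangeException(nameof(warmups));
        if (runs < 1) throw new ArgumentOutOfRangeException(nameof(runs));

        // Warm-up: the first calls pay for JIT compilation and cold caches.
        for (int i = 0; i < warmups; i++) work();

        var samples = new double[runs];
        for (int i = 0; i < runs; i++)
        {
            // Start every timed run from a freshly collected heap.
            GC.Collect();
            GC.WaitForPendingFinalizers();
            GC.Collect();

            var watch = Stopwatch.StartNew();
            work();
            watch.Stop();
            samples[i] = watch.Elapsed.TotalMilliseconds;
        }

        // The median is less sensitive to single slow runs than the mean.
        Array.Sort(samples);
        int mid = runs / 2;
        return runs % 2 == 1 ? samples[mid] : (samples[mid - 1] + samples[mid]) / 2.0;
    }
}
```

Parsing and querying can then be timed separately by passing each step as its own delegate, for example `() => document.QuerySelectorAll(selectorPath)` for the query step from the first post.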

What is definitely true, however, is that the very large sites (especially the single page version of the HTML5 spec) perform better with CsQuery. Right now it seems that for such a large page the parser in CsQuery is performing much better (so the selectors may not even play such a critical role here). I will definitely investigate this and try to come up with an improved version.

VoidMonk · 2015-06-19T00:20:42Z

Yes, my test code is a combination of parsing and querying, but I'm measuring/comparing the time taken for both separately (not as a single task). Most of the bigger time differences occur during parsing. Any time differences in querying are within a smaller margin comparatively, often negligible.

FlorianRappl added the question label Jun 17, 2015

FlorianRappl added this to the v0.9 milestone Jun 17, 2015

FlorianRappl added the enhancement label Jun 17, 2015

FlorianRappl added a commit that referenced this issue Jun 18, 2015

Measure combined perf see #97

6b43c8e

FlorianRappl added a commit that referenced this issue Jun 18, 2015

Taken some of the sites from #97 for benchmarking

8bad0bd

FlorianRappl added a commit that referenced this issue Jun 18, 2015

Included larger sites as proposed in #97

97dede6

FlorianRappl self-assigned this Jun 18, 2015

FlorianRappl modified the milestones: v1.0, v0.9 Aug 25, 2015

FlorianRappl added the performance label May 26, 2016

FlorianRappl mentioned this issue Jul 11, 2016

Use AngleSharp For Strip Html - Performance #366

Closed

FlorianRappl modified the milestones: v1.0, vNext Mar 7, 2019
