Most efficient way to get matching element? #929

Open
derekantrican opened this issue Jan 13, 2021 · 5 comments
Comments

@derekantrican

I have an application that scrapes an entire website, and it runs in about 12 hours. It uses WebClient.DownloadString to get the HTML, then uses HtmlParser.ParseDocument to parse it. Then, I do a lot of other parsing on top of that. This happens for about 293,000 pages, so I'm trying to save any little bit of time that I can.

I've noticed that I've got a lot of places where I call IHtmlDocument.GetElementsByTagName(TAG).FirstOrDefault(PREDICATE). I believe I could collapse this into a single IHtmlDocument.QuerySelector(QUERY_SELECTOR) call, which in theory would be faster because it returns after the first match, but some preliminary testing has shown QuerySelector to be slower than the old method. For instance:

IElement element = doc.GetElementsByTagName("h2").FirstOrDefault(x => x.TextContent.Contains("Climbing Directory"));

takes about 1 ms, where

IElement element = doc.QuerySelector("h2:contains('Climbing Directory')");

takes about 23 ms.
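
For anyone who wants to reproduce the comparison, here is a minimal timing sketch. A single timed call can include one-time costs such as JIT compilation, so this version runs a warm-up call and then averages over many iterations. The sample markup, the class name `SelectorTiming`, the helper names, and the iteration count are assumptions made for illustration, not part of the original application.

```csharp
using System;
using System.Diagnostics;
using System.Linq;
using AngleSharp.Dom;
using AngleSharp.Html.Dom;
using AngleSharp.Html.Parser;

static class SelectorTiming
{
    // The "old method": filter the h2 elements with a LINQ predicate.
    public static IElement FindWithLinq(IHtmlDocument doc) =>
        doc.GetElementsByTagName("h2")
           .FirstOrDefault(x => x.TextContent.Contains("Climbing Directory"));

    // The selector-based method from the question.
    public static IElement FindWithSelector(IHtmlDocument doc) =>
        doc.QuerySelector("h2:contains('Climbing Directory')");

    // Average milliseconds per call, measured after one warm-up call.
    public static double AverageMs(Func<IElement> lookup, int iterations)
    {
        lookup(); // warm-up, excluded from the measurement
        var stopwatch = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            lookup();
        }
        stopwatch.Stop();
        return stopwatch.Elapsed.TotalMilliseconds / iterations;
    }

    static void Main()
    {
        // Hypothetical markup standing in for a downloaded page.
        string html = "<html><body><h2>Intro</h2><h2>Climbing Directory</h2></body></html>";
        IHtmlDocument doc = new HtmlParser().ParseDocument(html);

        const int iterations = 1000; // arbitrary choice for this sketch
        Console.WriteLine($"LINQ:          {AverageMs(() => FindWithLinq(doc), iterations):F4} ms/call");
        Console.WriteLine($"QuerySelector: {AverageMs(() => FindWithSelector(doc), iterations):F4} ms/call");
    }
}
```

The numbers will depend on the page size and the machine, so it is worth running this against a real downloaded page rather than the tiny sample above.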

Any suggestions for improving my code?

@derekantrican
Author

All my parsing code is here if you have any tips for improving efficiency: https://github.com/derekantrican/MountainProject/blob/master/MountainProjectAPI/Functions/Parsers.cs

@derekantrican
Author

Of course, with 12 hours for 293,000 items, maybe an average of 147 ms per item is about as good as it can get.
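
The per-item figure follows directly from the two numbers quoted in the thread (12 hours, 293,000 pages). A small sketch of the arithmetic, with the class and method names being illustrative only:

```csharp
using System;

static class PerPageBudget
{
    // Average wall-clock time per page for a run of the given total duration.
    public static double MillisecondsPerPage(TimeSpan total, int pages) =>
        total.TotalMilliseconds / pages;

    static void Main()
    {
        // 12 hours = 43,200,000 ms; divided by 293,000 pages this is roughly 147 ms.
        double ms = MillisecondsPerPage(TimeSpan.FromHours(12), 293_000);
        Console.WriteLine($"{ms:F1} ms per page");
    }
}
```

Note that this average covers everything done per page (download plus parsing), not only the element lookups.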

@FlorianRappl
Contributor

I'm afraid I don't have a good answer (#584).

This certainly could be improved at the QuerySelector level. I'm not sure whether :contains is the villain here, or whether the overall performance of QuerySelector is to blame...

@santoro-mariano

@derekantrican I know it won't improve AngleSharp's performance, but have you tried parallelizing some of those foreach loops with Parallel.ForEach?
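
A minimal sketch of what that suggestion could look like, assuming each page can be parsed independently of the others. `ParsePage` is a hypothetical stand-in for the real per-page work (it only returns the string length here), and `ParallelParsing`, `ParseAll`, and the sample data are likewise invented for illustration. Results are collected in a `ConcurrentDictionary` because multiple threads write to it at once.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

static class ParallelParsing
{
    // Hypothetical placeholder for the real per-page parsing work.
    static int ParsePage(string html) => html.Length;

    // Parses every page in parallel; keys are page URLs, values are page HTML.
    public static Dictionary<string, int> ParseAll(IReadOnlyDictionary<string, string> pages)
    {
        var results = new ConcurrentDictionary<string, int>();
        var options = new ParallelOptions
        {
            // Cap the worker count at the number of logical processors.
            MaxDegreeOfParallelism = Environment.ProcessorCount
        };

        Parallel.ForEach(pages, options, page =>
        {
            // Each iteration writes only to its own key, via a thread-safe collection.
            results[page.Key] = ParsePage(page.Value);
        });

        return new Dictionary<string, int>(results);
    }

    static void Main()
    {
        // Hypothetical input standing in for downloaded pages.
        var pages = new Dictionary<string, string>
        {
            ["page-a"] = "<h2>Climbing Directory</h2>",
            ["page-b"] = "<h2>Intro</h2>",
        };

        foreach (var entry in ParseAll(pages))
        {
            Console.WriteLine($"{entry.Key}: {entry.Value}");
        }
    }
}
```

This only helps for CPU-bound parsing that shares no mutable state between pages; any shared objects used inside the loop body would need to be safe for concurrent use.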

@derekantrican
Author

derekantrican commented Nov 13, 2022

@santoro-mariano Yup. In the repo I linked earlier, that's used here: https://github.com/derekantrican/MountainProject/blob/master/MountainProjectDBBuilder/Program.cs#L286

I could try parallelizing more (in the Parsers file I linked above).

Since I originally posted this, the greatest improvement in speed has come from moving to .NET Core from .NET Framework. That pretty much cut the entire time in half!

@FlorianRappl FlorianRappl added this to the vNext milestone Jan 7, 2023

3 participants