Improve parsing speed for "small" URLs #48

Open

karwa opened this issue May 21, 2021 · 0 comments
karwa commented May 21, 2021

The URL parser can be viewed as 2 steps:

  1. Scan the URL string and check that we can write it as a normalised URL string.
  2. Allocate storage and write the normalised URL string to it.

This leaves out one important consideration: we don't know the size of the normalised URL string until we try to write it. It may be a lot larger than the input if anything needs to be percent-encoded (percent-encoding can triple the length, since each encoded byte becomes 3 characters), or it may be a lot smaller if the input contains a complex path which can be simplified, if some components come from the base URL, etc.
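To make that growth bound concrete, here's a quick illustration using Foundation's percent-encoding API (WebURL has its own encoder; Foundation is used here purely for demonstration):

```swift
import Foundation

// Each byte outside the allowed set becomes 3 characters ("%XX"),
// so percent-encoding can grow a string up to 3x.
let raw = "a b"
let encoded = raw.addingPercentEncoding(withAllowedCharacters: .alphanumerics)!
print(encoded)  // "a%20b" — 1 byte became 3
```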

In order to provide scalable, predictable performance, the allocation & writing step above is broken down into 2 sub-steps (see ParsedURLString.constructURLObject):

2a. Perform a dry-run, writing the URL using a StructureAndMetricsCollector, so we know the precise capacity required.
2b. Allocate storage with that capacity, and write again, this time for real.
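To illustrate the shape of this technique, here's a self-contained Swift sketch. It's deliberately simplified (not WebURL's actual URLWriter/StructureAndMetricsCollector API): one generic writing routine drives both passes, with lowercasing standing in for real normalisation:

```swift
protocol URLWriter {
    mutating func writeBytes(_ bytes: [UInt8])
}

// Pass 2a: dry-run writer — counts bytes without allocating the result.
struct CapacityCounter: URLWriter {
    private(set) var requiredCapacity = 0
    mutating func writeBytes(_ bytes: [UInt8]) { requiredCapacity += bytes.count }
}

// Pass 2b: writes for real, into storage allocated exactly once.
struct BufferWriter: URLWriter {
    var storage: [UInt8] = []
    init(capacity: Int) { storage.reserveCapacity(capacity) }
    mutating func writeBytes(_ bytes: [UInt8]) { storage.append(contentsOf: bytes) }
}

// The writing algorithm is generic over the writer, so the exact same
// code performs both the measurement pass and the real pass.
func writeNormalized<W: URLWriter>(_ components: [String], to writer: inout W) {
    for component in components {
        writer.writeBytes(Array(component.lowercased().utf8))  // toy "normalisation"
    }
}

let components = ["HTTP://", "Example.COM", "/path"]

var counter = CapacityCounter()
writeNormalized(components, to: &counter)           // pass 2a: measure

var writer = BufferWriter(capacity: counter.requiredCapacity)
writeNormalized(components, to: &writer)            // pass 2b: write

print(String(decoding: writer.storage, as: UTF8.self))  // "http://example.com/path"
```

Because both passes run the same writing code, the measured capacity can't drift out of sync with what the real pass produces.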

This has the really nice property that parsing any URL, no matter how large or complex, results in only a single heap allocation (the result). The dry-run also stores a bunch of flags on the side which can make the second writing pass faster (e.g. whether we can skip percent-encoding for a particular component).

There are other approaches we could take: for example, we could just make a storage object, guess the required capacity, and keep append-ing to it. Unfortunately, this can lead to repeated copies and allocations, and in the worst case has quadratic complexity (that's why the standard library advises you to reserve capacity in advance if appending to Array/String in a loop).
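For reference, this is the pattern the standard library documentation recommends when the final size is known (or can be bounded) in advance:

```swift
// Without reserved capacity, each append may trigger a reallocation and a
// full copy of the existing contents; reserving up front avoids that.
var output: [UInt8] = []
output.reserveCapacity(1024)  // one allocation
for byte in 0..<1024 {
    output.append(UInt8(byte % 256))
}
```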

The way Array and String solve this is by using a geometric growth strategy; this reserves a bunch of extra capacity on each reallocation to try to amortise the cost. It's based on the idea that if you're appending a lot of elements to your array, you're probably going to keep on appending a lot. While this makes sense for appending to a general-purpose type such as Array, which people use in all sorts of varied situations, I'm not convinced that it makes sense for parsing a URL string.

Another approach would be for the first writing pass (currently a dry-run) to take a guess at the required storage capacity, at least for small URLs. If the guess is correct, great! We save a second writing pass and cut parsing time roughly in half. If it isn't, we've wasted an allocation, but we can still continue calculating the actual required capacity. The second pass can then allocate storage of the correct size, copy the partial result from the "guess" storage, and resume writing from where the first pass ran out of space.
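A rough sketch of how that could look, reusing the toy URLWriter types from the sketch above (illustrative names only; and where this simplified fallback re-writes from the start, a real implementation would resume mid-write):

```swift
// Optimistic first pass: fill a fixed-size "guess" buffer; once it is full,
// keep counting so we still learn the exact capacity when the guess is wrong.
struct OptimisticWriter: URLWriter {
    let guessCapacity: Int
    var buffer: [UInt8] = []
    private(set) var totalBytes = 0  // exact size, counted past the guess

    init(guessCapacity: Int) {
        self.guessCapacity = guessCapacity
        buffer.reserveCapacity(guessCapacity)
    }

    var guessWasEnough: Bool { totalBytes <= guessCapacity }

    mutating func writeBytes(_ bytes: [UInt8]) {
        let spaceLeft = guessCapacity - buffer.count
        if spaceLeft > 0 {
            buffer.append(contentsOf: bytes.prefix(spaceLeft))
        }
        totalBytes += bytes.count
    }
}

var first = OptimisticWriter(guessCapacity: 200)
writeNormalized(components, to: &first)

let result: [UInt8]
if first.guessWasEnough {
    result = first.buffer  // small URL: done in a single pass
} else {
    // Guess was too small, but we now know the exact capacity, so the
    // second pass still allocates exactly once.
    var second = BufferWriter(capacity: first.totalBytes)
    writeNormalized(components, to: &second)
    result = second.storage
}
print(String(decoding: result, as: UTF8.self))
```

For small URLs, the happy path touches the data once and allocates once; large URLs pay one wasted (but bounded) allocation while keeping the exact-capacity guarantee.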

Profiling shows that writing the normalised URL accounts for ~80-90% of execution time, split roughly evenly between the dry-run and the write to storage. If small URLs can avoid the second pass, parsing them could be up to 2x as fast as it is today. Large URLs will likely get a bit slower due to the wasted allocation, but since they'll also be able to skip a lot more work on the second pass, I expect the slowdown to be much less than 2x (maybe 1.3x-1.5x, at a guess). Overall, that sounds like a win.

And when I say "small", I mean <200 bytes. For comparison, the reddit URL below is 153 bytes, and just by looking at it, it's longer than most of the URLs you're likely to parse in typical application scenarios. So the vast majority of URLs should see a big speed increase.

https://www.reddit.com/r/mildlyinteresting/comments/lvns4s/this_imported_salmon_so_tightly_wrapped_in/gpcw2uu?utm_source=share&utm_medium=web2x&context=3