StreamMap slow? #353

Open · lmcd opened this issue Aug 19, 2022 · 7 comments

lmcd commented Aug 19, 2022

This is a WIP issue as I plan to investigate further.

I'm working on a swift-nio project with an emphasis on large file transfers. Getting a stream of bytes out of the server is very fast, but uploading a stream is around 3x slower, even when the incoming buffers aren't handled or stored at all.

I've profiled in Instruments and see a lot of traffic (around 15%) in StreamMap that's just trying to pick a struct out of a collection based on a stream ID. At first glance, it seems to me that we can probably do away with CircularBuffer in favour of something simpler. I think the fact that stream IDs start at zero and increment from there (but may arrive out of order) potentially opens up some optimisation opportunities.

Note that this performance bottleneck was noticed in debug builds. I'll report back with release build findings.

Lukasa (Contributor) commented Aug 19, 2022

Thanks for filing this: I think the debug builds are vastly overestimating the cost of the StreamMap. CircularBuffer is an abstraction over Array, and so it adds extra layers of performance cost that are not optimised out in debug builds. In release builds, however, that cost vanishes.

> I think the fact that stream IDs start at zero and increment from there

Incidentally, this is why CircularBuffer is used: it allows the storage to be totally ordered, without needing to be frequently "compacted".
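
To make that access pattern concrete, here's a toy sketch (assuming CircularBuffer is imported from NIOCore, where it lives in recent SwiftNIO versions): because stream IDs only grow, new streams are appended at the back and completed streams are popped from the front, so the storage stays ordered without anything being shifted or compacted.

```swift
import NIOCore  // assumed home of CircularBuffer in recent SwiftNIO versions

// Toy illustration only: streams open in increasing ID order and (in the
// common case) complete oldest-first, so the buffer stays sorted by ID.
var activeStreams = CircularBuffer<Int32>()
for id in stride(from: Int32(1), through: 9, by: 2) {
    activeStreams.append(id)               // append newly opened streams at the back
}
let finished = activeStreams.removeFirst() // drop the oldest stream: O(1), no compaction
print(finished, Array(activeStreams))      // 1 [3, 5, 7, 9]
```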

Lukasa (Contributor) commented Aug 19, 2022

One note: we could replace CircularBuffer with Deque from Swift Collections.
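
For reference, a tiny usage sketch, assuming the DequeModule product of the swift-collections package:

```swift
import DequeModule  // Deque ships in the swift-collections package

// Deque gives the same cheap append/removeFirst at both ends as CircularBuffer,
// so it could plausibly slot in as the backing storage.
var streamIDs: Deque<Int32> = [1, 3, 5]
streamIDs.append(7)
let oldest = streamIDs.removeFirst()  // 1
```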

lmcd (Author) commented Aug 19, 2022

Just reading the header comment in StreamMap (which I should have done before posting this issue, sorry): it notes that a tradeoff was made that results in slower lookups, but it assumes the number of streams is low. I think if you're totally blasting a server with many gigabytes per second of uploads over a fast connection, which is our use case, the number of streams per connection could actually wind up being quite substantial. (Edit: I think I may have misunderstood how HTTP/2 streams work here.)

It also dismisses dictionaries due to the computational load of computing hashes. But couldn't we just make HTTP2StreamID Hashable and simply provide the underlying Int32 as the hashValue?
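
For illustration, a minimal sketch of that idea, using a hypothetical simplified stand-in for HTTP2StreamID that just wraps an Int32 (the real type in swift-nio-http2 has more to it than this):

```swift
// Hypothetical, simplified stand-in for HTTP2StreamID.
struct StreamID: Hashable {
    var networkID: Int32

    // Feed the underlying Int32 straight into the hasher, so hashing a
    // stream ID costs about as much as hashing a single integer.
    func hash(into hasher: inout Hasher) {
        hasher.combine(self.networkID)
    }
}

// A dictionary keyed by stream ID then becomes straightforward:
var bytesReceived: [StreamID: Int] = [:]
bytesReceived[StreamID(networkID: 1)] = 1024
```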

lmcd (Author) commented Aug 19, 2022

> Thanks for filing this: I think the debug builds are vastly overestimating the cost of the StreamMap. CircularBuffer is an abstraction over Array, and so it adds extra layers of performance cost that are not optimised out in debug builds. In release builds, however, that cost vanishes.

I've noticed release builds being way faster too, so maybe all of this is a non-issue. I'll check when I have time 👍

Lukasa (Contributor) commented Aug 19, 2022

> It also dismisses dictionaries due to the computational load of computing hashes. But couldn't we just make HTTP2StreamID Hashable and simply provide the underlying Int32 as the hashValue?

This totally works, but the tradeoff isn't straightforward. For small dictionaries (in this case, fewer than 100 elements), it's cheaper to do a linear array scan to find the element than it is to hash the key and do a jump. This is fairly non-intuitive but comes down to the way branch predictors and cache preloaders work: the memory subsystem can observe and optimise this access pattern, meaning that we're never waiting for memory, in a way that it cannot do with hashing. The data structure is therefore optimised for small size. For larger sizes, we do binary search instead, which is still extremely cheap.
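
As a rough, self-contained sketch of that strategy (not the actual StreamMap code, and the cutoff of 100 is purely illustrative): keep the entries sorted by stream ID, scan linearly while the map is small, and binary-search once it grows.

```swift
// Toy version of the lookup strategy described above, not the real StreamMap.
struct StreamRecord {
    var streamID: Int32
    var bytesReceived: Int
}

struct ToyStreamMap {
    private var entries: [StreamRecord] = []  // kept sorted by streamID
    private let linearScanLimit = 100         // illustrative cutoff only

    func firstIndex(of streamID: Int32) -> Int? {
        if self.entries.count < self.linearScanLimit {
            // Small map: a predictable linear walk that the branch predictor
            // and prefetcher handle well, so it tends to beat hashing here.
            return self.entries.firstIndex { $0.streamID == streamID }
        }

        // Larger map: classic binary search over the sorted entries.
        var low = 0
        var high = self.entries.count
        while low < high {
            let mid = (low + high) / 2
            if self.entries[mid].streamID == streamID {
                return mid
            } else if self.entries[mid].streamID < streamID {
                low = mid + 1
            } else {
                high = mid
            }
        }
        return nil
    }
}
```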

lmcd (Author) commented Aug 19, 2022

Thanks for the technical explanation! I'm glad a lot of thought went into this.

Diving a bit deeper, I'm seeing that a lot of the work is actually coming from debugOnly calls, so yeah, it seems like a false alarm. Notably, setting _backingCheck. Now I'm wondering why _UInt24 is used, but that's a rabbit hole I should probably avoid for today!

weissi (Member) commented Aug 20, 2022

> Thanks for the technical explanation! I'm glad a lot of thought went into this.
>
> Diving a bit deeper, I'm seeing that a lot of the work is actually coming from debugOnly calls, so yeah, it seems like a false alarm. Notably, setting _backingCheck. Now I'm wondering why _UInt24 is used, but that's a rabbit hole I should probably avoid for today!

The _backingCheck is just doing some (as you noticed) debug-build-only checking that may help to catch bugs early. It uses _UInt24 because we happened to have 24 spare bits in the data structure, so I thought: why not (in debug builds) make use of those bits and store some extra information that we can then verify all the time? If I had used a normal UInt32, the data structure would've grown (in both debug and release builds), and if I had used UInt16, we'd still have wasted 8 bits.

Concretely, those 24 bits store the length of the buffer so that we can detect if some code illegally stores and reuses buffer offsets across resizes. That's illegal because (like many other data structures) the actual indexes get invalidated if you add/remove elements.
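
If it helps, here's a very loose sketch of that idea (not NIO's actual code; the 24-bit field is replaced with a plain Int for readability): the index remembers the size of the buffer it was created against, and a debug-only assertion rejects it once the buffer has been resized.

```swift
// Loose sketch only, not NIO's implementation.
struct CheckedIndex {
    var offset: Int
    var backingCount: Int  // stands in for the 24-bit _backingCheck field
}

struct CheckedBuffer<Element> {
    private var storage: [Element] = []

    mutating func append(_ element: Element) {
        self.storage.append(element)
    }

    func index(at offset: Int) -> CheckedIndex {
        CheckedIndex(offset: offset, backingCount: self.storage.count)
    }

    subscript(index: CheckedIndex) -> Element {
        // assert() only fires in debug builds; in release builds it compiles
        // away, so this check is free there.
        assert(index.backingCount == self.storage.count,
               "index was created against a buffer of a different size")
        return self.storage[index.offset]
    }
}
```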
