go: reader: make ordered iteration faster #1168

Merged · 37 commits merged into main · May 14, 2024
Conversation

james-rms
Collaborator

@james-rms james-rms commented May 7, 2024

Changelog

  • Added a Message.PopulateFrom([]byte) error method, which allows readers to avoid reallocating Message structs between messages.
  • Added MessageIterator.Next2(*Message) (*Schema, *Channel, *Message, error) method to the MessageIterator interface. This method allows the caller to re-use memory between message reads, and to avoid allocating a new Message on the heap for every message read.
  • Optimized indexed reading to be substantially faster and to use much less memory.

Docs

  • Updated docstrings
  • Updated examples

Description

This PR makes message iteration much faster. The included benchmark shows a pretty significant speedup:

Before

Note: the comparison benchmark based on main uses MessageIterator.Next() to read messages.

$ go test . -run=^$ -bench=BenchmarkReader -benchmem -benchtime=10x 
goos: darwin
goarch: arm64
pkg: github.com/foxglove/mcap/go/mcap
BenchmarkReader/inorder/no_index-8                    10         532682221 ns/op         238.83 MB/s       7826155 msg/s        533306349 B/op   8054975 allocs/op
BenchmarkReader/inorder/index_file_order-8            10        1886590288 ns/op          67.29 MB/s       2204813 msg/s        909884028 B/op  12114339 allocs/op
BenchmarkReader/inorder/index_time_order-8            10        2248067917 ns/op          54.39 MB/s       1782145 msg/s        909889379 B/op  12114382 allocs/op
BenchmarkReader/inorder/index_rev_order-8             10        2324738488 ns/op          48.90 MB/s       1602346 msg/s        910261216 B/op  12114355 allocs/op
BenchmarkReader/inorder/bare_lexer-8                  10         196806788 ns/op         660.42 MB/s      21640757 msg/s        17005039 B/op       4672 allocs/op
BenchmarkReader/minor/no_index-8                      10         509497992 ns/op         241.92 MB/s       7927082 msg/s        531637254 B/op   8054932 allocs/op
BenchmarkReader/minor/index_file_order-8              10        1837735883 ns/op          66.97 MB/s       2194637 msg/s        909846889 B/op  12114373 allocs/op
BenchmarkReader/minor/index_time_order-8              10        2250390946 ns/op          54.82 MB/s       1796497 msg/s        909844632 B/op  12114332 allocs/op
BenchmarkReader/minor/index_rev_order-8               10        2360883250 ns/op          53.23 MB/s       1744292 msg/s        910212621 B/op  12114308 allocs/op
BenchmarkReader/minor/bare_lexer-8                    10         195830417 ns/op         638.20 MB/s      20912477 msg/s        15341232 B/op       4655 allocs/op
BenchmarkReader/major/no_index-8                      10         510946658 ns/op         241.74 MB/s       7921189 msg/s        532934768 B/op   8054945 allocs/op
BenchmarkReader/major/index_file_order-8              10        1841807000 ns/op          66.35 MB/s       2174050 msg/s        909833931 B/op  12114348 allocs/op
BenchmarkReader/major/index_time_order-8              10        2247866758 ns/op          54.53 MB/s       1786987 msg/s        909836941 B/op  12114379 allocs/op
BenchmarkReader/major/index_rev_order-8               10        2328824133 ns/op          51.12 MB/s       1675101 msg/s        910215724 B/op  12114370 allocs/op
BenchmarkReader/major/bare_lexer-8                    10         198086167 ns/op         632.44 MB/s      20724011 msg/s        16635893 B/op       4661 allocs/op
PASS
ok      github.com/foxglove/mcap/go/mcap        248.060s

After

% go test . -run=^$ -bench=BenchmarkReader -benchmem -benchtime=10x 
goos: darwin
goarch: arm64
pkg: github.com/foxglove/mcap/go/mcap
BenchmarkReader/inorder/no_index-8                    10         209814421 ns/op         596.62 MB/s      19550071 msg/s        17491784 B/op       6310 allocs/op
BenchmarkReader/inorder/index_file_order-8            10         340508775 ns/op         360.08 MB/s      11799088 msg/s        10446040 B/op      16981 allocs/op
BenchmarkReader/inorder/index_time_order-8            10         341343088 ns/op         359.00 MB/s      11763672 msg/s        10443932 B/op      16955 allocs/op
BenchmarkReader/inorder/index_rev_order-8             10         348526088 ns/op         356.74 MB/s      11689775 msg/s         9996309 B/op      16964 allocs/op
BenchmarkReader/inorder/bare_lexer-8                  10         187405846 ns/op         664.57 MB/s      21776806 msg/s        17439823 B/op       4674 allocs/op
BenchmarkReader/minor/no_index-8                      10         211110267 ns/op         587.06 MB/s      19236916 msg/s        16522652 B/op       6284 allocs/op
BenchmarkReader/minor/index_file_order-8              10         356903517 ns/op         336.77 MB/s      11035283 msg/s        10419253 B/op      17006 allocs/op
BenchmarkReader/minor/index_time_order-8              10         552369746 ns/op         218.44 MB/s       7157955 msg/s        10444996 B/op      17744 allocs/op
BenchmarkReader/minor/index_rev_order-8               10         555191658 ns/op         220.27 MB/s       7217936 msg/s         9971279 B/op      17665 allocs/op
BenchmarkReader/minor/bare_lexer-8                    10         194812112 ns/op         554.57 MB/s      18172347 msg/s        16473271 B/op       4670 allocs/op
BenchmarkReader/major/no_index-8                      10         211406192 ns/op         579.88 MB/s      19001450 msg/s        17365727 B/op       6291 allocs/op
BenchmarkReader/major/index_file_order-8              10         354124750 ns/op         347.12 MB/s      11374355 msg/s        10418725 B/op      16979 allocs/op
BenchmarkReader/major/index_time_order-8              10         566783688 ns/op         215.38 MB/s       7057431 msg/s        16452847 B/op      17690 allocs/op
BenchmarkReader/major/index_rev_order-8               10         563155871 ns/op         218.15 MB/s       7148236 msg/s        15986112 B/op      17699 allocs/op
BenchmarkReader/major/bare_lexer-8                    10         195610721 ns/op         633.25 MB/s      20750327 msg/s        17316992 B/op       4672 allocs/op
PASS
ok      github.com/foxglove/mcap/go/mcap        68.716s

For the unindexed message iterator, all of the difference comes from:

  • being able to re-use a Message struct between calls to Next()
  • storing channels and schemas in slices rather than maps
  • using and re-using an internal buffer for lexing MCAP records
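The map-to-slice switch for channel/schema lookup can be sketched as follows. This is a simplified illustration of the technique, not the PR's actual code; since channel IDs are uint16, a slice indexed by ID turns each lookup into a bounds check plus an array index, which is much cheaper than a map access in a hot loop:

```go
package main

import "fmt"

type Channel struct {
	ID    uint16
	Topic string
}

// channelSlice stores channels indexed by ID, growing on demand.
type channelSlice []*Channel

// Set grows the slice as needed so that c.ID is a valid index, then stores c.
func (s *channelSlice) Set(c *Channel) {
	for int(c.ID) >= len(*s) {
		*s = append(*s, nil)
	}
	(*s)[c.ID] = c
}

// Get returns the channel for id, or nil if none has been registered.
func (s channelSlice) Get(id uint16) *Channel {
	if int(id) >= len(s) {
		return nil
	}
	return s[id]
}

func main() {
	var chans channelSlice
	chans.Set(&Channel{ID: 3, Topic: "/imu"})
	fmt.Println(chans.Get(3).Topic)    // /imu
	fmt.Println(chans.Get(9) == nil)   // true
}
```

In the worst case (every possible ID in use), a full 2^16-entry slice of pointers costs 512 KiB on a 64-bit platform, which matches the estimate discussed later in this thread.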

For the indexed message iterator, we do all of the same things plus:

  • We maintain a pool of decompressed chunk buffers, which are re-used after all of the messages from a given chunk are read out.
  • We no longer read message indexes from the file; instead we read the chunk content and build our own message index in memory. This lets us read files written without message indexes in order, and it also reduces I/O, which in some cases (slow network connections with large message index overheads) is likely faster.
  • We no longer use a heap to maintain order; instead we maintain a sorted array of unread message indexes. Every time we encounter a new chunk:
    • If the chunk does not overlap with the last, clear the message index array and write the new chunk's messages in. If they are already in order, skip the sort. This makes ordered iteration over an in-order MCAP as fast as unordered iteration.
    • If the new chunk's messages are not in order, sort them.
    • If the new chunk overlaps with the last, append the new chunk's messages and sort all unread messages.
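The chunk-handling rules above can be sketched like this. It is a toy model, not the PR's code: the real index entries also carry offsets into the decompressed chunk, and the model assumes the previous chunk's messages are fully consumed before a non-overlapping chunk is loaded:

```go
package main

import (
	"fmt"
	"sort"
)

// messageIndex is a simplified stand-in for an in-memory index entry.
type messageIndex struct {
	Timestamp uint64
}

// unread models the sorted array of unread message indexes.
type unread struct {
	indexes []messageIndex
	maxSeen uint64 // max timestamp across previously loaded chunks
}

func isSorted(idx []messageIndex) bool {
	return sort.SliceIsSorted(idx, func(i, j int) bool {
		return idx[i].Timestamp < idx[j].Timestamp
	})
}

// addChunk loads a new chunk's message indexes. Non-overlapping chunks
// replace the array; overlapping chunks append to it. Either way the sort
// is skipped when the result is already in order, which is what makes
// ordered iteration over an in-order MCAP as cheap as unordered iteration.
func (u *unread) addChunk(startTime uint64, idx []messageIndex) (sorted bool) {
	if len(u.indexes) > 0 && startTime < u.maxSeen {
		u.indexes = append(u.indexes, idx...) // overlap: merge with unread
	} else {
		u.indexes = append(u.indexes[:0], idx...) // no overlap: replace
	}
	if !isSorted(u.indexes) {
		sort.Slice(u.indexes, func(i, j int) bool {
			return u.indexes[i].Timestamp < u.indexes[j].Timestamp
		})
		sorted = true
	}
	for _, m := range idx {
		if m.Timestamp > u.maxSeen {
			u.maxSeen = m.Timestamp
		}
	}
	return sorted
}

func main() {
	var u unread
	fmt.Println(u.addChunk(0, []messageIndex{{1}, {2}, {3}})) // in order: false
	fmt.Println(u.addChunk(4, []messageIndex{{4}, {5}}))      // in order: false
	fmt.Println(u.addChunk(3, []messageIndex{{3}, {6}}))      // overlap: true
}
```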

New API justification

The issues with Next(p []byte) (*Schema, *Channel, *Message, error) that caused me to explore alternatives are:

  • The buffer p is used to store the message data, but if it isn't big enough, a new buffer is allocated for every message. A common idiom is to return the newly allocated buffer so that the caller can take ownership of it and re-use it, but the current API doesn't allow that.
  • A new Message struct is allocated on the heap for every iteration. Even if the message goes out of scope on the next loop, this still causes significant work for the garbage collector to do.

This new function signature re-uses the Message passed in, if one is passed in; if nil is passed, it allocates a new Message.
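The buffer-ownership idiom being described looks roughly like the following. The record layout here is a toy (two-byte channel ID plus payload) rather than the actual MCAP message record, and the method name is lowercased to flag that this is a sketch, not the library's PopulateFrom:

```go
package main

import (
	"encoding/binary"
	"fmt"
)

type Message struct {
	ChannelID uint16
	Data      []byte
}

// populateFrom parses a toy record into msg, reusing msg.Data.
// append(msg.Data[:0], ...) keeps the existing backing array when it is
// large enough and allocates a bigger one only when it is not. Because the
// possibly-reallocated buffer is stored back on msg, the caller retains
// ownership of it and subsequent calls reuse it.
func (msg *Message) populateFrom(record []byte) error {
	if len(record) < 2 {
		return fmt.Errorf("record too short: %d bytes", len(record))
	}
	msg.ChannelID = binary.LittleEndian.Uint16(record[:2])
	msg.Data = append(msg.Data[:0], record[2:]...)
	return nil
}

func main() {
	msg := &Message{}
	_ = msg.populateFrom([]byte{1, 0, 'h', 'i'})
	fmt.Println(msg.ChannelID, string(msg.Data)) // 1 hi
	// The second call reuses the same backing array: no new allocation.
	_ = msg.populateFrom([]byte{2, 0, 'x'})
	fmt.Println(msg.ChannelID, string(msg.Data)) // 2 x
}
```

Hanging the buffer off the Message sidesteps the problem with Next(p []byte): there is no separate caller-owned buffer that can silently stop being used after a reallocation.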

@james-rms
Collaborator Author

cc. @wkalt

@wkalt
Contributor

wkalt commented May 7, 2024

Do you have any thoughts on simplifying this to

ReadNext(msg *message.Message) error

and making the reuse of the message's data buffer implicit? The iterator would allocate a new buffer when it needs one, which is then retained for subsequent calls. I would imagine that is no different from what we are effectively telling the user to do here.

Secondly, can you quantify the cost of the schema/channel lookups relative to the data allocations (and heap operations, IIRC) and share how much each contributes? My recollection is that the main costs are in allocations and in the heap, rather than in the lookups. The lookups can also be converted to slice lookups pretty easily, either with a dynamically sized slice or (worst case) the full set of 2^16 entries, which would only take 512 KiB. I would guess the cost of the lookups, if that were done, should go to approximately zero.

Third, could you post comparisons of the iteration performance of the reader vs the lexer? IIRC the lexer is much faster, like maybe 5x on adrian's mcnasty.mcap file. I think the goal of the reader improvements should roughly be to match the performance of the lexer, so it would be useful to have that as a comparison.

Fourth, have you investigated whether it is possible to bypass the heap completely in the case of a fully-ordered chunk (which can be known from inspection of the index)? My hunch previously was that some significant savings may come from here.

Great work overall, will take a closer look/test it out shortly.

@wkalt
Contributor

wkalt commented May 7, 2024

@james-rms I'm not sure if it's useful at all but this is the code I was working on when I was digging into this previously: https://github.com/wkalt/mcap/blob/task/golang-reader-perf-improvements/go/mcap/writer_test.go#L703

The test linked there will generate a sample file. This benchmark is the one I guess I was then using for comparisons:
https://github.com/wkalt/mcap/blob/task/golang-reader-perf-improvements/go/mcap/reader_test.go#L749

I don't remember unfortunately where I left things off, except that I broke the code and then didn't get it working again. It looks to me like the ideas that patch is implementing are, a) switch to a slice-based lookup for the channels/schemas, and b) allow the user to pre-allocate structures (I doubt preallocating schema/channel actually matters at all, since those are comparatively infrequent).

My overall reaction to this PR is that I suspect we can probably maintain something closer to a traditional "iterator" interface (modulo whatever is needed for message preallocation, i.e. still returning our schema/channel pointers) while still getting all the performance improvements we want, so that in the 99% case where files are ordered, the lexer and the reader are doing identical work and are equally fast. I'm not wild about the interface changes required in this specific patch. I'm not really a fan of SchemaAndChannelForID - I think it just pushes the work our code did previously onto the user and probably results in the same number of aggregate lookups in actual use (possibly with the user maintaining their own duplicate mappings). If feasible, I would definitely prefer to keep returning those pointers and instead make this efficient by changing the backing structures to slices.

@wkalt
Contributor

wkalt commented May 7, 2024

I guess my concrete thought is, can we make this the Next signature:

Next(*Message) (*Schema, *Channel, *Message, error)

and get performance identical to the lexer, with that being the only required new interface. I suspect the answer is yes -- and if so, that seems like a better outcome.

@wkalt
Contributor

wkalt commented May 7, 2024

A rough idea for an approach might be to first examine the chunk indexes to see whether the chunks are ordered and none overlap. 99%+ of the time, I expect, this will be the case.

If that is the case, there is no heap necessary. The chunks are just consulted in order. For each chunk, look at the message index to see if there is a disordering. If there isn't (99% of the case), stream-decompress the chunk directly, like the lexer does, meaning there is no need to even materialize the decompressed chunk.

If there is a disordering, then sort the index (possibly using insertion sort if it matters -- it'll probably be almost-totally ordered already), decompress the chunk, and return messages in sorted order.

If there are overlaps between the chunks, then dispatch to the current iterator logic. I think it is probably still possible to beat the current iterator without a heap (generally not all chunks will overlap, so the strategy above can be used for most of them), but my assumption would be that overlapping chunks are so rare that leaving this unoptimized for now will be totally fine.
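The first inspection step in this idea can be sketched as follows. Field names are modeled on the MCAP chunk index record's message start/end times; note this sketch treats a chunk whose start time equals the previous chunk's end time as non-overlapping, and a real implementation may need to treat equal timestamps as overlap:

```go
package main

import "fmt"

// chunkIndex carries the time range covered by one chunk, as recorded in
// the file's chunk index records.
type chunkIndex struct {
	MessageStartTime uint64
	MessageEndTime   uint64
}

// chunksOrderedAndDisjoint reports whether the chunks are in time order
// with no overlap, i.e. whether the fast path (stream-decompress each
// chunk in sequence, no heap, no resorting) applies.
func chunksOrderedAndDisjoint(chunks []chunkIndex) bool {
	for i := 1; i < len(chunks); i++ {
		if chunks[i].MessageStartTime < chunks[i-1].MessageEndTime {
			return false
		}
	}
	return true
}

func main() {
	ordered := []chunkIndex{{0, 10}, {10, 20}, {20, 30}}
	overlapping := []chunkIndex{{0, 10}, {5, 20}}
	fmt.Println(chunksOrderedAndDisjoint(ordered))     // true
	fmt.Println(chunksOrderedAndDisjoint(overlapping)) // false
}
```

This check is O(number of chunks) and runs once, up front, against metadata that is already in memory after reading the summary section.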

@james-rms
Collaborator Author

james-rms commented May 8, 2024

Regarding the signature - I've pushed an update which makes the API ReadInto(*Message) (*Schema, *Channel, error). I can't see a point in returning the message pointer if it's always going to be the same as the passed-in pointer.

**EDIT**: Previously this comment explained at length why the cost of looking up schema and channel was too great, but I found and removed a duplicate lookup, and the difference is now so minor that I'm convinced.

@james-rms james-rms force-pushed the jrms/faster-message-iterator branch from 331438f to 326a0b9 Compare May 8, 2024 04:51
@james-rms
Collaborator Author

Also re: testing against the lexer: I have a rough estimate here:

BenchmarkReader/lexer-8                    10         175206304 ns/op         173.64 MB/s       5689735 msg/s        17305244 B/op       1260 allocs/op

@james-rms james-rms requested a review from wkalt May 8, 2024 05:41
@wkalt
Contributor

wkalt commented May 8, 2024

@james-rms I think there is definitely value in returning the message pointer because some users may prefer to allocate on each call, i.e by passing in a nil as the input pointer.

This would be useful if they want to pass the messages through some kind of processing pipeline and aren't too concerned about maximal perf. I'm assuming that in a future version we'll want to deprecate & remove the old interface and push users toward this one.

@wkalt
Contributor

wkalt commented May 8, 2024

Is that benchmark you posted for the lexer comparable to the other two posted earlier? That seems like a great basis for comparison

@wkalt
Contributor

wkalt commented May 13, 2024

Final minor suggestion - I think we should change "major" and "minor" in these benchmark names to "major_overlap" and "minor_overlap" for clarity.

Anyway, final results from me:

BenchmarkReader/inorder/no_index-32              100     210357017 ns/op         609.76 MB/s      19980710 msg/s        18784082 B/op       6300 allocs/op
BenchmarkReader/inorder/index_file_order-32          129         180875585 ns/op         677.58 MB/s      22202914 msg/s        10451106 B/op      16981 allocs/op
BenchmarkReader/inorder/index_time_order-32      130     180912838 ns/op         675.79 MB/s      22144266 msg/s        10451253 B/op      16989 allocs/op
BenchmarkReader/inorder/index_rev_order-32           128         185554867 ns/op         647.86 MB/s      21229217 msg/s         9993184 B/op      16985 allocs/op
BenchmarkReader/inorder/bare_lexer-32                122         192062239 ns/op         600.44 MB/s      19675251 msg/s        18732559 B/op       4671 allocs/op
BenchmarkReader/minor/no_index-32                    100         215605448 ns/op         499.60 MB/s      16371129 msg/s        15390843 B/op       6275 allocs/op
BenchmarkReader/minor/index_file_order-32            132         178575332 ns/op         673.93 MB/s      22083281 msg/s        10418086 B/op      16978 allocs/op
BenchmarkReader/minor/index_time_order-32             68         362007549 ns/op         347.44 MB/s      11384992 msg/s        10447251 B/op      17721 allocs/op
BenchmarkReader/minor/index_rev_order-32              69         355197195 ns/op         353.53 MB/s      11584413 msg/s         9977055 B/op      17704 allocs/op
BenchmarkReader/minor/bare_lexer-32                  124         197898411 ns/op         586.82 MB/s      19229078 msg/s        15341271 B/op       4651 allocs/op
BenchmarkReader/major/no_index-32                    100         228166159 ns/op         550.78 MB/s      18048103 msg/s        16619558 B/op       6283 allocs/op
BenchmarkReader/major/index_file_order-32            132         176790090 ns/op         671.87 MB/s      22016043 msg/s        10418646 B/op      16992 allocs/op
BenchmarkReader/major/index_time_order-32             63         371470199 ns/op         342.28 MB/s      11215845 msg/s        16457851 B/op      17714 allocs/op
BenchmarkReader/major/index_rev_order-32              66         356158503 ns/op         356.68 MB/s      11687706 msg/s        15990102 B/op      17716 allocs/op
BenchmarkReader/major/bare_lexer-32                  120         198373003 ns/op         633.73 MB/s      20766082 msg/s        16569771 B/op       4656 allocs/op
PASS
ok      github.com/foxglove/mcap/go/mcap        513.643s

Behavior is close on files with small overlap, and we still have a gap with heavy overlap as expected. Looks great!

@defunctzombie
Contributor

> I wonder if there's a good reason to avoid that

In golang, all dependencies of a project have to use the same version? So if you make API-incompatible changes, then some dependency that also needs the mcap library will break if not updated to the latest? This is why golang uses the folder major-versioning approach, IIRC? I have not done enough golang to say whether it's better to roll the major or adopt the /v2 strategy. Can you point to examples of popular libraries that have added /v2 to their APIs rather than rolling a major? Or some other approach?

@wkalt
Contributor

wkalt commented May 13, 2024

@defunctzombie I'm not sure I'm quite following - hopefully this will be helpful -

  • All dependencies of a project in go do not need to use the same version. They can use whatever tagged version they desire.
  • Dependency updates are opt-in, so shipping a new version does not force any upgrade on users.
  • There is no distinction between releasing a major version of the lib and adopting the /v2 suffix on import paths. The way to release a major would be to bump the major version in the tag, and do the /v2 suffix. So it's not a "rather than" - you'd do both.
  • There are instructions here: https://go.dev/wiki/Modules#releasing-modules-v2-or-higher. Of the two approaches mentioned, you want to use the first one. The second approach was only relevant during a transitory period during adoption of modules, for projects that needed to support older versions of go.
  • An example of a go project on a version greater than 1 is https://github.com/pierrec/lz4
  • Because MCAP has released a 1.y.z already, and because you guys like semver, you don't have a choice about the v2 suffixing. If you bump the tag without suffixing the import path, the new tag won't be consumable.
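Concretely, the first approach from that wiki page amounts to suffixing the module path in go.mod (the path below is illustrative, based on the repo layout implied by the benchmarks above):

```
// go.mod (module path gains a /v2 suffix)
module github.com/foxglove/mcap/go/mcap/v2
```

Consumers would then import "github.com/foxglove/mcap/go/mcap/v2", and, if I recall the tagging convention for modules in subdirectories correctly, releases would be tagged with the module's path prefix, e.g. go/mcap/v2.0.0.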

While it isn't an option for MCAP, numerous projects choose to ignore SIV (semantic import versioning, this suffixing stuff) for various reasons. The way to do this is to adopt a convention that isn't real semver, such as keeping a permanent 0 or 1 major version and communicating breaking changes in other ways. Foxglove has dependencies that take this approach as well. I don't recall which offhand (protobuf, maybe?), but I expect you might find a few if you executed go get -u in the data platform project and ran the tests. I generally have no expectation that my dependencies adhere to semver, whether or not they claim to, so I consider this fine as well.

The point of SIV is to allow consumers to use multiple major versions of a lib at the same time. That's not a requirement I've ever encountered so to me personally, it's an inconvenience to update paths, docs, etc just for a version bump.

@wkalt
Contributor

wkalt commented May 13, 2024

I think a good argument against rolling a major update into this PR is that, to avoid update churn, it would be best if the major were specifically planned so that any breaking changes we want can be bundled together. Due in part to SIV, multiple major upgrades in a row are annoying for consumers.

@james-rms james-rms enabled auto-merge (squash) May 14, 2024 23:16
@james-rms james-rms merged commit 2473746 into main May 14, 2024
23 checks passed
@james-rms james-rms deleted the jrms/faster-message-iterator branch May 14, 2024 23:18