Rewrite Vector (now "radix-balanced finger tree vectors"), for performance #8534
Conversation
Wow, @Ichoran what are your thoughts on this?

Wow, this looks fantastic, Stefan! I don't have time right now to read through all the code, but the bits that I sampled all look reasonable (fairly repetitive, but that's expected given the nature of the data structure). The algorithmic design seems sound in concept; I'll try it on collection-laws when I get a chance to see how it holds up in practice. Anyway, this is great! Maybe we could introduce it as a separate dependency in the meantime so we don't have to wait for 2.14 to get it?

This looks really impressive, Stefan! Can we swap the implementations in the 2.13 series? If not, maybe make it the standard for 3.0?
* the prefix). The level is increased/decreased when the affected side plus main data is already full/empty
* - All arrays are left-aligned and truncated
*
* In addition to the data slices (`prefix`, `prefix2`, ..., `dataN`, ..., `prefix2`, `prefix1`) we store a running
Is there any reference we may cite regarding the design and implementation of radix-balanced finger trees? For instance, is this ICFP article closely related?
As I mentioned above, I am not aware of any pre-existing references. The Bolivar paper (I found a PDF here) covers RRB trees, like the two papers I referenced in the PR description. My implementation uses non-relaxed trees. Relaxed trees can greatly improve insertion and concatenation performance but will show worse results for indexed access. I think it would be interesting to see if RBF trees can be generalized to RRBF trees. I expect it to be possible in principle, but with higher implementation complexity than for a relaxed version of normal RB trees.
The "Off-Tree Tail" and "Left Shadow" optimizations from section 2.3 are worth mentioning in this context. The fingers are a generalization of an off-tree tail. While the off-tree tail is only the last one-dimensional slice, a finger extends this to all dimensions from 1 to n-1 for an n-dimensional vector. The left shadow is implemented by the old Vector. The finger on the left now provides the same advantages with a more compact memory representation.
Should this be (`prefix`, `prefix2`, ..., `dataN`, ..., `suffix2`, `suffix1`)?
No, it should be (`prefix1`, `prefix2`, ..., `dataN`, ..., `suffix2`, `suffix1`). I got 3 of 5 names wrong...
Amazing results!

Very impressive! 🚀
We ought to warm up the benchmarks in a way that ensures we measure the megamorphic dispatch of callsites that range over the different vector implementations. Right now, I see inlining logs like:
A quick and dirty way to see the effect is to disable JMH's forking and just run all experiments in its control JVM.
There are a few techniques to harden the benchmark against this problem. If the only virtual callsites you want to protect are all in the benchmark code itself, you can indirect them through a private method that bears the `@CompilerControl(Mode.DONT_INLINE)` annotation.

If you are benchmarking calls into library code that contains the virtual call sites, you need to warm them up with various subclasses (in this case, various sizes of collection). JMH has a command line option that lets you specify a set of benchmark methods to run during the warmup phase. Adding this to our benchmark run warms up all sizes in the same JVM before running the measurement for a particular size. However, it will still use a fresh JVM for each size, which means the warmup is repeated for every fork. Perhaps the simplest option for now would be some code in the benchmark setup that touches all the sizes.
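For illustration, the first technique could look like this in a JMH benchmark (a sketch; the benchmark shape and names are hypothetical, while `@CompilerControl` is JMH's real annotation):

```scala
import org.openjdk.jmh.annotations._

@State(Scope.Benchmark)
class DispatchBench {
  // Several sizes so the call site below sees more than one receiver shape.
  val vectors: Array[Vector[Int]] =
    Array(Vector(1), Vector.tabulate(1000)(identity), Vector.tabulate(100000)(identity))

  // DONT_INLINE keeps this a real call, so `v(i)` stays a genuinely virtual
  // call site instead of being inlined and specialized for one receiver.
  @CompilerControl(CompilerControl.Mode.DONT_INLINE)
  def applyAt(v: Vector[Int], i: Int): Int = v(i)

  @Benchmark
  def megamorphicApply(): Int = {
    var sum, i = 0
    while (i < vectors.length) { sum += applyAt(vectors(i), 0); i += 1 }
    sum
  }
}
```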
* - prefix1 and suffix1 are never empty
* - Balancing does not cross the main data array (i.e. prepending never touches the suffix and appending never touches
*   the prefix). The level is increased/decreased when the affected side plus main data is already full/empty
* - All arrays are left-aligned and truncated
It would be good to encode some of these invariants in a `checkConsistent` method which we could incorporate into a ScalaCheck test suite that performs random operations and asserts that the internal structure stays consistent.
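Such a suite could look roughly like this (a sketch, assuming a hypothetical `checkConsistent` that throws when an invariant is violated; all names are illustrative):

```scala
import org.scalacheck.{Gen, Prop, Properties}

object VectorInvariantProps extends Properties("Vector invariants") {
  // Hypothetical hook into the validation logic; throws on an inconsistent structure.
  def checkConsistent(v: Vector[Int]): Unit = ()

  // A small vocabulary of random operations to exercise rebalancing.
  val genOp: Gen[Vector[Int] => Vector[Int]] = Gen.oneOf(
    Gen.choose(0, 99).map(n => (v: Vector[Int]) => v.appended(n)),
    Gen.choose(0, 99).map(n => (v: Vector[Int]) => v.prepended(n)),
    Gen.const((v: Vector[Int]) => if (v.isEmpty) v else v.tail)
  )

  property("random operations keep the structure consistent") =
    Prop.forAll(Gen.listOf(genOp)) { ops =>
      val v = ops.foldLeft(Vector.empty[Int])((acc, op) => op(acc))
      checkConsistent(v) // would throw and fail the property on a violation
      true
    }
}
```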
There are validation methods in the JUnit test. I originally had them directly in `NVector` but factored them out to keep the code size down.
case that: NVector[_] => false
case o => super.equals(o)
  }
}
We should keep optimized implementations of `equals`/`sameElements` for `Vector`/`Vector` comparisons, short-circuiting when the lengths differ and avoiding element comparison when `this` and `that` structurally share a backing array.
The old optimizations don't make sense here. The only case where you would benefit from the shortcut is when comparing the result of `map(identity)`. If the vectors were built separately, they won't share any arrays. Doing structural comparisons on each array separately might be useful. This could cover cases such as dropping the head element and then consing it back on. The implementation would be rather complex, though, and you need to take care not to compromise the more common cases. You cannot handle all arrays individually because equal vectors can be balanced differently. In the old Vector the only possible difference was the offset, but now we have to deal with all the different prefix lengths.
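For illustration, the per-array shortcut could look like this (`leafEquals` is a hypothetical helper):

```scala
// Compare one pair of leaf arrays: if they are the same array instance
// (structural sharing), skip the element-by-element comparison entirely.
def leafEquals(a: Array[AnyRef], b: Array[AnyRef]): Boolean =
  (a eq b) || java.util.Arrays.equals(a, b)
```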
@szeiger Would `(NVector.tabulate(1000)(i => i)).equals(NVector.tabulate(999)(i => i))` short-circuit based on a cheap length call, as per `Vector`?
Yes, this uses the default implementation of `sameElements`, which checks `knownSize` first (and the size of an `IndexedSeq` is always known).
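Paraphrased, the shape of that check is roughly this (a sketch, not the exact stdlib source):

```scala
// If both sizes are known and differ, answer without comparing any elements.
def sameElementsSketch[A](xs: collection.Seq[A], ys: collection.Seq[A]): Boolean = {
  val xSize = xs.knownSize
  val ySize = ys.knownSize
  if (xSize != -1 && ySize != -1 && xSize != ySize) false
  else xs.iterator.sameElements(ys)
}
```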
I ran some new benchmarks with megamorphic dispatch. The results (at http://szeiger.de/tmp/vector2mega/vector-results.html) unfortunately mean that the new vector is not always faster, which makes the argument for it much harder.
There are positive effects, too. Well, positive for my new implementation, because some operations in the old vector also suffer from the varying sizes. It can't be megamorphic method dispatch (the old Vector is a single class); my guess is that previously the same branches were taken and got highly optimized, whereas now we get more variation, which has huge performance effects. For more complex algorithms this benefits the new implementation: once the initial impact of the non-inlined call site has been absorbed, the rest is still specific to a certain dimension. The effect can be seen in sequential updates (last diagram on the page). The old vector used to be around 150μs independent of size. Now it starts at this level for small sizes but then appears to grow logarithmically (which it really shouldn't, given the algorithm). Maybe @retronym has a better idea about what's going on.
I believe you could fix this particular slowdown by using a field in the base class to store
Yes, benchmarking with polluted branch profiles (for the JIT or even the CPU branch predictor) is yet another best practice. If we want to dig into this we could start by using JITWatch to compare.
Are there any benchmarks for allocations and retained memory usage? In large apps this is as much a problem as CPU usage. The CPU improvements look good.
Yes, I had the same idea; I already did that yesterday and ran some benchmarks overnight. The results look promising: http://szeiger.de/tmp/vector2mega2/vector-results.html. Unfortunately
I didn't run any benchmarks for allocations. My expectation is that bulk operations may do some more one-time allocations (e.g. in

Any operation on the prefix or suffix (

Retained memory for small vectors should be much better than before and start to even out as instances get larger. Constant overhead is a bit higher in dimensions 5 and 6, but the small advantage for the old implementation only holds as long as the vectors are left-aligned and right-truncated (i.e. they were built by appending individual elements, not with a builder); otherwise the small constant savings are dwarfed by the padding overhead. Overall, retained memory should be very close to optimal for
I pushed an update with the

Interesting result: Using
I benchmarked some alternatives for the iterator. Of course, what this also means is that iterator performance is very dependent on warmup, and getting results within a few percent of the old implementation is probably fine.
@szeiger did you ever try a version of this new structure which does not split out implementations based on depth? I.e., just have all fields present, but perhaps
Having everything in one class would add a lot of memory overhead for small instances, even more than the old vector (but with the truncated arrays we should still come out ahead in the end). After seeing the effects of branch prediction in the benchmarks for the old Vector, I doubt that it would be beneficial for performance. If you have to switch anyway in the critical code paths, you may as well make HotSpot do it. I'll see if I can run some tests for this. It shouldn't be that hard to try it for a subset of operations.
Here's a single-class implementation: https://github.com/szeiger/scala/blob/wip/new-vector-one-class/src/library/scala/collection/immutable/NVector.scala. I started the benchmarking for
Have you tried benchmarking with Graal? Might be useful for future-proofing the work done here.
Another way to factor this out would be to keep all the subclasses for the data, but move all the implementations up into the base class. Those implementations would look a lot like the ones we have today: switch initially on something to discriminate the size, then cast to the subclass to get out the data. I would guess that the fastest way would be to have a

This is all getting close to the trickery from:

The warnings and conclusions at the end of that article are worth reading. As @smarter cautions, we don't want to overfit to C2 without also trying Graal.
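A hedged sketch of that shape (class and field names are illustrative, not the actual code):

```scala
// Data lives in the subclasses; logic lives in the base class, which
// discriminates once on a stored dimension and then casts to read the data.
abstract class VBase(val dim: Int) {
  final def lastElem: AnyRef = dim match {
    case 1 => val v = this.asInstanceOf[V1]; v.data1(v.data1.length - 1)
    case _ => val v = this.asInstanceOf[V2]; v.suffix1(v.suffix1.length - 1)
  }
}
final class V1(val data1: Array[AnyRef]) extends VBase(1)
final class V2(val prefix1: Array[AnyRef], val data2: Array[Array[AnyRef]],
               val suffix1: Array[AnyRef]) extends VBase(2)
```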
One small improvement would be to share a field for

I've also pushed a commit that makes
@szeiger What change did you make to pollute the profiles? I don't see any changes to
Just a guess, but this could be due to the new implementation using explicit bounds checking, whereas before

This suggests yet another sort of profile pollution (😅) -- calling
Here's a complete run with the latest Graal CE using mixed-size warmup: http://szeiger.de/tmp/vector2graalce/vector-results.html

TL;DR: There are no big surprises that would change the general performance assessment. The problem with the 10% lower performance in

There are some notable performance differences, and in most cases they benefit the old implementation, but these tend to be benchmarks where the new vector has a huge performance lead on HotSpot anyway, so the lead becomes smaller but it is still much faster in the end.
It is also interesting to compare these results to the original ones with single-size warmup. Most of the cases where we see a significant speed-up in Graal (
and more cleanup after review comments and linter warnings
@dwijnand could you summarize the jardiff comparison you did for this PR?
I ran subsets of the community build and compared its locally published jars with jardiff using Scala:
against 4 projects:
There's some unfortunate sbt-osgi/dbuild interaction noise going on in scala-parallel-collections, but outside of that and version-string differences, the jars were all identical. Which, in turn, means that any combination of already-built library jars and library jars rebuilt with 2.13.2 Scala would work with either the old or the new 2.13.x scala-library.jar.
If no objections are raised, I'll merge this PR in the next 1-2 days.
Has anyone tried to use the builder to create a vector of the maximum size?
@cbfiddle I would file an issue if you validate that it's an actual problem.
It appears there is another size limitation. When a Vector is created by repeated use of "appended", the prefix2 through prefix5 arrays are limited to containing 30 elements, even though the data structure and the builder permit 31. That is because these prefix arrays are initialized from the previous data array, which is limited to 30 elements, and never extended. An analogous limitation applies to prepended. The result is that appended/prepended throws an exception before the advertised maximum size of Int.MaxValue (2^31-1) is reached [see LASTWIDTH].
@cbfiddle - The design of
I haven't looked at the code, but I don't think this is possible. This would leave holes in the trie structure and mess up indexing. Are you sure? If you do "appended" 1025 times off an empty vector, then
@Ichoran I don't know if you would call a comment documentation, but the definition of LASTWIDTH says that directly. I copied and pasted the text. The existence of LASTWIDTH is an indication that the author intended to support 2^31-1 elements, but it appears that intention was not carried all the way through. I would suggest looking at the code. When a prefix is passed to the constructor, a cumulative element count is also passed in, which is used for indexing, thus avoiding the problem you describe. It appears that indexing works all the way to 2^31-1, but construction does not.
Why isn't
This is a complete rewrite of the old `scala.collection.immutable.Vector`.

I recommend the RRB tree papers by Bagwell/Rompf and Stucki as background. Like the old Vector, the new one does not use a "relaxed" radix-balanced tree, but these papers also give a good overview of the non-relaxed design and specific optimizations for it in the old Vector implementation.

Unlike other data structures that have been proposed as replacements for `Vector` as the default `immutable.IndexedSeq`, my goal was to create a proper replacement for `Vector` that does not make any performance trade-offs.

Design
My design differs from the old one in one important aspect: Instead of a single movable finger (the "display" pointers) it has a fixed finger at each end. These fingers are part of the proper data structure, not a transient optimization. There are no other ways to get to the data in the fingers (it is not part of the main data array), no focus point, and no stabilization. As a side effect, the data structure is completely immutable and no extra memory fences are required. I am not aware of existing descriptions or implementations of this data structure. Based on related structures, "radix-balanced finger tree" seems to be an appropriate name.
A second major difference lies in the implementation. Whereas the old Vector uses a single class for collections of all supported sizes, the new one is split into `Vector0` to `Vector6` for the different dimensions of the main data array. `Vector0` is a singleton object for the empty vector, `Vector1` (used for collections up to size 32) is essentially the same as `ArraySeq`, all higher dimensions are finger trees.

I prefer to think of the nested arrays containing the data as hypercubes instead of trees. For a full
`VectorN` we start with an n-dimensional hypercube `data_n` and slice off an (n-1)-dimensional `prefix_n-1` and `suffix_n-1`. We continue to slice off at the beginning of the prefix and the end of the suffix until we reach `prefix1` and `suffix1`. The length of each side of the initial hypercube is 32 (same as the old Vector). After slicing, the highest dimension of `data` can have a length up to 30, `prefix1` and `suffix1` can be up to 32, and the intermediate prefixes and suffixes up to 31. When the original data does not fill the entire hypercube, there may be empty parts both at the beginning (the offset) and the end. When the offset is 0 we call the vector "left-aligned". Notice that the slicing ensures that only the highest dimension of each data array can be incomplete. All lower dimensions always have a length of 32 and are filled entirely.

Balancing requires that
`prefix1` and `suffix1` are both non-empty. There is no obvious advantage from doing the same for higher-dimensional prefixes and suffixes; it would actually make appending and prepending more expensive. They only get rebalanced when it is necessary for `prefix1` or `suffix1`. There is no requirement that the main data array be non-empty, either. For example, when appending an element to a completely filled and left-aligned `Vector3`, the existing `prefix1`, `prefix2` and `data3` become `prefix1`, `prefix2` and `prefix3` of a new `Vector4`. `data4`, `suffix3` and `suffix2` remain empty and `suffix1` contains the new element.
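To make the layout concrete, here is a sketch of what a two-dimensional instance could look like (class and field names are illustrative, not the PR's exact declarations):

```scala
// Illustrative shape of a 2-dimensional radix-balanced finger tree vector:
// prefix1 and suffix1 are the fingers, data2 is the main two-dimensional slice.
final class Sketch2[A](
  val prefix1: Array[AnyRef],       // 1 to 32 elements, never empty
  val len1: Int,                    // prefix1.length, cached for indexing
  val data2: Array[Array[AnyRef]],  // up to 30 arrays of exactly 32 elements
  val suffix1: Array[AnyRef],       // 1 to 32 elements, never empty
  val length0: Int                  // total number of elements
) {
  def apply(index: Int): A = {
    val io = index - len1
    (if (io < 0) prefix1(index)                       // element is in the prefix finger
     else if (io < data2.length * 32) data2(io >>> 5)(io & 31)
     else suffix1(io - data2.length * 32)             // element is in the suffix finger
    ).asInstanceOf[A]
  }
}
```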
Building is done with a single `NVectorBuilder` for all dimensions. It can be initialized with an existing vector structure for more efficient concatenation. I did not get around to implementing an optimized concatenation where the right-hand-side vector determines the alignment. Like most other algorithms for this data structure, it is quite easy to visualize how it works and assure yourself that it is possible, but actually implementing it correctly is much harder. This would allow concatenation of two vectors of sizes m and n in O(min(n, m) + log(n+m)) time.

There is another builder class, `NVectorSliceBuilder`, for non-trivial slicing operations. The hypercube fingers are very well suited for slicing because any slicing will automatically result in array slices with dimensions in the correct order for a new vector. Only minor rebalancing is required in some cases.

Memory Usage
The old Vector has a huge overhead for small instances. In addition to the actual data arrays, a `Vector` object adds 56 bytes (in 64-bit HotSpot with compressed pointers). By having different classes for different dimensions, this goes down to 16 bytes for an `NVector1`, the same as an `immutable.ArraySeq`. The overhead grows up to 80 bytes for an `NVector6`. The break-even is at `NVector4`. Constant overhead for these larger sizes should be inconsequential. It could be reduced by not storing the intermediate prefix lengths, at the expense of slower access to the prefix.

Additionally, all instances built by `VectorBuilder` have their data arrays padded to 32, which constitutes an extra 124 bytes of overhead (31 unused 4-byte reference slots) for a `Vector` of size 1 that was built with a `VectorBuilder`. `NVectorBuilder` always truncates the arrays to avoid this.

Benchmarking
Benchmarks were run with

while the initialization of the unnecessary part (`v` or `nv`) was commented out. If you have more than 16G of memory you should be able to choose a larger heap size and run everything in one go.

Interactive diagrams can be found at http://szeiger.de/tmp/vector2/vector-results.html. All diagrams plot time against number of elements, both on logarithmic scales. The diagrams on that page are interactive so you can explore the exact numbers for all data points. Due to the double logarithmic scales and the huge range of instance sizes, even big differences can appear very small in the diagrams.
The diagram generation code is currently part of this draft PR. We may want to split it off into a separate project for more general use or just keep it here because it's only a small amount of code.
Performance

We can see small improvements for sequential `apply` and very similar results for random `apply`. Other than micro-optimizations, there isn't much that can be done here:

Updating with random indexes shows the expected logarithmic growth, but the new implementation is several times faster due to lower overhead:
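The cost of a random update is one copied array per dimension on the path to the element. For illustration, the path copying looks roughly like this for the two-dimensional case (a simplified sketch; the real code also handles the fingers and higher dimensions):

```scala
// Path copying: only the spine and the one leaf holding the element are
// copied; all other leaf arrays are shared with the original vector.
def updatedData2(data2: Array[Array[AnyRef]], io: Int, elem: AnyRef): Array[Array[AnyRef]] = {
  val outer = data2.clone()           // copy the spine (at most 30 references)
  val inner = outer(io >>> 5).clone() // copy the single affected 32-element leaf
  inner(io & 31) = elem
  outer(io >>> 5) = inner
  outer
}
```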
Sequential updates are the Achilles heel of the radix-balanced finger trees, at least in theory. The movable finger of the old Vector allows sequential updates in what appears to be amortized constant time (i.e. an update positions the finger at the updated position and then further updates in the vicinity can access the location more directly). Due to index computations being based on integers, which have to grow logarithmically, this is not really the case, but as long as all indexes are represented by the same 32-bit `Int` type, the performance graph is essentially flat. The new Vector cannot do this optimization and therefore the logarithmic growth is directly visible when stepping up in vector dimensions. Fortunately, it is so much faster to begin with, and the base-32 logarithm grows so slowly, that it still wins at larger sizes:

Sequential appending and prepending is efficient and has low overhead in the new Vector. The old Vector also makes it efficient by positioning the finger appropriately, but the overhead is much larger, so the new implementation ends up being several times faster:

Alternating between appending and prepending is a worst-case scenario for the old Vector because the finger always ends up in the wrong location (i.e. you get all the overhead but none of the benefit). The new implementation is about 35 to 40 times faster in this case:
All bulk appends benefit from initializing an `NVectorBuilder` with the existing left-hand-side `NVector`, plus other optimizations. Appending an `ArraySeq` of the same size is several times faster:

This is even more apparent when we repeatedly append a collection only 1/10 the size, where the new implementation is about 25 to 50 times faster in non-trivial cases:

Both implementations have optimizations for small bulk appends. Below a certain size limit for the collection being appended, individual appends are performed instead of using a builder. The new implementation optimizes this further: when the data to be appended fits completely into `suffix1` (which allows for appends up to 31 elements depending on alignment), it is done as a bulk copy into the array without the need to create either a builder or intermediate vectors. As a result, appending a collection of size 2 is about 8 to 10 times faster:

When concatenating two vectors, we can make use of the fact that both are built from array slices. We should explore generalizing this to other array-based collections in the future, but this will require new abstractions in the collections library. For now the optimization is limited to `NVector`. Instead of appending the right-hand side element by element to the `NVectorBuilder`, we can append a whole slice of 32 elements with (depending on alignment) one or two `System.arraycopy` calls. This is another 2 to 3 times faster than appending individual elements:
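A sketch of the `suffix1` fast path described above (simplified; alignment handling and the length bookkeeping of the real code are omitted):

```scala
import java.util.Arrays

// When the appended data fits into suffix1, the new suffix is built with
// plain array copies; no builder or intermediate vectors are needed.
def appendedAllSmall(suffix1: Array[AnyRef], elems: Array[AnyRef]): Array[AnyRef] = {
  require(suffix1.length + elems.length <= 32)
  val res = Arrays.copyOf(suffix1, suffix1.length + elems.length)
  System.arraycopy(elems, 0, res, suffix1.length, elems.length)
  res // becomes the suffix1 of the new vector; everything else is shared
}
```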
Iterating with `foreach` and `iterator` is pretty much the same in both implementations, with some wins for better optimization for small instance sizes in the new one:

With the balancing requirement of non-empty `prefix1` and `suffix1`, both `head` and `last` are only one array indexing operation away in the new implementation. The old Vector still does well on `head` because the instances for the benchmark were built with a `VectorBuilder` (which always results in the focus being on the first element) but shows logarithmic growth on `last`, where the new implementation is still O(1):

`tail` is 8 to 10 times faster in the new Vector. The rebalancing is amortized constant time, but it gets a bit slower for higher dimensions because the overhead for managing the instances grows:
Builder
is a bit faster for most instance sizes in the new implementation. The old one wins for very small instances because it doesn't truncate the arrays and therefore creates a huge constant overhead for small vectors (see above):map
is essentially unoptimized in the old implementation. It uses the default implementation based onIterator
+Builder
which is decently optimized and good in terms of big-O-complexity but we can still do much better by rebuilding the existing structure:The new implementation also has map-conserve semantics as an internal optimization. While a
map
call always results in a newNVector
instance, the internal arrays (at all dimensions) are shared if none of the entries change. Since arrays are treated as fully immutable there are no ownership issues. I opted for a conservative implementation which is essentially free for the non-conserving case (i.e. there is no observable performance penalty when mapping everything to a new object). The performance improvements formap(identity)
are still decent (but it could be about twice as fast with an dedicatedmapConserve
implementation):