
Rewrite Vector (now "radix-balanced finger tree vectors"), for performance #8534

Merged
merged 7 commits into scala:2.13.x on Mar 18, 2020

Conversation

szeiger
Member

@szeiger szeiger commented Nov 10, 2019

This is a complete rewrite of the old scala.collection.immutable.Vector.

I recommend the RRB tree papers by Bagwell/Rompf and Stucki as background. Like the old Vector, the new one does not use a "relaxed" radix-balanced tree, but these papers also give a good overview of the non-relaxed design and of the specific optimizations for it in the old Vector implementation.

Unlike other data structures that have been proposed as replacements for Vector as the default immutable.IndexedSeq, my goal was to create a proper replacement for Vector that would not make any performance trade-offs.

Design

My design differs from the old one in one important aspect: Instead of a single movable finger (the "display" pointers) it has a fixed finger at each end. These fingers are part of the proper data structure, not a transient optimization. There are no other ways to get to the data in the fingers (it is not part of the main data array), no focus point, and no stabilization. As a side-effect, the data structure is completely immutable and no extra memory fences are required. I am not aware of existing descriptions or implementations of this data structure. Based on related structures "radix-balanced finger tree" seems to be an appropriate name.

A second major difference lies in the implementation. Whereas the old Vector uses a single class for collections of all supported sizes, the new one is split into Vector0 to Vector6 for the different dimensions of the main data array. Vector0 is a singleton object for the empty vector, Vector1 (used for collections up to size 32) is essentially the same as ArraySeq, all higher dimensions are finger trees.

I prefer to think of the nested arrays containing the data as hypercubes instead of trees. For a full VectorN we start with an n-dimensional hypercube data_n and slice off an n-1-dimensional prefix_n-1 and suffix_n-1. We continue to slice off at the beginning of the prefix and the end of the suffix until we reach prefix1 and suffix1. The length of each side of the initial hypercube is 32 (same as the old Vector). After slicing, the highest dimension of data can have a length up to 30, prefix1 and suffix1 can be up to 32, and the intermediate prefixes and suffixes up to 31. When the original data does not fill the entire hypercube, there may be empty parts both at the beginning (the offset) and the end. When the offset is 0 we call the vector "left-aligned". Notice that the slicing ensures that only the highest dimension of each data array can be incomplete. All lower dimensions always have a length of 32 and are filled entirely.
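
To make the layout concrete, here is a heavily simplified sketch of a dimension-2 vector and its indexing. The names (SketchVector2 etc.) are illustrative only and not the actual classes in this PR; offsets and bounds checks are omitted.

    // Minimal sketch of the finger layout at dimension 2 (illustrative names).
    // prefix1 and suffix1 are the one-dimensional fingers; data2 is the central
    // two-dimensional "hypercube" whose inner arrays are always full (length 32).
    final class SketchVector2[A](
      val prefix1: Array[AnyRef],        // 1 to 32 elements
      val len1: Int,                     // number of elements in prefix1, cached for indexing
      val data2: Array[Array[AnyRef]],   // 0 to 30 slices of exactly 32 elements each
      val suffix1: Array[AnyRef],        // 1 to 32 elements
      val length: Int                    // total number of elements
    ) {
      def apply(index: Int): A = {
        val elem =
          if (index < len1) prefix1(index)
          else {
            val io = index - len1        // offset into the part after prefix1
            if (io < data2.length * 32) data2(io >>> 5)(io & 31)
            else suffix1(io - data2.length * 32)
          }
        elem.asInstanceOf[A]
      }
    }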

Balancing requires that prefix1 and suffix1 are both non-empty. There is no obvious advantage from doing the same for higher-dimensional prefixes and suffixes; it would actually make appending and prepending more expensive. They only get rebalanced when it is necessary for prefix1 or suffix1. There is no requirement that the main data array be non-empty, either. For example, when appending an element to a completely filled and left-aligned Vector3, the existing prefix1, prefix2 and data3 become prefix1, prefix2 and prefix3 of a new Vector4. data4, suffix3 and suffix2 remain empty and suffix1 contains the new element.
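
To illustrate how appending respects these invariants, here is a hedged sketch of appended at dimension 2, using the hypothetical SketchVector2 from above (the step up to dimension 3 is only indicated, not implemented):

    object AppendSketch {
      def appended[A](v: SketchVector2[A], elem: A): SketchVector2[A] = {
        val e = elem.asInstanceOf[AnyRef]
        if (v.suffix1.length < 32) {
          // room in the suffix finger: copy-append a single element
          val s = java.util.Arrays.copyOf(v.suffix1, v.suffix1.length + 1)
          s(s.length - 1) = e
          new SketchVector2[A](v.prefix1, v.len1, v.data2, s, v.length + 1)
        } else if (v.data2.length < 30) {
          // suffix1 is full: push it into the main data array and start a fresh suffix1
          val d = java.util.Arrays.copyOf(v.data2, v.data2.length + 1)
          d(d.length - 1) = v.suffix1
          new SketchVector2[A](v.prefix1, v.len1, d, Array[AnyRef](e), v.length + 1)
        } else {
          // data2 is also full: the real implementation steps up to dimension 3 here
          sys.error("dimension growth is not shown in this sketch")
        }
      }
    }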

Building is done with a single NVectorBuilder for all dimensions. It can be initialized with an existing vector structure for more efficient concatenation. I did not get around to implementing an optimized concatenation where the right-hand side vector determines the alignment. Like most other algorithms for this data structure, it is quite easy to visualize how it works and to convince yourself that it is possible, but actually implementing it correctly is much harder. This would allow concatenation of two vectors of sizes m and n in O(min(n, m) + log(n+m)) time.

There is another builder class NVectorSliceBuilder for non-trivial slicing operations. The hypercube fingers are very well suited for slicing because any slicing will automatically result in array slices with dimensions in the correct order for a new vector. Only minor rebalancing is required in some cases.

Memory Usage

The old Vector has a huge overhead for small instances. In addition to the actual data arrays, a Vector object adds 56 bytes (in 64-bit HotSpot with compressed pointers). By having different classes for different dimensions, this goes down to 16 bytes for an NVector1, the same as an immutable.ArraySeq. The overhead grows up to 80 bytes for an NVector6. The break-even is at NVector4. Constant overhead for these larger sizes should be inconsequential. It could be reduced by not storing the intermediate prefix lengths, at the expense of slower access to the prefix.

Additionally, all instances built by VectorBuilder have their data arrays padded to 32, which constitutes an extra 124-byte overhead for a Vector of size 1 that was built with a VectorBuilder. NVectorBuilder always truncates the arrays to avoid this.

Benchmarking

Benchmarks were run with

bench/jmh:run -jvmArgs "-Xms256M -Xmx10G" scala.collection.immutable.VectorBenchmark2.nv
bench/jmh:run -jvmArgs "-Xms256M -Xmx10G" scala.collection.immutable.VectorBenchmark2.v

while the initialization of the unnecessary part (v or nv) was commented out. If you have more than 16G of memory you should be able to choose a larger heap size and run everything in one go.

Interactive diagrams can be found at http://szeiger.de/tmp/vector2/vector-results.html. All diagrams plot time against number of elements, both on logarithmic scales. The diagrams on that page are interactive so you can explore the exact numbers for all data points. Due to the double logarithmic scales and the huge range of instance sizes, even big differences can appear very small in the diagrams.

The diagram generation code is currently part of this draft PR. We may want to split it off into a separate project for more general use or just keep it here because it's only a small amount of code.

Performance

We can see small improvements for sequential apply and very similar results for random apply. Other than micro-optimizations, there isn't much that can be done here:

ApplyRandom
ApplySequential

Updating with random indexes shows the expected logarithmic growth but the new implementation is several times faster due to lower overhead:

UpdateRandom

Sequential updates are the Achilles heel of the radix-balanced finger trees, at least in theory. The movable finger of the old Vector allows sequential updates in what appears to be amortized constant time (i.e. an update positions the finger at the updated position and then further updates in the vicinity can access the location more directly). Because index computations are based on integers, which have to grow logarithmically, this is not truly constant, but as long as all indexes are represented by the same 32-bit Int type, the performance graph is essentially flat. The new Vector cannot do this optimization and therefore the logarithmic growth is directly visible when stepping up in vector dimensions. Fortunately, it is so much faster to begin with and the base-32 logarithm grows so slowly that it still wins at larger sizes:

UpdateSequential
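
For comparison, here is a hedged sketch of what an update in the main data area looks like in the new design, again using the hypothetical SketchVector2 from the Design section: every update copies the full path down to the affected slice, which is where the visible logarithmic growth comes from.

    object UpdateSketch {
      // Assumes len1 <= index < len1 + data2.length * 32, i.e. the index falls
      // into the main data area; prefix and suffix updates only copy one array.
      def updatedInData2[A](v: SketchVector2[A], index: Int, elem: A): SketchVector2[A] = {
        val io = index - v.len1
        val outer = java.util.Arrays.copyOf(v.data2, v.data2.length)  // copy the spine
        val inner = java.util.Arrays.copyOf(outer(io >>> 5), 32)      // copy the affected slice
        inner(io & 31) = elem.asInstanceOf[AnyRef]
        outer(io >>> 5) = inner
        new SketchVector2[A](v.prefix1, v.len1, outer, v.suffix1, v.length)
      }
    }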

Sequential appending and prepending is efficient and has low overhead in the new Vector. The old Vector also makes it efficient by positioning the finger appropriately but the overhead is much larger so that the new implementation ends up being several times faster:

Append
Prepend

Alternating between appending and prepending is a worst-case scenario for the old Vector because the finger always ends up in the wrong location (i.e. you get all the overhead but none of the benefit). The new implementation is about 35 to 40 times faster in this case:

Apprepend

All bulk-appends benefit from initializing an NVectorBuilder with the existing left-hand-side NVector, plus other optimizations. Appending an ArraySeq of the same size is several times faster:

BulkAppend100p

This is even more apparent when we repeatedly append a collection only 1/10 the size, where the new implementation is about 25 to 50 times faster in non-trivial cases:

BulkAppend10p

Both implementations have optimizations for small bulk-appends. Below a certain size limit for the collection being appended, individual appends are performed instead of using a builder. The new implementation optimizes this further: When the data to be appended fits completely into suffix1 (which allows for appends up to 31 elements depending on alignment), it is done as a bulk copy into the array without the need to create either a builder or intermediate vectors. As a result, appending a collection of size 2 is about 8 to 10 times faster:

BulkAppend2
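
A hedged sketch of this fast path, again on the hypothetical SketchVector2: when the appended elements fit into suffix1 together with its current contents, a single copy-and-fill of the suffix array is enough, with no builder and no intermediate vectors.

    object BulkAppendSketch {
      def appendedAllSmall[A](v: SketchVector2[A], xs: Array[AnyRef]): SketchVector2[A] = {
        require(v.suffix1.length + xs.length <= 32, "fast path applies only if everything fits into suffix1")
        val s = java.util.Arrays.copyOf(v.suffix1, v.suffix1.length + xs.length)
        System.arraycopy(xs, 0, s, v.suffix1.length, xs.length)
        new SketchVector2[A](v.prefix1, v.len1, v.data2, s, v.length + xs.length)
      }
    }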

When concatenating two vectors, we can make use of the fact that both are built from array slices. We should explore generalizing this to other array-based collections in the future but this will require new abstractions in the collections library. For now the optimization is limited to NVector. Instead of appending the right-hand side element by element to the NVectorBuilder, we can append a whole slice of 32 elements with (depending on alignment) one or two System.arraycopy calls. This is another 2 to 3 times faster than appending individual elements:

BulkAppendSame
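
The slice-wise appending can be pictured with the following hedged builder sketch (illustrative, not the NVectorBuilder code): a full 32-element source slice lands in the builder's current leaf array with one or two System.arraycopy calls, depending on how the builder's write position is aligned.

    final class SliceSketchBuilder {
      private var leaf = new Array[AnyRef](32) // current one-dimensional write buffer
      private var offset = 0                   // next free position in `leaf`

      private def flushLeaf(): Unit = {
        // push the full leaf into the higher dimensions (not shown in this sketch)
        leaf = new Array[AnyRef](32)
        offset = 0
      }

      def addSlice(slice: Array[AnyRef]): Unit = {
        val first = math.min(32 - offset, slice.length)
        System.arraycopy(slice, 0, leaf, offset, first)    // first copy, up to the leaf boundary
        offset += first
        if (offset == 32) flushLeaf()
        if (first < slice.length) {                        // second copy for the unaligned remainder
          System.arraycopy(slice, first, leaf, 0, slice.length - first)
          offset = slice.length - first
        }
      }
    }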

Iterating with foreach and iterator is pretty much the same in both implementations, with some wins for better optimization for small instance sizes in the new one:

Foreach
Iterator

With the balancing requirement of non-empty prefix1 and suffix1, both head and last are only one array indexing operation away in the new implementation. The old Vector still does well on head because the instances for the benchmark were built with a VectorBuilder (which always results in the focus being on the first element) but shows logarithmic growth on last, where the new implementation is still O(1):

Head
Last
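
In the sketch layout from the Design section this is exactly one array read per call (hedged sketch, illustrative names):

    object HeadLastSketch {
      // prefix1 and suffix1 are guaranteed non-empty for non-empty vectors,
      // so head and last never have to descend into the tree.
      def head[A](v: SketchVector2[A]): A = v.prefix1(0).asInstanceOf[A]
      def last[A](v: SketchVector2[A]): A = v.suffix1(v.suffix1.length - 1).asInstanceOf[A]
    }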

tail is 8 to 10 times faster in the new Vector. The rebalancing is amortized constant time but it gets a bit slower for higher dimensions because the overhead for managing the instances grows:

Tail

Building with a Builder is a bit faster for most instance sizes in the new implementation. The old one wins for very small instances because it doesn't truncate the arrays and therefore creates a huge constant overhead for small vectors (see above):

Build

map is essentially unoptimized in the old implementation. It uses the default implementation based on Iterator + Builder which is decently optimized and good in terms of big-O-complexity but we can still do much better by rebuilding the existing structure:

MapNew

The new implementation also has map-conserve semantics as an internal optimization. While a map call always results in a new NVector instance, the internal arrays (at all dimensions) are shared if none of the entries change. Since arrays are treated as fully immutable there are no ownership issues. I opted for a conservative implementation which is essentially free for the non-conserving case (i.e. there is no observable performance penalty when mapping everything to a new object). The performance improvements for map(identity) are still decent (but it could be about twice as fast with a dedicated mapConserve implementation):

MapIdentity
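
The conserving behaviour can be illustrated on a single slice with the following hedged sketch (not the actual internal code): the original array is returned unchanged unless some element maps to a different object.

    object MapConserveSketch {
      def mapConserveSlice(a: Array[AnyRef], f: AnyRef => AnyRef): Array[AnyRef] = {
        var result = a                       // shared with the input until the first change
        var i = 0
        while (i < a.length) {
          val x = f(a(i))
          if (x ne a(i)) {
            if (result eq a) result = java.util.Arrays.copyOf(a, a.length)
            result(i) = x
          }
          i += 1
        }
        result
      }
    }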

@scala-jenkins scala-jenkins added this to the 2.13.2 milestone Nov 10, 2019
@mdedetrich
Contributor

Wow, @Ichoran what are your thoughts on this?

@Ichoran
Contributor

Ichoran commented Nov 10, 2019

Wow, this looks fantastic, Stefan! I don't have time right now to read through all the code, but the bits that I sampled all look reasonable (fairly repetitive, but that's expected given the nature of the data structure). The algorithmic design seems sound in concept; I'll try it on collection-laws when I get a chance to see how it holds up in practice.

Anyway, this is great! Maybe we could introduce it in a separate dependency in the meantime so we don't have to wait for 2.14 to get it?

@odersky
Contributor

odersky commented Nov 10, 2019

This looks really impressive, Stefan! Can we swap the implementations in the 2.13 series? If not, make it the standard for 3.0 maybe?

@diesalbla diesalbla added the library:collections PRs involving changes to the standard collection library label Nov 10, 2019
* the prefix). The level is increased/decreased when the affected side plus main data is already full/empty
* - All arrays are left-aligned and truncated
*
* In addition to the data slices (`prefix`, `prefix2`, ..., `dataN`, ..., `prefix2`, `prefix1`) we store a running
Contributor

Is there any reference we may cite, regarding the design and implementation of radix-balanced finger trees? For instance, is this ICFP article closely related?

Member Author

As I mentioned above I am not aware of any pre-existing references. The Bolivar paper (I found a PDF here) covers RRB trees, like the two papers I referenced in the PR description. My implementation uses non-relaxed trees. Relaxed trees can greatly improve insertion and concatenation performance but will show worse results for indexed access. I think it would be interesting to see if RBF trees can be generalized to RRBF trees. I expect it to be possible in principle but with higher implementation complexity than for a relaxed version of normal RB trees.

The "Off-Tree Tail" and "Left Shadow" optimizations from section 2.3 are worth mentioning in this context. The fingers are a generalization of an off-tree tail. While the off-tree tail is only the last one-dimensional slice, a finger extends this to all dimensions from 1 to n-1 for an n-dimensional vector. The left shadow is implemented by the old Vector. The finger on the left now provides the same advantages with a more compact memory representation.

Member

Should this be:

(`prefix`, `prefix2`, ..., `dataN`, ..., `suffix2`, `suffix1`)

?

Member Author

@szeiger szeiger Nov 12, 2019


No, it should be (`prefix1`, `prefix2`, ..., `dataN`, ..., `suffix2`, `suffix1`). I got 3 of 5 names wrong...

@linasm
Contributor

linasm commented Nov 11, 2019

Amazing results! If 2.13 is really out of question, maybe https://github.com/scala/scala-collection-contrib would be the fastest way of making it available?

@dwijnand dwijnand modified the milestones: 2.13.2, 2.14.0-M1 Nov 11, 2019
@odd
Contributor

odd commented Nov 11, 2019

Very impressive! 🚀
Wanting to see which binary compatibility problems MiMa detected, I renamed the old implementation OVector and the new one Vector. This should make it easier to generate the MiMa breakage report and it also makes any classes in the standard library using Vector use the new implementation in their tests instead of the old (e.g. VectorMap). The code can be found at szeiger#2.

@retronym
Member

retronym commented Nov 12, 2019

We ought to warm up the benchmarks in a way that ensures we measure the megamorphic dispatch of call sites that range over NVectorN.

Right now, I see inlining logs like:

sbt:root> bench/jmh:run -jvmArgs -Xmx1G scala.collection.immutable.VectorBenchmark2.nvApprepend -f1 -psize=500 -jvmArgs -XX:+UnlockDiagnosticVMOptions -jvmArgs -XX:+PrintInlining
...
[info]                               @ 22   org.openjdk.jmh.infra.Blackhole::consume (49 bytes)   disallowed by CompilerOracle
[info]                               @ 19   scala.collection.immutable.VectorBenchmark2::nvApprepend (70 bytes)   force inline by CompilerOracle
[info]                                 @ 1   scala.collection.immutable.VectorBenchmark2::nv (5 bytes)   accessor
[info]                                 @ 6   scala.collection.immutable.VectorBenchmark2::nv (5 bytes)   accessor
[info]                                 @ 14   scala.collection.immutable.VectorBenchmark2::size (5 bytes)   accessor
[info]                                 @ 29   scala.collection.immutable.VectorBenchmark2::o (5 bytes)   accessor
[info]                                 @ 32   scala.collection.immutable.NVector2::appended (6 bytes)   inline (hot)
[info]                                  \-> TypeProfile (300500/300500 counts) = scala/collection/immutable/NVector2
[info]                                   @ 2   scala.collection.immutable.NVector2::appended (197 bytes)   inline (hot)
[info]                                     @ 18   scala.collection.immutable.NVectorStatics$::copyAppend (18 bytes)   inline (hot)
[info]                                       @ 5   java.util.Arrays::copyOf (13 bytes)   inline (hot)
[info]                                         @ 3   java.lang.Object::getClass (0 bytes)   (intrinsic)
[info]                                         @ 6   java.util.Arrays::copyOf (46 bytes)   (intrinsic)
[info]                                     @ 23   scala.collection.immutable.NVector2::length (5 bytes)   accessor
[info]                                     @ 30   scala.collection.immutable.NVector2::copy$default$1 (5 bytes)   accessor
[info]                                     @ 36   scala.collection.immutable.NVector2::copy$default$2 (5 bytes)   accessor
[info]                                     @ 42   scala.collection.immutable.NVector2::copy$default$3 (5 bytes)   accessor
[info]                                     @ 56   scala.collection.immutable.NVector2::copy (15 bytes)   inline (hot)

A quick and dirty way to see the effect is to disable JMH's forking and just run all experiments in its control JVM.

sbt:root> bench/jmh:run scala.collection.immutable.VectorBenchmark2.nvApprepend -f0 -psize=5000,50000,1,500

[info] Benchmark                     (size)  Mode  Cnt        Score       Error  Units
[info] VectorBenchmark2.nvApprepend    5000  avgt    5    92220.333 ±  9314.284  ns/op
[info] VectorBenchmark2.nvApprepend   50000  avgt    5  1029616.649 ± 41164.566  ns/op
[info] VectorBenchmark2.nvApprepend       1  avgt    5       22.443 ±     1.452  ns/op
[info] VectorBenchmark2.nvApprepend     500  avgt    5     9525.892 ±   391.606  ns/op
[success] Total time: 44 s, completed 12/11/2019 1:58:20 PM
sbt:root> bench/jmh:run scala.collection.immutable.VectorBenchmark2.nvApprepend -f0 -psize=500

[info] Benchmark                     (size)  Mode  Cnt     Score     Error  Units
[info] VectorBenchmark2.nvApprepend     500  avgt    5  8126.497 ± 231.999  ns/op

There are a few techniques to harden the benchmark against this problem.

If the only virtual callsites you want to protect are all in the benchmark code itself, you can indirect them through a private method that bears the @jmh.annotations.CompilerControl(DONT_INLINE) annotation.

If you are benchmarking calls into library code that contain the virtual call sites, you need to warm them up with various subclasses (in this case, various sizes of collection).

JMH has a command line option that lets you specify a set of benchmark methods to run during the warmup phase: -wm BULK -wmb <regex> (run bench/jmh:run -help for details).
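
A hedged sketch of the first technique (class, field and method names are illustrative, and Vector stands in for the NVector classes of this PR): the annotated private helper keeps the virtual appended call out of the inlined benchmark body.

    import org.openjdk.jmh.annotations.{Benchmark, CompilerControl, Scope, State}

    @State(Scope.Benchmark)
    class AppendDispatchBench {
      var nv: Vector[AnyRef] = Vector.fill(1000)(new AnyRef)
      var o: AnyRef = new AnyRef

      // The JIT may not inline through this call, so the dispatch on the
      // concrete vector class stays as megamorphic as it would be in real code.
      @CompilerControl(CompilerControl.Mode.DONT_INLINE)
      private def appendedIndirect(v: Vector[AnyRef], x: AnyRef): Vector[AnyRef] = v.appended(x)

      @Benchmark
      def append(): Vector[AnyRef] = appendedIndirect(nv, o)
    }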

Adding this to our setup:

    println("size = " + size)
    println("JVM: " + ManagementFactory.getRuntimeMXBean().getName())

Shows that:

sbt>  bench/jmh:run -jvmArgs -Xmx1G scala.collection.immutable.VectorBenchmark2.nvApprepend -f1 -psize=1,50,500,5000 -wm BULK

Warms up all sizes in the same JVM before running the measurement for a particular size. However, it will use a fresh JVM for each size, which means N^2 benchmark execution time. I don't see an extension point in JMH to customize this. We could patch this part of JMH to add support for a new warmup mode that shared a measurement JVM for parameter variations of a given benchmark.

Perhaps the simplest option for now would be some code in the setup method that runs each benchmark method at a few sizes.
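
A hedged sketch of what such a setup method could look like (illustrative code, with Vector standing in for NVector; the actual benchmark class in this PR is VectorBenchmark2): exercise the operations under test at several sizes so all dimensions are profiled before measurement starts.

    import org.openjdk.jmh.annotations.{Level, Scope, Setup, State}

    @State(Scope.Benchmark)
    class PollutedWarmupSketch {
      var vectors: Seq[Vector[String]] = Nil

      @Setup(Level.Trial)
      def polluteProfiles(): Unit = {
        // One vector per size range so every concrete dimension is seen by the JIT.
        vectors = Seq(1, 50, 500, 5000, 50000).map(n => Vector.tabulate(n)(_.toString))
        for (v <- vectors) {
          var w = v
          var i = 0
          while (i < 1000) { w = w.appended("x").prepended("y"); i += 1 }
          w.iterator.foreach(_ => ())   // also pollute the iterator call sites
        }
      }
    }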

* - prefix1 and suffix1 are never empty
* - Balancing does not cross the main data array (i.e. prepending never touches the suffix and appending never touches
* the prefix). The level is increased/decreased when the affected side plus main data is already full/empty
* - All arrays are left-aligned and truncated
Member

It would be good to encode some of these invariants in a checkConsistent method which we could incorporate into a ScalaCheck test suite that performs random operations and asserts that the internal structure is consistent.
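
Something along these lines, as a hedged sketch (checkConsistent here is a hypothetical stand-in for the structural validator, and Vector stands in for NVector):

    import org.scalacheck.{Gen, Prop, Properties}

    object VectorInvariantSpec extends Properties("Vector") {
      // Hypothetical stand-in for a validator that walks the internal arrays
      // and checks the invariants listed in the code comment above.
      def checkConsistent[A](v: Vector[A]): Boolean = v.length == v.iterator.size

      val ops: Gen[Vector[Int] => Vector[Int]] = Gen.oneOf(
        Gen.choose(0, 100).map(x => (v: Vector[Int]) => v.appended(x)),
        Gen.choose(0, 100).map(x => (v: Vector[Int]) => v.prepended(x)),
        Gen.const((v: Vector[Int]) => if (v.nonEmpty) v.tail else v)
      )

      property("random operations keep the structure consistent") =
        Prop.forAll(Gen.listOf(ops)) { fs =>
          checkConsistent(fs.foldLeft(Vector.empty[Int])((v, f) => f(v)))
        }
    }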

Member Author

There are validation methods in the JUnit test. I originally had them directly in NVector but factored them out to keep the code size down.

case that: NVector[_] => false
case o => super.equals(o)
}
}
Member

We should keep optimized implementations of equals / sameElements for Vector-to-Vector comparisons, short-circuiting when the lengths differ and avoiding element comparison when this and that structurally share a backing array.

Member Author

The old optimizations don't make sense here. The only case where you would benefit from the shortcut is when comparing the result of map(identity). If the vectors were built separately, they won't share any arrays. Doing structural comparisons on each array separately might be useful. This could cover cases such as dropping the head element and then consing it back on. The implementation would be rather complex and you need to take care not to compromise the more common cases though. You cannot handle all arrays individually because equal vectors can be balanced differently. In the old Vector the only possible difference was the offset but now we have to deal with all the different prefix lengths.

Member

@szeiger Would (NVector.tabulate(1000)(i => i)).equals(NVector.tabulate(999)(i => i)) short circuit based on a cheap length call, as per Vector?

Member Author

Yes, this uses the default implementation of sameElements which checks knownSize first (and the size of an IndexedSeq is always known).

@szeiger
Member Author

szeiger commented Nov 19, 2019

I ran some new benchmarks with megamorphic dispatch. The results (at http://szeiger.de/tmp/vector2mega/vector-results.html) unfortunately mean that the new vector is not always faster which makes the argument for it much harder.

  • apply is roughly 10% slower than the old version (both for sequential and random access)
  • head is roughly 30% slower than the old version
  • last starts out the same as head but is faster above 1000 elements because it is O(1) in the new design and O(log n) in the old one
  • Iterating with iterator is 10% to 20% slower than the old version

There are positive effects, too. Well, positive for my new implementation, because some operations in the old vector also suffer from the varying sizes. It can't be megamorphic method dispatch. My guess is that previously the same branches were taken and got highly optimized. Now we get more variation which has huge performance effects. For more complex algorithms this benefits the new implementation. Once the initial impact of the non-inlined call site has been absorbed, the rest is still specific to a certain dimension. The effect can be seen in sequential updates (last diagram on the page). The old vector used to be around 150μs independent of size. Now it starts at this level for small sizes but then appears to be growing logarithmically (which it really shouldn't, given the algorithm). Maybe @retronym has a better idea about what's going on.

(tail for size=1 looks much faster in the old vector now. The new vector shows similar results in the non-megamorphic benchmarks. I noticed these oddities before but didn't point them out because they affected both versions equally. This appears to be a benchmarking artifact. Even though the size is initialized as a parameter, HotSpot is apparently still able to constant-fold the whole call graph into an empty vector. ~5ns is typically the result I get on this machine for a no-op.)

@retronym
Member

head is roughly 30% slower than the old version

I believe you could fix this particular slowdown by storing NVector1.data1 / {NVector0..6}.prefix1 in the same field in the base class and pulling a single, final implementation of head up there.
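
A hedged sketch of that idea (illustrative names, not the actual classes): the leftmost one-dimensional array lives in a field of the shared base class so that head has a single, final implementation.

    abstract class SketchVectorBase[+A](protected val leftSlice: Array[AnyRef]) {
      def length: Int
      // Monomorphic: every concrete dimension passes its data1/prefix1 array up here.
      final def head: A =
        if (length == 0) throw new NoSuchElementException("empty vector")
        else leftSlice(0).asInstanceOf[A]
    }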

My guess is that previously the same branches were taken and got highly optimized.

Yes, benchmarking with polluted branch profiles (for the JIT or even the CPU branch predictor) is yet another best practice. If we want to dig in to this we could start by using JITWatch to compare.

@mkeskells
Contributor

Are there any benchmarks for allocations and retained memory usage? In large apps this is as much a problem as CPU usage.

CPU improvements look good

@szeiger
Member Author

szeiger commented Nov 20, 2019

I believe you could fix this particular slowdown by storing NVector1.data1 / {NVector0..6}.prefix1 in the same field in the base class and pulling a single, final implementation of head up there.

Yes, I had the same idea and I already did that yesterday and ran some benchmarks over night. The results look promising: http://szeiger.de/tmp/vector2mega2/vector-results.html. head is consistently faster again and other results appear to be unaffected. I'll run a variation of this tonight to get at least bimorphic dispatch for last.

Unfortunately apply is the one really important method that has no easy fix.

Are there any benchmarks for allocations and retained memory usage? In large apps this is as much a problem as CPU usage.

I didn't run any benchmarks for allocations. My expectation is that bulk operations may do some more one-time allocations (e.g. in NVectorBuilder.result). Allocations while building are the same (I even use the old algorithm for allocating new vector slices).

Any operation on the prefix or suffix (tail, init, appended, prepended, and updates in the prefix or suffix) will allocate less. Only sequential updates in the main area of large vectors are expected to do worse because they now have to do a full path copy every time.

Retained memory for small vectors should be much better than before and start to even out as instances get larger. Constant overhead is a bit higher in dimensions 5 and 6 but the small advantage for the old implementation only holds as long as the vectors are left-aligned and right-truncated (i.e. they were built by appending individual elements, not with a builder), otherwise the small constant savings are dwarfed by the padding overhead. Overall, retained memory should be very close to optimal for NVector.

NVectorIterator keeps a reference to the whole source vector while iterating but that was also true for VectorIterator. In both implementations there are no easy ways to improve this. You would have to trade retained memory for more allocations and slower performance.

@szeiger
Member Author

szeiger commented Nov 20, 2019

I pushed an update with the head changes and some cleanups. The bimorphic version of last doesn't have any advantages over the megamorphic one. It's actually a bit slower in my benchmarks (but that may be just noise).

@szeiger
Member Author

szeiger commented Nov 20, 2019

Interesting result: Using NVectorIterator everywhere (so you get a monomorphic iterator call and monomorphic calls into the Iterator's methods) is detrimental to iterator performance! This looks similar to the branch optimization issues we see with the old Vector. Perhaps splitting NVectorIterator into multiple versions specific to one level could be useful. I wanted to do that initially but having a single implementation turned out to be simpler.

@szeiger
Member Author

szeiger commented Nov 20, 2019

I benchmarked some alternatives for iterator. Unfortunately, all were much slower. It looks like we're already hitting the sweet spot of optimization with the separate iterators for levels 0 and 1 and a shared implementation for higher dimensions. If you look at the warmup data, it's clear that HotSpot is really screwing it up in this case. Warmup goes from small to large sizes and every single measurement along the way is much better than what you get in the end during the actual benchmarked iterations.

Of course, what this also means is that the iterator performance is very dependent on warmup and getting results within a few percent of the old implementation is probably fine.

@joshlemer
Member

@szeiger did you ever try a version of this new structure which does not split out implementations based on depth? i.e., just have all fields present, but perhaps null, and logic is dispatched based on some private[this] val depth: Int?

@szeiger
Member Author

szeiger commented Nov 21, 2019

Having everything in one class would add a lot of memory overhead for small instances, even more than the old vector (but with the truncated arrays we should still come out ahead in the end). After seeing the effects of branch prediction in the benchmarks for the old Vector, I doubt that it would be beneficial for performance. If you have to switch anyway in the critical code paths you may as well make HotSpot do it.

I'll see if I can run some tests for this. It shouldn't be that hard to try it for a subset of operations.

@szeiger
Member Author

szeiger commented Nov 21, 2019

Here's a single-class implementation: https://github.com/szeiger/scala/blob/wip/new-vector-one-class/src/library/scala/collection/immutable/NVector.scala. I started the benchmarking for apply but aborted when I saw the first results scrolling by. It's about 2x slower for sequential access in small vectors. A few methods that operate on the prefix and suffix can be written with a single implementation without looking at the dimension but other than last for small sizes they are already much faster.

@smarter
Member

smarter commented Nov 21, 2019

Have you tried benchmarking with graal? Might be useful for future-proofing the work done here.

@retronym
Member

Having everything in one class would add a lot of memory overhead for small instances,

Another way to factor this out would be to keep all the subclasses for the data, but move all the implementations up into the base class. Those implementations would look a lot like the ones we have today: switch initially on something to discriminate the size, then cast to the subclass to get out the data.

I would guess that the fastest way would be to have a depth field that could be used in a (depth: @switch) match { ... }. You could avoid the space cost of that field and instead use a cascade of isInstanceOf checks.
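
A hedged sketch of the depth-field variant (illustrative names and deliberately simplified indexing):

    import scala.annotation.switch

    abstract class DepthVector(val depth: Int) {
      // A single, final entry point: switch on the stored depth, then cast to the
      // concrete subclass instead of relying on megamorphic virtual dispatch.
      final def elem(i: Int): AnyRef = (depth: @switch) match {
        case 0 => throw new IndexOutOfBoundsException(i.toString)
        case 1 => this.asInstanceOf[Depth1].data1(i)
        case 2 => this.asInstanceOf[Depth2].data2(i >>> 5)(i & 31)
        case _ => sys.error("higher depths omitted from this sketch")
      }
    }
    object Depth0 extends DepthVector(0)
    final class Depth1(val data1: Array[AnyRef]) extends DepthVector(1)
    final class Depth2(val data2: Array[Array[AnyRef]]) extends DepthVector(2)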

This is all getting close to the trickery from:
https://shipilev.net/blog/2015/black-magic-method-dispatch/#_cheating_the_runtime_2

The warnings and conclusions at the end of that article are worth reading. As @smarter cautions, we don't want to overfit to C2 without also trying Graal.

@retronym
Member

retronym commented Nov 22, 2019

One small improvement would be to share a field for length() to make that monomorphic. I've added a commit to that effect in: retronym#78.

I've also pushed a commit that makes apply a final method of NVector which dispatches on a new field depth. It then statically dispatches to the sub-class-specific implementations.

@retronym
Member

I ran some new benchmarks with megamorphic dispatch

@szeiger What change did you make to pollute the profiles? I don't see any changes to VectorBenchmark2 pushed to this PR.

@retronym
Member

It's about 2x slower for sequential access in small vectors.

Just a guess, but this could be due to the new implementation using explicit bounds checking, whereas before NVector1.apply was written to use implicit bounds checking on array access.

This suggests yet another sort of profile pollution (😅) -- calling Vector(-1) and Vector(Int.MaxValue) for each dimension of NVector.
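
A hedged sketch of such an exception-path warmup (illustrative, with Vector standing in for NVector):

    object BoundsCheckWarmup {
      // Call apply with out-of-range indices on vectors of each dimension so the
      // bounds-check failure branches are also part of the warmed-up profile.
      def warm(vectors: Seq[Vector[AnyRef]]): Unit =
        for (v <- vectors; i <- Seq(-1, Int.MaxValue))
          try v(i) catch { case _: IndexOutOfBoundsException => () }
    }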

@szeiger
Member Author

szeiger commented Nov 22, 2019

Here's a complete run with the latest Graal CE using mixed-size warmup: http://szeiger.de/tmp/vector2graalce/vector-results.html

TL;DR: There are no big surprises that would change the general performance assessment. The problem with the 10% lower performance in apply remains, the performance issue in last is gone on Graal.

There are some notable performance differences, and in most cases they benefit the old implementation, but these tend to be benchmarks where the new vector has a huge performance lead on HotSpot anyway, so the lead becomes smaller while the new implementation is still much faster in the end.

  • appended is much faster for the old vector and a bit faster for the new one. The new one is still much faster than the old one.
  • Alternating appended / prepended: No significant changes
  • apply is a bit faster for both, the old implementation still has roughly a 10% performance advantage.
  • The Builder for the old vector has the same performance as on HotSpot, the new one loses its small performance advantage and runs at almost exactly the same speed (slower than on HotSpot)
  • The regular appendedAll (100% and 10% size benchmarks) is much faster for the old vector and much slower for the new one than on HotSpot. A significant performance advantage for the new one remains.
  • The optimized appendedAll for small suffixes is much faster than on HotSpot for both implementations.
  • Optimized vector appending is a bit slower than on HotSpot in the new implementation, the (non-optimized) counterpart in the old implementation is a bit faster. No big change in general.
  • foreach: No significant changes
  • head: Both are significantly faster, no change in relative performance
  • last: Both are a bit faster, the new implementation more so than the old one. This means that the new implementation is now faster above size 1, not only above size 1000.
  • Iterating with iterator: Same performance for the old implementation. The new one gets a few percent faster and is now at the same level.
  • map(identity): Similar performance for the old vector, the new one is more than twice as slow as on HotSpot (but still much faster than the old one)
  • Mapping to a new value: The old implementation gets a bit faster, the new one a bit slower. No big changes.
  • prepend: The old implementation is roughly twice as fast, the new one a bit slower. The new one is still much faster than the old one.
  • tail: Both are significantly faster, same relative performance. HotSpot was able to elide the trivial size=1 case for the old, monomorphic call in the benchmark but not the new, polymorphic one. Graal seems to be able to do it for both.
  • update: The branch prediction issue we saw with HotSpot for the old implementation is gone, the graph shows the expected constant-time sequential update performance and it is twice as fast as on HotSpot. The new implementation is a bit slower now, still much faster than the old one.

It is also interesting to compare these results to the original ones with single-size warmup. Most of the cases where we see a significant speed-up in Graal (appended (old implementation), regular appendedAll (old), small suffix appendedAll (both), prepended (old)) are essentially back to their single-size warmup performance on HotSpot. Graal doesn't fall into the branch prediction trap that kills performance on HotSpot if you warm up with the "wrong" sizes.

@lrytz
Member

lrytz commented Mar 10, 2020

@dwijnand could you summarize the jardiff comparison you did for this PR?

@dwijnand
Member

I ran subsets of the community build and compared its locally published jars with jardiff using Scala:

  • 2.13.2-bin-6f86e6b-SNAPSHOT, which was, before Stefan's last force push, the commit this PR branch was building from (6f86e6b), with
  • 2.13.2-bin-28ff22a-SNAPSHOT, which was, also before the last push, the head of this PR (28ff22a)

against 4 projects:

There's some unfortunate sbt-osgi/dbuild interaction noise going on in scala-parallel-collections, but outside of that and version-string differences, the jars were all identical. Which, in turn, means that any combination of already-built library jars and library jars rebuilt with 2.13.2 Scala would work with either the old or the new 2.13.x scala-library.jar.

@SethTisue SethTisue changed the title Radix-Balanced Finger Tree Vectors Replace scala.collection.immutable.Vector with a completely different data structure ("radix-balanced finger tree vectors"), for performance Mar 10, 2020
@lrytz
Member

lrytz commented Mar 10, 2020

If no objections are raised, I'll merge this PR in the next 1-2 days.

@dwijnand dwijnand changed the title Replace scala.collection.immutable.Vector with a completely different data structure ("radix-balanced finger tree vectors"), for performance Rewrite Vector (now "radix-balanced finger tree vectors"), for performance Mar 17, 2020
@lrytz lrytz merged commit b4428c8 into scala:2.13.x Mar 18, 2020
@cbfiddle

Has anyone tried to use the builder to create a vector of the maximum size?
Unfortunately, I am not in a position to try it, but looking at the code, I don't think it would work.
} else if (xor < WIDTH6) { // level = 5
WIDTH6 does not take into account the doubled maximum size.
a6((idx >>> BITS5) & MASK) = a5
Ditto for MASK.

@mdedetrich
Contributor

@cbfiddle I would file an issue if you validate that it's an actual problem.

@cbfiddle

It appears there is another size limitation. When a Vector is created by repeated use of "appended", the prefix2 through prefix5 arrays are limited to containing 30 elements, even though the data structure and the builder permit 31. That is because these prefix arrays are initialized from the previous data array, which is limited to 30 elements, and never extended. An analogous limitation applies to prepended. The result is that appended/prepended throws an exception before the advertised maximum size = Int.MaxValue (2^31-1) is reached [see LASTWIDTH].

@Ichoran
Contributor

Ichoran commented Jul 19, 2021

@cbfiddle - The design of Vector has always been to have a maximum of 2^30 - 1 elements. If there is any documentation that indicates otherwise (e.g. Int.MaxValue or Int.MaxValue-1 or any such thing), it is in error and should be fixed.

@Ichoran
Contributor

Ichoran commented Jul 19, 2021

When a Vector is created by repeated use of "appended", the prefix2 through prefix5 arrays are limited to containing 30 elements

I haven't looked at the code, but I don't think this is possible. This would leave holes in the trie structure and mess up indexing. Are you sure? If you do "appended" 1025 times off an empty vector, then v(1024) should work but might pick up the wrong index, and v(1022) should not work, because it would index into the last slot of prefix2.

@cbfiddle

@Ichoran I don't know if you would call a comment documentation, but the definition of LASTWIDTH says that directly. I copied and pasted the text. The existence of LASTWIDTH is an indication that the author intended to support 2^31-1 elements, but it appears that intention was not carried all the way through. I would suggest looking at the code. When a prefix is passed to the constructor, a cumulative element count is also passed in, which is used for indexing, thus avoiding the problem you describe. It appears that indexing works all the way to 2^31-1, but construction does not.

@Atry
Contributor

Atry commented Mar 22, 2022

Why isn't vBulkAppendSame O(log(n))? I suppose merging two finger trees is an O(log(n)) operation.

Labels
library:collections PRs involving changes to the standard collection library release-notes worth highlighting in next release notes