
Reduce iterator allocations on hot paths #10745

Merged
4 commits merged into open-telemetry:main on Apr 4, 2024

Conversation

johnbley
Member

@johnbley johnbley commented Mar 4, 2024

Allocating iterators on hot paths adds unnecessary overhead. Prefer forEach or indexed (int i = 0) iteration in these locations. This is a straightforward optimization that doesn't add code complexity.

I looked for static analysis tools to do this across the code base and did not find anything particularly great. IntelliJ's "Loop can be collapsed with Stream API" inspection came the closest, but it sometimes produced uglier code than the original, and it was sometimes questionable whether the change affected performance at all.

This is not an exhaustive exploration of this topic, but just a first and easy cut at it.

@johnbley johnbley requested a review from a team as a code owner March 4, 2024 13:31
Contributor

@steverao steverao left a comment


I'm not sure these adjustments are necessary. It would be better to have some performance comparisons. If there is no particularly obvious performance gap, it may be better not to make the change, because the entire code base has a variety of usages. What do you think? @laurit

@trask
Member

trask commented Mar 12, 2024

hey @johnbley, can you send a JMH test showing the difference between

for (Object item : list) {
  ...
}

and

for (int i = 0; i < list.size(); i++) {
  Object item = list.get(i);
  ...
}

I would like to think that any difference would get JIT'd away, but definitely curious now...

you can add any new JMH tests here: https://github.com/open-telemetry/opentelemetry-java-instrumentation/tree/main/javaagent-tooling/src/jmh

@johnbley
Member Author

OK, so I looked into benchmarking this and found some interesting stuff.

First, a bit of background:

I've been doing this stuff for more than two decades now, and (historically) for optimizing Java instrumentation I've found that paying attention to each allocation on the instrumented path is a great way to reduce overhead.

I threw together a simple utility that makes 100,000 faked-out servlet calls (which don't use the real servlet implementation and allocate no objects themselves). Our instrumentation generated ~10.8 million garbage objects (108 per instrumented invocation) on the main path alone, ignoring premain overhead (which is substantial) and the cost of exporting. (The byte volume of this main-loop garbage was 340 MB, or 3.4 KB per invocation.)

This analysis generated a lot of ideas about how to improve things, some of which are easy and quick, like open-telemetry/opentelemetry-java#6244 and this PR. Neither of these made massive strides in allocation byte volume, but, again, these seemed like pretty obvious "this is better and doesn't make the code much more complex, if at all" wins. I'll write up more on the larger (and more impactful) optimizations I found when I have time to finish the code-path analysis for the refactorings they require.
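(As an aside: per-thread allocation volume like the numbers above can be measured without a full profiler via HotSpot's com.sun.management.ThreadMXBean. This is an illustrative sketch, not the utility described above, whose code isn't shown here; it relies on a HotSpot-specific API and may not work on other JVMs.)

```java
import java.lang.management.ManagementFactory;

public class AllocMeter {
  // HotSpot-specific extension of the standard ThreadMXBean
  private static final com.sun.management.ThreadMXBean THREADS =
      (com.sun.management.ThreadMXBean) ManagementFactory.getThreadMXBean();

  /** Approximate bytes allocated by the current thread while running task. */
  public static long allocatedBytes(Runnable task) {
    long tid = Thread.currentThread().getId();
    long before = THREADS.getThreadAllocatedBytes(tid);
    task.run();
    return THREADS.getThreadAllocatedBytes(tid) - before;
  }

  public static void main(String[] args) {
    // measure a loop that allocates 1000 small objects
    long bytes = allocatedBytes(() -> {
      Object[] garbage = new Object[1000];
      for (int i = 0; i < garbage.length; i++) {
        garbage[i] = new byte[64];
      }
    });
    System.out.println("allocated ~" + bytes + " bytes");
  }
}
```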

OK, so:

  • I focused on ways of iterating short loops (n = 3 or 4), since those are the bulk of the iterators in this patch - we don't care about the cost of iterating a billion elements, because we don't tend to have data structures that large in our code.
  • Probably surprising nobody, a JMH microbenchmark of just short loops iterating in various ways found no effective difference in allocation rate or speed between forEach, for(int i=0;...), and Iterator. In tight microbenchmarks the JIT does in fact do a good job at escape analysis and elides the allocation. I'm not going to add that benchmark to this PR because it doesn't show anything interesting.
  • However, ./gradlew :benchmark-overhead-jmh:jmh showed gc.alloc.rate.norm going from 41559 B/op to 41328 B/op. This is pretty consistent with what the "analyze each allocated object" approach would have predicted for the reduction (9 objects at 24 bytes per object). Why? Probably because the JIT didn't optimize the iterators away in larger/more complex call paths. I was unable to compare cpu/time per op because my laptop won't run this benchmark reliably, but I doubt an optimization this small would show up in actual instrumentation latency overhead yet.
  • I also drew up a deliberately "bad" (by JMH philosophy) microbenchmark (throwaway code below) that uses no warmup loops, and found that forEach and for(int i=0;...) are always fast from the first invocation, while Iterator only becomes fast later, after the JIT decides it's worth optimizing.
  • So these changes (avoiding Iterator allocations) do in fact seem to affect performance in several real-world ways.

"Bad" (no warmup loops) microbenchmark data - you can see that the Iterator method takes a while to become fast, but the others always start fast with the first invocations - I see similar numbers with several JVM versions:

forEaches took 55
forIndex took 44
iterators took 196

forEaches took 43
forIndex took 47
iterators took 112

forEaches took 39
forIndex took 31
iterators took 63

forEaches took 40
forIndex took 32
iterators took 62

forEaches took 38
forIndex took 31
iterators took 61

"Bad" microbenchmark code:

import java.util.ArrayList;
import java.util.Arrays;

public class Bar {
  public static class SomeObject {
    public void doNothing() {}
  }

  public static final ArrayList<SomeObject> TheList =
      new ArrayList<>(Arrays.asList(new SomeObject(), new SomeObject(), new SomeObject()));

  // enhanced for over a List allocates an Iterator per loop (until the JIT elides it)
  public static void iterators(final int n) {
    for (int i = 0; i < n; i++) {
      for (SomeObject s : TheList) {
        s.doNothing();
      }
    }
  }

  public static void forEaches(final int n) {
    for (int i = 0; i < n; i++) {
      TheList.forEach(s -> s.doNothing());
    }
  }

  public static void forIndex(final int n) {
    for (int i = 0; i < n; i++) {
      int size = TheList.size();
      for (int j = 0; j < size; j++) {
        TheList.get(j).doNothing();
      }
    }
  }

  public static void main(String[] args) {
    final int n = 10000000;

    for (int i = 0; i < 10; i++) {
      long start = System.currentTimeMillis();
      forEaches(n);
      System.out.println("forEaches took " + (System.currentTimeMillis() - start));

      start = System.currentTimeMillis();
      forIndex(n);
      System.out.println("forIndex took " + (System.currentTimeMillis() - start));

      start = System.currentTimeMillis();
      iterators(n);
      System.out.println("iterators took " + (System.currentTimeMillis() - start));

      System.out.println();
      try { Thread.sleep(1000); } catch (Exception e) {}
    }
  }
}

-    for (SpanLinksExtractor<? super REQUEST> spanLinksExtractor : spanLinksExtractors) {
-      spanLinksExtractor.extract(spanLinksBuilder, parentContext, request);
-    }
+    spanLinksExtractors.forEach(
Contributor


I didn't try this out, but I suspect this is a capturing lambda that might also allocate. I think we should run some microbenchmarks to verify whether it allocates, and whether it ends up allocating more or less than the iterator. It might be easier to replace the lists with arrays and keep using the extended for on them, since iterating an array doesn't need to allocate anything. Alternatively, a counted loop over the array would also avoid allocating the iterator, though the extended for will probably look nicer.
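(A sketch of that suggestion, with an illustrative Extractor interface standing in for the real extractor types: the list is copied to an array once at construction time, after which the extended for over the array compiles down to an index-based loop and allocates nothing per iteration.)

```java
import java.util.List;

public class ArrayIteration {
  // illustrative stand-in for SpanLinksExtractor etc.
  interface Extractor {
    void extract(StringBuilder out);
  }

  private final Extractor[] extractors;

  ArrayIteration(List<Extractor> extractors) {
    // one-time copy at construction; the empty-array form per Shipilev's advice
    this.extractors = extractors.toArray(new Extractor[0]);
  }

  String runAll() {
    StringBuilder out = new StringBuilder();
    for (Extractor e : extractors) { // no Iterator is allocated for arrays
      e.extract(out);
    }
    return out.toString();
  }
}
```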

Member Author


Based on my understanding of invokedynamic and lambda compilation this doesn't look like an allocating/capturing instance, but I could be wrong. Also totally happy to switch many of these fields from List<? extends Foo> to simple Foo[] - just let me know which style you want.
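(The distinction can be observed directly. On HotSpot, a non-capturing lambda evaluates to a cached instance, while a capturing lambda typically allocates a fresh object per evaluation; note this is observed JVM behavior, not a spec guarantee, so the identity checks below are HotSpot-specific.)

```java
import java.util.function.Supplier;

public class LambdaCapture {
  // captures nothing: on HotSpot the metafactory returns a cached instance
  static Supplier<Integer> nonCapturing() {
    return () -> 42;
  }

  // captures x: on HotSpot a new instance is created per evaluation
  static Supplier<Integer> capturing(int x) {
    return () -> x;
  }

  public static void main(String[] args) {
    System.out.println(nonCapturing() == nonCapturing()); // true on HotSpot
    System.out.println(capturing(7) == capturing(7));     // false on HotSpot
  }
}
```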


   BySpanKey(Set<SpanKey> spanKeys) {
-    this.spanKeys = spanKeys;
+    this.spanKeys = spanKeys.toArray(new SpanKey[spanKeys.size()]);
Contributor


Using new SpanKey[0] might be a bit more performant on JDK 8: https://shipilev.net/blog/2016/arrays-wisdom-ancients/
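(For reference, the two toArray forms being compared, shown with a stand-in String set rather than the actual SpanKey type; per the linked article, the empty-array form is as fast or faster on modern JVMs because toArray can allocate the correctly sized array internally.)

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

public class ToArrayStyles {
  public static void main(String[] args) {
    Set<String> spanKeys = new LinkedHashSet<>(Arrays.asList("server", "client", "producer"));

    // presized form: the caller sizes the array up front
    String[] presized = spanKeys.toArray(new String[spanKeys.size()]);

    // empty-array form: toArray allocates the right-sized array itself
    String[] viaEmpty = spanKeys.toArray(new String[0]);

    System.out.println(Arrays.equals(presized, viaEmpty)); // true
  }
}
```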

Member Author


Happy to change if you like, but note that this is not in the hot path and only happens per-Instrumenter at startup.

@johnbley
Member Author

johnbley commented Apr 1, 2024

Happy to change based on some of the feedback here; please just let me know what you think.

@trask
Member

trask commented Apr 1, 2024

can you add a new section after https://github.com/open-telemetry/opentelemetry-java-instrumentation/blob/main/docs/contributing/style-guideline.md#javautilstreamstream-usage to document the usage recommendations?

@johnbley
Member Author

johnbley commented Apr 2, 2024

> can you add a new section after https://github.com/open-telemetry/opentelemetry-java-instrumentation/blob/main/docs/contributing/style-guideline.md#javautilstreamstream-usage to document the usage recommendations?

I went with @laurit's recommendation and dropped the stream API usage entirely in favor of simple arrays.

     // operation listeners run after span start, so that they have access to the current span
     // for capturing exemplars
     long startNanos = getNanos(startTime);
-    for (OperationListener operationListener : operationListeners) {
-      context = operationListener.onStart(context, attributes, startNanos);
+    for (int i = 0; i < operationListeners.length; i++) {
Contributor

@laurit laurit Apr 3, 2024


Could leave this as for (OperationListener operationListener : operationListeners) {.
Though since doEnd needs to iterate in reverse order and has to use a counted loop anyway, using a counted loop here is also fine.
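(The symmetry in question, sketched with a stand-in OperationListener interface and method signatures simplified from the real ones: listeners run in order on start and in reverse order on end, and the reverse direction forces a counted loop regardless.)

```java
public class ListenerLoops {
  // illustrative stand-in; the real interface takes Context/Attributes parameters
  interface OperationListener {
    String onStart(String context);
    String onEnd(String context);
  }

  // forward order on start
  static String start(OperationListener[] listeners, String context) {
    for (int i = 0; i < listeners.length; i++) {
      context = listeners[i].onStart(context);
    }
    return context;
  }

  // reverse order on end; this direction requires an indexed loop
  static String end(OperationListener[] listeners, String context) {
    for (int i = listeners.length - 1; i >= 0; i--) {
      context = listeners[i].onEnd(context);
    }
    return context;
  }
}
```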

Member Author


Yeah, I kept it that way for the symmetry.

@trask
Member

trask commented Apr 4, 2024

@johnbley can you add a new section at the end of this page to mention avoiding iterator allocations on hot paths?

this helps us to have something to point people at for maintaining consistency

@github-actions github-actions bot requested a review from theletterf April 4, 2024 12:23
@johnbley
Member Author

johnbley commented Apr 4, 2024

> @johnbley can you ... mention avoiding iterator allocations on hot paths?

Took a stab at it, let me know what you think.

@trask
Member

trask commented Apr 4, 2024

> Took a stab at it, let me know what you think.

Thanks!

@trask trask merged commit b381b01 into open-telemetry:main Apr 4, 2024
49 checks passed
@johnbley johnbley deleted the reduce_iterator_allocations branch April 5, 2024 12:50