
Reduce iterator allocations on hot paths #10745

Merged
4 commits merged into open-telemetry:main on Apr 4, 2024

Conversation

johnbley
Member

@johnbley johnbley commented Mar 4, 2024

Allocating iterators on hot paths adds unnecessary overhead. Prefer forEach or indexed (int i = 0) iteration in these locations. This is a straightforward optimization that doesn't add code complexity.

I looked for static analysis tools to do this across the code base and did not find anything particularly great. IntelliJ's "Loop can be collapsed with Stream API" inspection came the closest, but it sometimes produced uglier code than the original, and it was sometimes questionable whether the change affected performance at all.

This is not an exhaustive exploration of this topic, but just a first and easy cut at it.

@johnbley johnbley requested a review from a team as a code owner March 4, 2024 13:31
Contributor

@steverao steverao left a comment


I'm not sure these adjustments are necessary. It would be better to have some performance comparisons. If there is no particularly obvious performance gap, it may be better not to make the change, because the entire code base has a variety of usages. What do you think? @laurit

@trask
Member

trask commented Mar 12, 2024

hey @johnbley, can you send a JMH test showing the difference between

for (Object item : list) {
  ...
}

and

for (int i = 0; i < list.size(); i++) {
  Object item = list.get(i);
  ...
}

I would like to think that any difference would get JIT'd away, but definitely curious now...

you can add any new JMH tests here: https://github.com/open-telemetry/opentelemetry-java-instrumentation/tree/main/javaagent-tooling/src/jmh

@johnbley
Member Author

OK, so I looked into benchmarking this and found some interesting stuff.

First, a bit of background:

I've been doing this stuff for more than two decades now, and (historically) for optimizing Java instrumentation I've found that paying attention to each allocation on the instrumented path is a great way to reduce overhead.

I threw together a simple utility that makes 100,000 faked-out servlet calls (which don't use the real servlet implementation and allocate no objects themselves). Our instrumentation generated ~10.8 million garbage objects (108 per instrumented invocation) on the main path alone, ignoring premain overhead (which is substantial) and the cost of exporting. (The byte volume of this main-loop garbage was 340 MB, or 3.4 KB per invocation.)

This analysis generated a lot of ideas about how to improve things, some of which are easy and quick, like open-telemetry/opentelemetry-java#6244 and this PR. Neither of these made massive strides in allocation byte volume, but, again, these seemed like pretty obvious "this is better and doesn't make the code much more complex, if at all" wins. I'll write up more on the larger (and more impactful) optimizations I found when I have time to finish the code-path analysis for the refactorings they require.
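(As an aside: per-thread allocation volume like the numbers above can be measured without a full profiler via HotSpot's com.sun.management.ThreadMXBean. This is an illustrative sketch, not the utility described above, whose code isn't shown here; it relies on a HotSpot-specific API and may not work on other JVMs.)

```java
import java.lang.management.ManagementFactory;

public class AllocMeter {
  // HotSpot-specific extension of the standard ThreadMXBean
  private static final com.sun.management.ThreadMXBean THREADS =
      (com.sun.management.ThreadMXBean) ManagementFactory.getThreadMXBean();

  /** Approximate bytes allocated by the current thread while running task. */
  public static long allocatedBytes(Runnable task) {
    long tid = Thread.currentThread().getId();
    long before = THREADS.getThreadAllocatedBytes(tid);
    task.run();
    return THREADS.getThreadAllocatedBytes(tid) - before;
  }

  public static void main(String[] args) {
    // measure a loop that allocates 1000 small objects
    long bytes = allocatedBytes(() -> {
      Object[] garbage = new Object[1000];
      for (int i = 0; i < garbage.length; i++) {
        garbage[i] = new byte[64];
      }
    });
    System.out.println("allocated ~" + bytes + " bytes");
  }
}
```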

OK, so:

  • I focused on ways of iterating short loops (n = 3 or 4), since those are the bulk of the iterators in this patch - we don't care about the cost of iterating a billion elements, because we don't tend to have data structures that large in our code.
  • Probably surprising nobody, a JMH microbenchmark of just short loops iterating in various ways found no effective difference in allocation rate or speed between forEach, for(int i=0;...), and Iterator. In tight microbenchmarks the JIT does in fact do a good job at escape analysis and elides the allocation. I'm not going to add that benchmark to this PR because it doesn't show anything interesting.
  • However, ./gradlew :benchmark-overhead-jmh:jmh showed gc.alloc.rate.norm going from 41559 B/op to 41328 B/op. This is pretty consistent with what the "analyze each allocated object" approach would have predicted for the reduction (9 objects at 24 bytes per object). Why? Probably because the JIT didn't optimize the iterators away in larger/more complex call paths. I was unable to compare cpu/time per op because my laptop won't run this benchmark reliably, but I doubt an optimization this small would show up in actual instrumentation latency overhead yet.
  • I also drew up a deliberately "bad" (by JMH philosophy) microbenchmark (throwaway code below) that uses no warmup loops, and found that forEach and for(int i=0;...) are always fast from the first invocation, while Iterator only becomes fast later, after the JIT decides it's worth optimizing.
  • So these changes (avoiding Iterator allocations) do in fact seem to affect performance in several real-world ways.

"Bad" (no warmup loops) microbenchmark data - you can see that the Iterator method takes a while to become fast, but the others always start fast with the first invocations - I see similar numbers with several JVM versions:

forEaches took 55
forIndex took 44
iterators took 196

forEaches took 43
forIndex took 47
iterators took 112

forEaches took 39
forIndex took 31
iterators took 63

forEaches took 40
forIndex took 32
iterators took 62

forEaches took 38
forIndex took 31
iterators took 61

"Bad" microbenchmark code:

import java.util.ArrayList;
import java.util.Arrays;

public class Bar {
  public static class SomeObject {
    public void doNothing() {}
  }

  public static final ArrayList<SomeObject> TheList =
      new ArrayList<>(Arrays.asList(new SomeObject(), new SomeObject(), new SomeObject()));

  // enhanced for over a List allocates an Iterator per loop (until the JIT elides it)
  public static void iterators(final int n) {
    for (int i = 0; i < n; i++) {
      for (SomeObject s : TheList) {
        s.doNothing();
      }
    }
  }

  public static void forEaches(final int n) {
    for (int i = 0; i < n; i++) {
      TheList.forEach(s -> s.doNothing());
    }
  }

  public static void forIndex(final int n) {
    for (int i = 0; i < n; i++) {
      int size = TheList.size();
      for (int j = 0; j < size; j++) {
        TheList.get(j).doNothing();
      }
    }
  }

  public static void main(String[] args) {
    final int n = 10000000;

    for (int i = 0; i < 10; i++) {
      long start = System.currentTimeMillis();
      forEaches(n);
      System.out.println("forEaches took " + (System.currentTimeMillis() - start));

      start = System.currentTimeMillis();
      forIndex(n);
      System.out.println("forIndex took " + (System.currentTimeMillis() - start));

      start = System.currentTimeMillis();
      iterators(n);
      System.out.println("iterators took " + (System.currentTimeMillis() - start));

      System.out.println();
      try { Thread.sleep(1000); } catch (Exception e) {}
    }
  }
}

-    for (SpanLinksExtractor<? super REQUEST> spanLinksExtractor : spanLinksExtractors) {
-      spanLinksExtractor.extract(spanLinksBuilder, parentContext, request);
-    }
+    spanLinksExtractors.forEach(
Contributor


I didn't try this out, but I suspect this is a capturing lambda that might also allocate. I think we should run some microbenchmarks to verify whether it allocates, and whether it ends up allocating more or less than the iterator. It might be easier to replace the lists with arrays and keep using the extended for on them, since iterating an array doesn't need to allocate anything. Alternatively, a counted loop over the array would also avoid allocating the iterator, though the extended for will probably look nicer.
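(A sketch of that suggestion, with an illustrative Extractor interface standing in for the real extractor types: the list is copied to an array once at construction time, after which the extended for over the array compiles down to an index-based loop and allocates nothing per iteration.)

```java
import java.util.List;

public class ArrayIteration {
  // illustrative stand-in for SpanLinksExtractor etc.
  interface Extractor {
    void extract(StringBuilder out);
  }

  private final Extractor[] extractors;

  ArrayIteration(List<Extractor> extractors) {
    // one-time copy at construction; the empty-array form per Shipilev's advice
    this.extractors = extractors.toArray(new Extractor[0]);
  }

  String runAll() {
    StringBuilder out = new StringBuilder();
    for (Extractor e : extractors) { // no Iterator is allocated for arrays
      e.extract(out);
    }
    return out.toString();
  }
}
```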

Member Author


Based on my understanding of invokedynamic and lambda compilation this doesn't look like an allocating/capturing instance, but I could be wrong. Also totally happy to switch many of these fields from List<? extends Foo> to simple Foo[] - just let me know which style you want.
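(The distinction can be observed directly. On HotSpot, a non-capturing lambda evaluates to a cached instance, while a capturing lambda typically allocates a fresh object per evaluation; note this is observed JVM behavior, not a spec guarantee, so the identity checks below are HotSpot-specific.)

```java
import java.util.function.Supplier;

public class LambdaCapture {
  // captures nothing: on HotSpot the metafactory returns a cached instance
  static Supplier<Integer> nonCapturing() {
    return () -> 42;
  }

  // captures x: on HotSpot a new instance is created per evaluation
  static Supplier<Integer> capturing(int x) {
    return () -> x;
  }

  public static void main(String[] args) {
    System.out.println(nonCapturing() == nonCapturing()); // true on HotSpot
    System.out.println(capturing(7) == capturing(7));     // false on HotSpot
  }
}
```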


   BySpanKey(Set<SpanKey> spanKeys) {
-    this.spanKeys = spanKeys;
+    this.spanKeys = spanKeys.toArray(new SpanKey[spanKeys.size()]);
Contributor


Using new SpanKey[0] might be a bit more performant on JDK 8: https://shipilev.net/blog/2016/arrays-wisdom-ancients/
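(For reference, the two toArray forms being compared, shown with a stand-in String set rather than the actual SpanKey type; per the linked article, the empty-array form is as fast or faster on modern JVMs because toArray can allocate the correctly sized array internally.)

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

public class ToArrayStyles {
  public static void main(String[] args) {
    Set<String> spanKeys = new LinkedHashSet<>(Arrays.asList("server", "client", "producer"));

    // presized form: the caller sizes the array up front
    String[] presized = spanKeys.toArray(new String[spanKeys.size()]);

    // empty-array form: toArray allocates the right-sized array itself
    String[] viaEmpty = spanKeys.toArray(new String[0]);

    System.out.println(Arrays.equals(presized, viaEmpty)); // true
  }
}
```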

Member Author


Happy to change if you like, but note that this is not in the hot path and only happens per-Instrumenter at startup.

@johnbley
Member Author

johnbley commented Apr 1, 2024

Happy to change based on some of the feedback here; please just let me know what you think.

@trask
Member

trask commented Apr 1, 2024

can you add a new section after https://github.com/open-telemetry/opentelemetry-java-instrumentation/blob/main/docs/contributing/style-guideline.md#javautilstreamstream-usage to document the usage recommendations?

@johnbley
Member Author

johnbley commented Apr 2, 2024

> can you add a new section after https://github.com/open-telemetry/opentelemetry-java-instrumentation/blob/main/docs/contributing/style-guideline.md#javautilstreamstream-usage to document the usage recommendations?

I went with @laurit's recommendation and dropped the stream API usage entirely in favor of simple arrays.

     // operation listeners run after span start, so that they have access to the current span
     // for capturing exemplars
     long startNanos = getNanos(startTime);
-    for (OperationListener operationListener : operationListeners) {
-      context = operationListener.onStart(context, attributes, startNanos);
+    for (int i = 0; i < operationListeners.length; i++) {
Contributor

@laurit laurit Apr 3, 2024


Could leave this as for (OperationListener operationListener : operationListeners) {.
Though since doEnd needs to iterate in reverse order and has to use a counted loop anyway, using a counted loop here is also fine.
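(The symmetry in question, sketched with a stand-in OperationListener interface and method signatures simplified from the real ones: listeners run in order on start and in reverse order on end, and the reverse direction forces a counted loop regardless.)

```java
public class ListenerLoops {
  // illustrative stand-in; the real interface takes Context/Attributes parameters
  interface OperationListener {
    String onStart(String context);
    String onEnd(String context);
  }

  // forward order on start
  static String start(OperationListener[] listeners, String context) {
    for (int i = 0; i < listeners.length; i++) {
      context = listeners[i].onStart(context);
    }
    return context;
  }

  // reverse order on end; this direction requires an indexed loop
  static String end(OperationListener[] listeners, String context) {
    for (int i = listeners.length - 1; i >= 0; i--) {
      context = listeners[i].onEnd(context);
    }
    return context;
  }
}
```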

Member Author


Yeah, I kept it that way for the symmetry.

@trask
Member

trask commented Apr 4, 2024

@johnbley can you add a new section at the end of this page to mention avoiding iterator allocations on hot paths?

this helps us to have something to point people at for maintaining consistency

@github-actions github-actions bot requested a review from theletterf April 4, 2024 12:23
@johnbley
Member Author

johnbley commented Apr 4, 2024

> @johnbley can you ... mention avoiding iterator allocations on hot paths?

Took a stab at it, let me know what you think.

@trask
Member

trask commented Apr 4, 2024

> Took a stab at it, let me know what you think.

Thanks!

@trask trask merged commit b381b01 into open-telemetry:main Apr 4, 2024
49 checks passed
@johnbley johnbley deleted the reduce_iterator_allocations branch April 5, 2024 12:50