Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Speeding up v8 heap snapshots #702

Merged
merged 27 commits into from Jul 27, 2023
Merged
Show file tree
Hide file tree
Changes from 19 commits
Commits
Show all changes
27 commits
Select commit Hold shift + click to select a range
7d91e99
Speeding up v8 heap snapshots
jdapena Jun 20, 2023
485d04f
Apply suggestions from Rob Palmer
jdapena Jun 21, 2023
3dbf61b
Fix lint issues
jdapena Jun 22, 2023
b6860f4
Apply suggestions from Joyee Cheung
jdapena Jun 27, 2023
77c2744
More clarifications of Node.js specific parts
jdapena Jun 27, 2023
9eac65c
Added Joyee Cheung to the initial paragraph list of contributors
jdapena Jun 27, 2023
53d996a
More accurate description of --heapsnapshot-near-heap-limit implement…
jdapena Jun 28, 2023
a0518ff
More details of source position caching fix
jdapena Jun 28, 2023
f02741a
Reduce ambiguity of development and production JS code reference
jdapena Jun 28, 2023
bc0bcee
Rewrite description of V8 heap limit configuration.
jdapena Jun 28, 2023
91525cd
Apply suggestions from Joyee code review
jdapena Jul 11, 2023
69a958e
Remove unneeded reference from heap snapshot optimization post
jdapena Jul 11, 2023
1689958
Crop images to improve readability
jdapena Jul 11, 2023
3a1cdd9
Apply suggestions from @syg code review
jdapena Jul 12, 2023
91dc5a8
Apply suggestions from @syg in pull request.
jdapena Jul 12, 2023
3899164
Added @syg to the list of contributors in first paragraph
jdapena Jul 12, 2023
de70d28
Do not use specific names explaining the heap snapshot performance in…
jdapena Jul 12, 2023
3811c74
Remove sponsorship reference in heap optimizations post
jdapena Jul 12, 2023
c89f34f
Provided function snippet and value examples of new Hash function
jdapena Jul 13, 2023
239123a
speeding-up-heap-snapshots.md: apply suggestions from Joyee code review
jdapena Jul 18, 2023
3785dbe
Update speeding-up-v8-heap-snapshot.md: replace obscure reference to …
jdapena Jul 18, 2023
c603786
speeding-up-v8-snapshots.md: following recommendations from Joyee, si…
jdapena Jul 18, 2023
979a434
speeding-up-heap-snapshots.md: remove unused image
jdapena Jul 18, 2023
ca5bf19
speeding-up-v8-heap-snapshots.md: apply suggestions from @syg
jdapena Jul 19, 2023
bdd5b96
speeding-up-v8-heap-snapshots.md: manually apply @syg proposed changes
jdapena Jul 19, 2023
8f083a8
speeding-up-v8-heap-snapshots.md: update publication date
jdapena Jul 19, 2023
b065c43
speeding-up-v8-heap-snapshots.md: drop WPA session
jdapena Jul 26, 2023
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Jump to
Jump to file
Failed to load files.
Diff view
Diff view
Binary file added src/_img/speeding-up-v8-heap-snapshots/img1.png
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added src/_img/speeding-up-v8-heap-snapshots/img2.png
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
190 changes: 190 additions & 0 deletions src/blog/speeding-up-v8-heap-snapshots.md
@@ -0,0 +1,190 @@
---
title: 'Speeding up V8 heap snapshots'
description: 'This post about V8 heap snapshots presents some performance problems found by Bloomberg engineers, and how we fixed them to make JavaScript memory analysis faster than ever.'
author: 'Jose Dapena Paz'
date: 2023-06-23
tags:
- memory
- tools
---
*This blog post has been authored by José Dapena Paz (Igalia), with contributions from Jason Williams (Bloomberg), Ashley Claymore (Bloomberg), Rob Palmer (Bloomberg), Joyee Cheung (Igalia) and Shu-yu Guo (Google).*

In this post about V8 heap snapshots, I will talk about some performance problems found by Bloomberg engineers, and how we fixed them to make JavaScript memory analysis faster than ever.

## The Problem

Bloomberg engineers were working on diagnosing a memory leak in a JavaScript application. It was failing with *Out-Of-Memory* errors. For the tested application, the V8 heap limit was configured to be around 1400MB. Normally V8's garbage collector should be able to keep the heap usage under that limit, so the failures indicated that there was likely a leak.

A common technique to debug a routine memory leak scenario like this is to capture a heap snapshot first, then load it in the DevTools "Memory" tab and find out what is consuming the most memory by inspecting the various summaries and object attributes. In the DevTools UI, the heap snapshot can be taken in the "Memory" tab. For Node.js applications, the heap snapshot [can be triggered programmatically](https://nodejs.org/en/docs/guides/diagnostics/memory/using-heap-snapshot) using this API:

```javascript
require('v8').writeHeapSnapshot();
```

They wanted to capture several snapshots at different points in the application's life, so that DevTools Memory viewer could be used to show the difference between the heaps at different times. The problem was that capturing a single full-size (500MB) snapshot was taking **over 30 minutes** alone!
jdapena marked this conversation as resolved.
Show resolved Hide resolved

It was this slowness in the memory analysis workflow that we needed to solve.

## Narrowing the Problem

Then, Bloomberg engineers started investigating the issue using some V8 parameters. As described in the previous post, V8 has some nice command line parameters that can help with that. These options were used to create the heap snapshots, simplify the reproduction, and improve observability:
jdapena marked this conversation as resolved.
Show resolved Hide resolved

`--max-old-space-size=100`
: This limits the heap to 100 megabytes and helps to reproduce the issue much faster.

`--heapsnapshot-near-heap-limit=10`
jdapena marked this conversation as resolved.
Show resolved Hide resolved
: This is a Node.js specific command line parameter that tells Node.js to generate a snapshot each time it comes close to running out of memory. It is configured to generate up to 10 snapshots in total. This prevents thrashing where the memory-starved program spends a long time producing more snapshots than needed.

`--enable-etw-stack-walking`
: This allows tools such as ETW, WPA & xperf to see the JS stack which has been called in V8. (available in Node.js v20+)

`--interpreted-frames-native-stack`
: This flag is used in combination with tools like ETW, WPA & xperf to see the native stack when profiling. (available in Node.js v20+).

When V8 is close to the heap limit, it forces a garbage collection to reduce the memory usage and avoid hitting the limit. It also notifies the embedder that the heap is about to reach the memory limit. The `--heapsnapshot-near-heap-limit` flag in Node.js dumps a new heap snapshot upon notification. In the test case, the memory usage decreases, but, after several iterations, garbage collection ultimately can not free up enough space and so the application is terminated with an *Out-Of-Memory* error.
jdapena marked this conversation as resolved.
Show resolved Hide resolved

They took recordings using Windows Performance Analyzer (see below) in order to narrow down the issue. This revealed that most CPU time was being spent within the V8 Heap Explorer. Specifically, it took around 30 minutes just to walk through the heap to visit each node and collect the name. This didn’t seem to make much sense - why would recording the name of each property take so long?

This is when I was asked to take a look.
jdapena marked this conversation as resolved.
Show resolved Hide resolved

## Quantifying the Problem

The first step was adding support in V8 to better understand where time is spent during the capturing of heap snapshots. The capture process itself is split into two phases: generation, then serialization. We landed [this patch](https://chromium-review.googlesource.com/c/v8/v8/+/4428810) upstream to introduce a new command line flag `--profile_heap_snapshot` to V8, which enables logging of both the generation and serialization times.

Using this flag, we learned some interesting things!

First, we could observe the exact time the CPU was using for each snapshot. In our reduced test case, the first took 5 minutes, the second took 8 minutes, and each subsequent snapshot kept on taking longer and longer. Nearly all of this time was spent in the generation phase.
jdapena marked this conversation as resolved.
Show resolved Hide resolved

We also identified other widely-used JavaScript applications that demonstrated the same slowdown, in particular, running ESLint on TypeScript. The problem was not app-specific.
jdapena marked this conversation as resolved.
Show resolved Hide resolved

Furthermore, we found the problem happened on both Windows and Linux. The problem was also not platform-specific.

## Windows Performance Analyzer to the rescue
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure what this section adds to the article.


As the problem was initially reported on a Windows platform, I used [Windows Performance Toolkit](https://learn.microsoft.com/en-us/windows-hardware/test/wpt/), based on [ETW](https://learn.microsoft.com/en-us/windows-hardware/drivers/devtest/event-tracing-for-windows--etw-), for analysis. This is a powerful low-level expert tool to find out exactly what a program is doing on Windows.

To record the session, I followed these steps:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This section seems overly detailed to me. The upshot seems to be that there's a Windows-specific profiler that gave you good leads, which I think suffices. The meat of this is what you found out from those leads and how you optimized the code.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case I wanted to be explicit about the procedure used for the investigation, so other people can benefit from it investigating performance problems in V8 (and not only V8) scope.

Even when Windows Performance Toolkit documentation is not bad, the specific steps to investigate this V8 problem are not completely trivial. The actual solutions to the detected problems were not really complex, but identifying exactly where the overhead was has been more significant. And for that, the step-by-step looks valuable to me.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the upside of this is that this teaches people how to address similar issues and would help future contributors. But one down side is that this is very Windows-specific. Maybe we can save the instructions about the exact buttons to click, because the links to the Microsoft documentation are already more than enough for people who wants to get their hands dirty. And we can focus on what to look (so people who are using e.g. Linux perf can apply it to their tool as well) instead, and perhaps touch on V8/Node.js flags to use for lack of better documentation on them. So maybe something like:

To analyze the performance of heap snapshot generation, I followed [these steps](https://learn.microsoft.com/en-us/windows-hardware/test/wpt/wpr-how-to-topics#start-a-recording) to run the Windows Performance Recorder with the failing script using `NODE_OPTIONS="--max-old-space-size=100 --heapsnapshot-near-heap-limit=10 --profile-heap-snapshot`. I had to modify Node.js to accept `--profile-heap-snapshot` in `NODE_OPTIONS`, as it uses an allowlist to filter V8 flags that can be configured through the environment variable.

To make this more succinct, we can also move the Narrowing the Problem section (or just the explanation about the flags) here - the first step was adding the flag, which gave us a rough picture of the time spent on snapshot generation. In terms of the readers' understanding, ETW does not come into the picture until we start the second step, so we do not have to explain about the flags or introduce ETW/Windows performance toolkit before we finish talking about the first step (the fact that we used ETW before we added the flag was a less relevant background info for readers, we really only put them into good use in the second step) . It also seems that the Narrowing the Problem left a lot of gaps that are not filled until we reach this section, and when we finally reach here we have to again repeat some information mentioned before, so we might as well just merge them here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is still interesting to add how to use the specific Windows tooling. It is not trivial to take the right steps with it to achieve a good recording that is informative enough.

Main issue I see with what I wrote here is that, one thing is explaining how to use the tool for this specific problem, and another different one is being a step-by-step manual of how to use WPR. This post should refer to the link you propose for the "step-by-step" guide, and only add hints of the specific configuration used.

I will rewrite this accordingly.

The gaps in "Narrowing the problem" are intentional, as they are explaining the investigation process. We come from scattered facts to getting more information (in this case first with logging, then profiling) that allows to resolve the problem. I would prefer to keep the flow like that.


1. Opened [Windows Performance Recorder](https://learn.microsoft.com/en-us/windows-hardware/test/wpt/windows-performance-recorder) and selected CPU profiling, verbose detail level, general scenario, and file logging mode.

![](/_img/speeding-up-v8-heap-snapshots/img1.png){.no-darkening}


2. After that, I started the recording session (pressing the Start button).
3. Then, I executed the failing script with `NODE_OPTIONS="--max-old-space-size=100 --heapsnapshot-near-heap-limit=10 --profile-heap-snapshot`. I had to modify Node.js to accept `--profile-heap-snapshot` in `NODE_OPTIONS`, as it uses an allowlist to filter V8 flags that can be configured through the environment variable.
4. I just let it run to generate a couple of heap snapshots (it would already take over 10 minutes!) and then I stopped the recording.

## First Optimization: Improved StringsStorage hashing

When I opened the recording with [Windows Performance Analyzer](https://learn.microsoft.com/en-us/windows-hardware/test/wpt/windows-performance-analyzer), this was what I found:

![](/_img/speeding-up-v8-heap-snapshots/img2.png){.no-darkening}


One third of the samples was spent in `v8::internal::StringsStorage::GetEntry`:

```javascript
jdapena marked this conversation as resolved.
Show resolved Hide resolved
181 base::HashMap::Entry* StringsStorage::GetEntry(const char* str, int len) {
182 uint32_t hash = ComputeStringHash(str, len);
183 return names_.LookupOrInsert(const_cast<char*>(str), hash);
184 }
```

Because this was run with a release build, the information of the inlined function calls were folded into `StringsStorage::GetEntry()`. To figure out exactly how much time the inlined function calls were taking, I added the "Source Line Number" column to the breakdown and found that most of the time was spent on line 182, which was a call to `ComputeStringHash()`:

![](/_img/speeding-up-v8-heap-snapshots/img3.png){.no-darkening}

So over 30% of the snapshot generation time was spent on `ComputeStringHash()`, but why?

Let’s first talk about `StringsStorage`. Its purpose is to store a unique copy of all the strings that will be used in the heap snapshot. For fast access and avoiding duplicates, this class uses a flatmap: a hashmap backed by an array, where collisions are handled by storing elements in the next position of the array.
jdapena marked this conversation as resolved.
Show resolved Hide resolved

I started to suspect that the problem could be caused by collisions, which could lead to long searches in the array. But I needed to prove it. So I added exhaustive logs to see the generated hash keys and, on insertion, see how far it was between the expected position calculated from the hash key and the actual position the entry ended up in due to collisions.
jdapena marked this conversation as resolved.
Show resolved Hide resolved

In the logs, things were… not right: the offset of many items was over 20, and in the worst case, in the order of thousands!

Part of the problem was caused by the scripts using strings for lots of numbers - and especially a big range of numbers from 1 to several hundreds without gaps. The hash key algorithm had two implementations, one for numbers and another for other strings. While the string hash function was quite classical, the implementation for the numbers would basically return the value of the number prefixed by the number of digits:
jdapena marked this conversation as resolved.
Show resolved Hide resolved

```
ValueBits := 24
Mask := (1 << ValueBits) - 1 /* 0xffffff */
OriginalHash(x) := (digits(x) << ValueBits) | (x & Mask)
```

| x | OriginalHash(x) |
| -: | -: |
| 0 | `0x1000000` |
| 1 | `0x1000001` |
| 2 | `0x1000002` |
| 3 | `0x1000003` |
| 10 | `0x200000a` |
| 11 | `0x200000b` |
| 100 | `0x3000064` |

This function was problematic. Some examples of problems with this hash function:

- Once we inserted a string whose hash key value was a small number, we would run into collisions when we tried to store another number in that location, and there would be similar collisions if we tried to store subsequent numbers consecutively.
- Or even worse: if there were already a lot of consecutive numbers stored in the map, and we wanted to insert a string whose hash key value was in that range, we had to move the entry along all the occupied locations to find a free location.

What did I do to fix it? As the problem comes mostly from numbers represented as strings that would fall in consecutive positions, I modified the hash function so we would rotate the resulting hash value 2 positions to the left.
jdapena marked this conversation as resolved.
Show resolved Hide resolved

```
NewHash(x) := OriginalHash(x) << 2
``````
jdapena marked this conversation as resolved.
Show resolved Hide resolved

| x | OriginalHash(x) | NewHash(x) |
| -: | -: | -: |
| 0 | `0x1000000` | `0x4000000` |
| 1 | `0x1000001` | `0x4000004` |
| 2 | `0x1000002` | `0x4000008` |
| 3 | `0x1000003` | `0x400000c` |
| 10 | `0x200000a` | `0x8000028` |
| 11 | `0x200000b` | `0x800002c` |
jdapena marked this conversation as resolved.
Show resolved Hide resolved

So, for each number, we would introduce 3 free positions. Why rotating 2 bits? Empirical testing across several work-sets showed this number was the best choice to minimize collisions.
jdapena marked this conversation as resolved.
Show resolved Hide resolved

[This hashing fix](https://chromium-review.googlesource.com/c/v8/v8/+/4428811) has landed in V8.

## Second Optimization: Caching source positions

After fixing the hashing, we re-profiled and found a further optimization opportunity that would reduce a significant part of the overhead.

For each allocation in the V8 heap, the heap snapshot tries to record the call stack responsible for the allocation. Therefore, for each stack frame, it needs to know the function name and its source location (filename, line number, column number). It was the fetching of this information that turned out to be super slow!
jdapena marked this conversation as resolved.
Show resolved Hide resolved

What was happening? It was something similar to [what I fixed in the ETW stack walk](https://chromium-review.googlesource.com/c/v8/v8/+/3669698) and that I explained in [this post](https://blogs.igalia.com/dape/2022/12/21/native-call-stack-profiling-3-3-2022-work-in-v8/).
jdapena marked this conversation as resolved.
Show resolved Hide resolved

To compute the source code lines of a function, V8 knows the linear position of the function in the script. But, to find the source line, it needs to traverse the whole script to identify where each newline occurs. This is expensive.
jdapena marked this conversation as resolved.
Show resolved Hide resolved

When requesting line information, V8 already implements caching of the source line positions per-script in method `Script::InitLineEnds()`. That information is stored in each script using a new heap object. Unfortunately, the snapshot implementation cannot modify the heap when traversing it, so the newly calculated line information cannot be cached.
jdapena marked this conversation as resolved.
Show resolved Hide resolved

The solution? Before generating the heap snapshot, we now iterate over all the scripts in the V8 context to compute and cache the source line positions. As this is not done when we traverse the heap for heap snapshot generation, it is still possible to modify the heap and store the source line positions as a cache.
jdapena marked this conversation as resolved.
Show resolved Hide resolved

[This source position caching fix](https://chromium-review.googlesource.com/c/v8/v8/+/4538766) has also landed in V8.
jdapena marked this conversation as resolved.
Show resolved Hide resolved

## Did we make it fast?

After enabling both fixes, we re-profiled. Both of our fixes only affect snapshot generation time, so, as expected, snapshot serialization times were unaffected.

When operating on a JS program containing...

- Development JS, generation time is **50% faster** 👍
- Production JS, generation time is **90% faster** 😮

Why was there a massive difference between production and development code? The production code is optimized using bundling and minification, so there are fewer JS files, and these files tend to be large. It takes longer to calculate source lines positions for these large files, so they benefit the most when we can cache the source position and avoid repeating calculations.

The optimizations were validated on both Windows and Linux target environments.

For the particularly challenging problem originally faced by the Bloomberg engineers, the total end-to-end time to capture a 100MB snapshot was reduced from a painful 10 minutes down to a very pleasant 6 seconds. That is **a 100x win!** 🔥

The optimizations are generic wins that we expect to be widely applicable to anyone performing memory debugging on V8, Node.js, and Chromium. These wins were shipped in V8 11.5.130, which means they are found in Chromium 115.0.5576.0. We look forward to Node.js gaining these optimizations in the next semver-major release.

## What’s next?

First, it would be useful for Node.js to accept the new `--profile-heap-snapshot` flag in `NODE_OPTIONS`. In some use cases, users cannot control the the command line options passed to Node.js directly and have to configure them through the environment variable `NODE_OPTIONS`. Today, Node.js filters V8 command line options set in the environment variable, and only allows a known subset, which could make it harder to test new V8 flags in Node.js, as what happened in our case.

Information accuracy in snapshots can be improved further. Today, each script source code line information is stored in a representation in the V8 heap itself. And that’s a problem because we want to measure the heap precisely without the performance measurement overhead affecting the subject we are observing. Ideally, we would store the cache of line information outside the V8 heap in order to make heap snapshot information more accurate.

Finally, now that we improved the generation phase, the biggest cost is now the serialization phase. Further analysis may reveal new optimization opportunities in serialization.

## Credits

This was possible thanks to the work of [Igalia](https://www.igalia.com) and [Bloomberg](https://techatbloomberg.com) engineers.