Memory leak when creating objects in Ruby #10106
This is a known issue caused by the arena-fuse operation that reparenting performs. The best workaround is to force a copy of static objects when you tie them into an outgoing request. I don't know offhand whether the Ruby API exposes such a copy.
To the best of my knowledge there is no direct interface in Ruby for cloning a protobuf like that. Indirectly you could make your own copy by first serializing and then deserializing the object in question, which, while somewhat inelegant, should at least fix the memory leak. I think.
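The serialize-then-deserialize workaround might look like the following sketch. `Foo` is a hypothetical generated message class (not from this thread), and so that the example runs without the google-protobuf gem, the runnable part demonstrates the same roundtrip pattern with Ruby's built-in Marshal on a plain Struct:

```ruby
# With a generated protobuf message class (hypothetical `Foo`), a detached
# copy would be:
#   detached = Foo.decode(Foo.encode(original))
# The copy lives on a fresh arena, so tying it into a request does not
# extend the original's lifetime.
#
# The same serialize-then-deserialize roundtrip, shown with Ruby's built-in
# Marshal so it runs without the google-protobuf gem:
Scope = Struct.new(:name, :version)

original = Scope.new("my-lib", "1.0")
detached = Marshal.load(Marshal.dump(original))

detached.name = "mutated"
original.name # still "my-lib": the roundtrip produced an independent object
```

The key property is that the copy shares no hidden ownership with the original, so mutating (or retaining) one has no effect on the other's lifetime.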
There is a relevant example in protobuf/ruby/tests/common_tests.rb, lines 692 to 700 (at 8e7f936). For shallow copies you can simply use the approach in protobuf/ruby/tests/common_tests.rb, line 922 (at 8e7f936).
Hi @ericsalo @fowles - I brought this issue up to a coworker today, because we are experiencing a similar issue internally at Shopify. We may be able to use the workaround you mentioned. However: are there any plans to fix this issue at all? Even though perhaps one shouldn't create and then throw away objects, strictly speaking this is something that should absolutely be supported in a garbage-collected language.

If you take a closer look at the reproduction case that @djudd provided, you can see that they are using the Ruby OpenTelemetry libraries, which I presume was intentional - and since I'm one of the people who work on that library, this issue is now even more important to me. 😝

Consider the tracing use case for a moment. In a high-volume service, one could imagine how hundreds of thousands of objects could be allocated quite rapidly in the service of constructing spans that take part in a distributed trace - and one could easily imagine how such objects would become unused (and ready for garbage collection) rather quickly. One could also imagine how many of those objects would be substantially similar - or at least built from substantially similar parts. Finally, one could easily see how a Ruby programmer (expecting that things are in fact garbage collected) wouldn't think twice about relying on that assumption and creating objects willy-nilly. 😆

I somewhat suspect that @djudd is experiencing a performance problem related to using the Ruby OpenTelemetry libraries, and as such it's probably not one they could easily fix, since it's in a dependency. Of course, the Ruby working group will look into using the workaround you mentioned - but I have to say, as a programmer in a garbage-collected language, I expect the dependencies I rely on not to hamstring the garbage collector to this extent. (Update: the Ruby working group has fixed this particular problem already, although perhaps there is another problem lurking.)

This is actually quite a "sharp edge" to find buried in an issue rather than front-and-center in the Ruby protobuf documentation: there should be a big warning that one cannot create and throw away objects with this library because they may not be garbage collected, so that users can decide whether taking on the dependency is even worth it to them. The workaround should also be part of that warning, so that those who have little choice (for example, those of us who need to use the library for OpenTelemetry spec compliance 😆) would at least know about this pitfall and be able to design accordingly - so that we can spare our users that pain.

I personally have experienced a lot of "sharp edges" with the Ruby library, and I can say definitively that my employer has as well - and it's frustrating. 😢 Is this one going to be fixed?

@djudd - if you are experiencing protobuf-related performance issues in the Ruby otel libraries, please come talk to us. We'd like to learn more. ❤️
Hi Andrew,
There's an important distinction here I want to be sure is clear. I agree that creating and destroying many objects rapidly is an important use case, and the library is actually optimized for this use case. Code like this is efficient and will be precisely GC'd:

```ruby
# Creates many objects.
msg = FooMessage.decode(enormous_payload)
# Frees them all at once, very efficiently.
msg = nil
```

In this example, the entire decoded tree shares a single lifetime, so it can be freed together efficiently.

The problem only arises when you are mixing objects that have very different lifetimes. In the repro at https://github.com/djudd/proto-leak-repro/blob/main/repro.rb, the pattern was essentially this:

```ruby
long_lived_scope = Opentelemetry::Proto::Common::V1::InstrumentationScope.new.freeze
iters.times {
  short_lived_spans = Opentelemetry::Proto::Trace::V1::ScopeSpans.new(
    scope: long_lived_scope, # This joins the two lifetimes together.
    spans: iters.times.map { Opentelemetry::Proto::Trace::V1::Span.new }
  )
}
```

When you join two objects like this, protobuf will join their lifetimes. If their lifetimes were similar to begin with, this is no problem. But if the lifetimes are very different, this will inhibit the collection of the short-lived objects. I agree that this can be surprising, and our docs should cover this better. The workarounds are relatively simple, but users cannot use those workarounds unless they know about them.
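The lifetime-joining behavior can be modeled in plain Ruby. The sketch below is purely illustrative - `ToyMessage` and its Array-backed "arena" are made up for this example, not the real implementation - but it shows why payloads of short-lived messages survive once their arena is fused into a long-lived one:

```ruby
# Toy model of arena fusion (illustration only, not the real implementation).
# Each message starts with its own arena (modeled as an Array of payloads);
# linking a child into a parent fuses the arenas, so the child's payloads
# live exactly as long as the parent does.
class ToyMessage
  attr_accessor :arena

  def initialize(payload)
    @arena = [payload] # a fresh arena holding this message's allocation
  end

  def adopt(child)
    @arena.concat(child.arena) # fuse: child's allocations join our arena
    child.arena = @arena
  end
end

long_lived = ToyMessage.new("scope")

10.times do
  short_lived = ToyMessage.new("x" * 1_000)
  long_lived.adopt(short_lived)
  # `short_lived` becomes unreachable at the end of this iteration, but its
  # 1 KB payload is still referenced from long_lived's fused arena.
end

long_lived.arena.size # 11 payloads retained: the scope's plus ten "freed" spans
```

In this model, the copy workaround corresponds to adopting a fresh `ToyMessage.new(payload)` instead of the original, so the long-lived arena never fuses with the short-lived one.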
I think that would be too strong a warning. For the most common patterns (e.g. request/response processing), users can create and throw away as many objects as they want, and GC will work as expected. It's only when joining protos with very different lifetimes that some caution is required.
This behavior is intrinsic to the arena-based design. We moved to the arena design in v3.15.0, 18 months ago, because it significantly reduced GC overhead for large messages. We had users who were experiencing significant problems with GC overhead for large message trees, and the new design fixed those problems. I don't want to downplay the frustration you experienced, but I think some clear documentation of the problematic pattern and the (relatively simple) workarounds will go a long way. A small code change can fix the problem completely, if we can just explain to users what that code change is.
One other thought: if OpenTelemetry users encounter this corner case more often than other users, due to the nature of how traces are often built from long-lived globals, it seems like a note about this in the OpenTelemetry docs for Ruby could also go a long way. Once we have added good docs for this on our side, your docs could link to our docs on the issue.
@ahayworth you're correct that this is something I originally experienced with the OpenTelemetry libraries; we ran into this issue at Stripe while migrating our tracing client to OpenTelemetry.[0] Fortunately, it surfaced in a staging environment. I spent a week or two narrowing leak-like symptoms down to this issue; as @haberman suggests, the fix was straightforward,[1] but it took some time to find, because of course I first assumed I'd introduced an actual leak at the Ruby level in code I'd written myself.

I think part of the footgun here is that the recommended pattern is the opposite of what you'd normally want to do for optimal performance in Ruby. I introduced the bug by, out of habit, initializing an InstrumentationScope object once at startup rather than re-initializing it for every request; I had no particular data to motivate this, but avoiding allocations is generally a pretty reliable win for performance-sensitive Ruby code. The issue is then compounded by the fact that the reference from the static object to the transient objects via the arena is invisible to normal Ruby code, appearing even if you recursively freeze the static object.

I'm not sure what the right change would be at the library level here, if any, but one idea would be to expose an API to control the arena behavior at the Ruby level, e.g. support something like ——

[1] As far as I know. I left Stripe shortly after working on this, so I can't say whether more related issues have come up as the rollout continued.
I actually think we should investigate heuristic defenses against this at the runtime layer.
@haberman thank you for the detailed explanation! That does bring a lot of clarity to the issue. And yes - I'm glad you sensed my frustration (it's frustrating!), but we don't have to let my frustration leak all the way through to the big, scary warning. A much more modest and restrained warning would suffice. 😂 As @djudd mentioned, normally when optimizing Ruby code you look to reduce allocations, so I definitely agree that covering this in the docs will go a long way. I would much prefer that the surprising behavior simply didn't happen, but I understand why the arena allocation was introduced (and that's not an easy balance to strike).
What version of protobuf and what language are you using?
Version: 3.21.1
Language: Ruby
What operating system (Linux, Windows, ...) and version?
Linux, OS X
What runtime / compiler are you using (e.g., python version or gcc version)
Ruby 2.7
What did you do?
Steps to reproduce the behavior:
bundle install
bundle exec repro.rb
What did you expect to see?
Creating and throwing away Ruby proto objects, then invoking the garbage collector, should add at most a trivial amount of persistent memory usage in steady state.
What did you see instead?
If we populate a request object which contains a reference to a static object as well as an array of many transient objects, the process retains memory proportional to the many transient objects even after they are GC'd, at least as long as we retain a reference to the static object.
(This is a bit abstract but hopefully the linked script makes it clear.)
Anything else we should know about your project / environment
The linked GitHub repository contains a ~minimal reproduction, but we first saw this over the course of hours in a staging environment, which is evidence against the symptoms shown in the script being merely a matter of not yet reaching steady-state memory usage under heavy allocation, or anything along those lines.
The leak does not appear to be at the Ruby level, e.g. analysis of a heap dump with https://github.com/Shopify/heap-profiler does not show meaningful differences in what's present in the visible-to-Ruby heap, even when the leak reproduction is used to create a 2-4x increase in RSS.
Note that I've tested the reproduction on 3.21.1 which should include the fix in #9586
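For anyone reproducing this: a quick way to confirm that retention is happening below the Ruby heap (a cruder version of the heap-profiler analysis mentioned above) is to compare Ruby's own live-slot count against process RSS. This is a generic sketch, not code from the repro; the /proc parsing is Linux-specific.

```ruby
# If RSS keeps growing while Ruby's live-slot count stays roughly flat,
# the retained memory is held at the native (C extension) level and is
# invisible to Ruby-level heap profilers.

def ruby_live_slots
  GC.start
  GC.stat(:heap_live_slots) # live object slots in the Ruby heap
end

def rss_kb
  # Linux-specific: resident set size in kB from the proc filesystem.
  File.read("/proc/self/status")[/VmRSS:\s*(\d+)/, 1].to_i
rescue Errno::ENOENT
  nil # /proc not available (e.g. macOS)
end

before = ruby_live_slots
junk = Array.new(10_000) { "x" * 100 } # allocate visibly to Ruby
after = ruby_live_slots
# `after - before` tracks the Ruby-visible allocations; in the leak repro,
# RSS grows without a corresponding growth in live slots.
junk = nil
```

Sampling both numbers before and after the repro loop makes the native-level retention obvious without a full heap dump.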