
Memory leak when creating objects in Ruby #10106

Closed
djudd opened this issue Jun 3, 2022 · 9 comments

@djudd

djudd commented Jun 3, 2022

What version of protobuf and what language are you using?
Version: 3.21.1
Language: Ruby

What operating system (Linux, Windows, ...) and version?
Linux, OS X

What runtime / compiler are you using (e.g., python version or gcc version)
Ruby 2.7

What did you do?
Steps to reproduce the behavior:

  1. Clone or download https://github.com/djudd/proto-leak-repro
  2. bundle install
  3. bundle exec repro.rb

What did you expect to see?
Creating and then throwing away Ruby Proto objects and then invoking the garbage collector should add at most trivial amounts of persistent memory usage in steady state.

What did you see instead?
If we populate a request object that contains a reference to a static object as well as an array of many transient objects, the process retains memory proportional to the many transient objects even after they are GC'd, at least as long as we retain a reference to the static object.

(This is a bit abstract but hopefully the linked script makes it clear.)

Anything else we should know about your project / environment?
The linked GitHub repository contains a ~minimal reproduction, but we first saw this over the course of hours in a staging environment, which is evidence against the symptoms shown in the script being merely a matter of not yet reaching steady-state memory usage while doing a lot of allocations, or anything along those lines.

The leak does not appear to be at the Ruby level; e.g., analyzing a heap dump with https://github.com/Shopify/heap-profiler shows no meaningful differences in what's present in the visible-to-Ruby heap, even when the leak reproduction is used to create a 2-4x increase in RSS.

Note that I've tested the reproduction on 3.21.1, which should include the fix in #9586.

@fowles
Member

fowles commented Jun 14, 2022

This is a known issue caused by the arena-fuse operation used when reparenting messages. The best workaround is to force a copy of static objects when you tie them into an outgoing request. I don't know offhand if the Ruby API exposes such a copy.

@ericsalo
Member

To the best of my knowledge there is no direct interface in Ruby for cloning a protobuf like that. Indirectly, you could make your own copy by first serializing and then deserializing the object in question, which, while somewhat inelegant, should at least fix the memory leak. I think.

@haberman
Member

There is Google::Protobuf.deep_copy(m):

def test_deep_copy
  m = proto_module::TestMessage.new(:optional_int32 => 42,
                                    :repeated_msg => [proto_module::TestMessage2.new(:foo => 100)])
  m2 = Google::Protobuf.deep_copy(m)
  assert m == m2
  assert m.repeated_msg == m2.repeated_msg
  assert m.repeated_msg.object_id != m2.repeated_msg.object_id
  assert m.repeated_msg[0].object_id != m2.repeated_msg[0].object_id
end

For shallow copies you can simply use #dup:

@ahayworth

ahayworth commented Aug 4, 2022

Hi @ericsalo @fowles - I brought this issue up to a coworker today, because we are experiencing a similar issue internally at Shopify. We may be able to use the workaround you mentioned.

However: are there any plans to fix this issue at all? Even though perhaps one shouldn't create and then throw away objects, strictly speaking this is something that should absolutely be supported in a garbage-collected language. If you take a closer look at the reproduction case that @djudd provided, you can see that they are using the ruby OpenTelemetry libraries, which I presume was intentional - and since I'm one of the people who work on that library, that means this issue is now even more important to me. 😝

Consider the tracing use case for a moment. In a high volume service, one could imagine how hundreds of thousands of objects could be allocated quite rapidly in the service of constructing spans that take part in the distributed trace - and one could easily imagine how such objects would then be unused (and ready for garbage collection) rather quickly. One could also imagine how many of those objects would be substantially similar - or at least built from substantially similar parts. Finally, one could easily see how a ruby programmer (expecting that things are in fact garbage collected) wouldn't think twice about relying on that assumption and creating objects willy-nilly. 😆

I somewhat suspect that @djudd is experiencing a performance problem related to using the ruby OpenTelemetry libraries, and as such it's probably not one they could easily fix since it's in a dependency. Of course, the ruby working group will look into using the workaround you mentioned - but I have to say as a programmer in a garbage-collected language I expect the dependencies I rely on to not hamstring the garbage collector to this extent. (Update: the ruby working group has fixed this particular problem already, although perhaps there is another problem lurking)

This is actually quite a "sharp edge" to find buried in an issue, and not front-and-center on the ruby protobuf documentation: there should be a big warning that one cannot create and throw away objects with this library because they may not be garbage collected, so that users can decide whether or not taking on the dependency is even worth it to them. The workaround should also accompany that warning, so that those who have little choice (for example, those of us who need to use the library for OpenTelemetry spec compliance 😆) would at least know about this pitfall and be able to design accordingly - so that we can spare our users that pain. I personally have experienced a lot of "sharp edges" with the ruby library, and I can say definitively that my employer has as well - and it's frustrating. 😢

Is this one going to be fixed?


@djudd - if you are experiencing protobuf-related performance issues in the ruby otel libraries, please come talk to us. We'd like to learn more. ❤️

@haberman
Member

haberman commented Aug 4, 2022

Hi Andrew,

However: are there any plans to fix this issue at all? Even though perhaps one shouldn't create and then throw away objects, strictly speaking this is something that should absolutely be supported in a garbage-collected language.

There's an important distinction here I want to be sure is clear.

I agree that creating and destroying many objects rapidly is an important use case, and the library is actually optimized for this use case. Code like this is efficient and will be precisely GC'd:

# Creates many objects.
msg = FooMessage.decode(enormous_payload)

# Frees them all at once, very efficiently.
msg = nil

In this example, the decode operation may parse hundreds or thousands of message objects off the wire. It will put them all into a single arena. When the messages are no longer used, the next GC pass will immediately free the entire arena, with no memory leak and fast performance.

The problem only arises when you are mixing objects that have very different lifetimes. In the repro at https://github.com/djudd/proto-leak-repro/blob/main/repro.rb, the pattern was essentially this:

long_lived_scope = Opentelemetry::Proto::Common::V1::InstrumentationScope.new.freeze

iters.times {
  short_lived_spans = Opentelemetry::Proto::Trace::V1::ScopeSpans.new(
    scope: long_lived_scope,  # This joins the two lifetimes together.
    spans: iters.times.map {Opentelemetry::Proto::Trace::V1::Span.new}
  )
}

When you join two objects like this, protobuf will join their lifetimes. If their lifetimes were similar to begin with, this is no problem. But if the lifetimes are very different, this will inhibit the collection of the short-lived objects.

I agree that this can be surprising, and our docs should cover this better. The workarounds are relatively simple, but users cannot use those workarounds unless they know about them.

there should be a big warning that one cannot create and throw away objects with this library because they may not be garbage collected

I think that would be too strong of a warning. For the most common patterns (e.g. request/response processing), users can create and throw away as many objects as they want, and GC will work as expected. It's only when joining protos with very different lifetimes that some caution is required.

Is this one going to be fixed?

This behavior is intrinsic to the arena-based design. We moved to the arena design in v3.15.0, 18 months ago, because it significantly reduced GC overhead for large messages. We had users who were experiencing significant problems with GC overhead for large message trees, and the new design fixed those problems.

I don't want to downplay the frustration you experienced, but I think some clear documentation of the problematic pattern and the (relatively simple) workarounds will go a long way. A small code change can fix the problem completely, if we can just explain to users what that code change is.

@haberman
Member

haberman commented Aug 4, 2022

One other thought: if OpenTelemetry users encounter this corner case more often than other users, due to the nature of how traces are often built from long-lived globals, it seems like a note about this in the OpenTelemetry docs for Ruby could also go a long way. Once we have added good docs for this on our side, your docs could link to our docs on the issue.

@djudd
Author

djudd commented Aug 4, 2022

@ahayworth you're correct that this is something I originally experienced with the OpenTelemetry libraries; we ran into this issue at Stripe while migrating our tracing client to OpenTelemetry.[0] Fortunately, it surfaced in a staging environment. I spent a week or two narrowing leak-like symptoms down to this issue; as @haberman suggests the fix was straightforward,[1] but it took some time to find, because of course I first assumed I'd introduced an actual leak at the Ruby level in code I'd written myself.

I think part of the footgun here is that the recommended pattern is the opposite of what you'd normally do for optimal performance in Ruby. I introduced the bug by, out of habit, initializing an InstrumentationScope object once at startup rather than re-initializing it for every request; I had no particular data to motivate this, but avoiding allocations is generally a reliable win for performance-sensitive Ruby code. The issue is then compounded by the fact that the reference from the static object to the transient objects via the arena is invisible to normal Ruby code, e.g. present even if you recursively freeze the static object.

I'm not sure what the right change would be at the library level here, if any, but one idea would be to expose an API for controlling arena behavior at the Ruby level, e.g. something like Google::Protobuf::Arena.with_arena {...} that overrides the automatic association of objects with arenas, if that's possible. Relatedly, I think there's a case to be made that arenas should be opt-in, since the failure mode without arenas is to be slow in a pretty normal and obvious way, while the failure mode with arenas is a bit subtler.

——
[0] More precisely, we migrated span export to use code generated directly from the Proto IDL; we didn't use the OpenTelemetry gem. Partly that was so that it would play nicely with our existing service discovery & auth tooling, and partly it was about the overhead of instrumentation; I'd be happy to talk more about this somewhere @ahayworth but it seems off topic for this issue.

[1] As far as I know. I left Stripe shortly after working on this so I can't say whether more related issues have come up as the rollout continued.

@fowles
Member

fowles commented Aug 4, 2022

I actually think we should investigate heuristic defenses against this at the runtime layer.

@ahayworth

@haberman thank you for the detailed explanation! That does bring a lot of clarity to the issue.

And yes - I'm glad you sensed my frustration (it's frustrating!) but we don't have to let my frustration leak all the way through to the big, scary warning. A much more modest and restrained warning would suffice. 😂

As @djudd mentioned, normally when optimizing Ruby code you look to reduce allocations, so I definitely agree that covering this in the docs will go a long way. I would much prefer that the surprising behavior simply didn't happen, but I understand why the arena allocation was introduced (and that's not an easy balance to strike).
