Another Out-Of-Memory Error on Poseidon #453

Closed · MrSerth opened this issue Sep 17, 2023 · 14 comments
Labels: bug (Something isn't working), deployment (Everything related to our production environment)

Comments

@MrSerth (Member) commented Sep 17, 2023

We are experiencing another out-of-memory error on Poseidon that should be investigated further.

I don't have many details to share yet. The issue first appeared on Sunday, September 17th, with release 39fc0f9. After rolling back to the previous commit 68cd8f4 (around 3:30pm CEST), the issue did not occur again. This could point to one of the dependencies, but I am not sure about that either.

[Attached screenshot: Bildschirmfoto 2023-09-17 um 15 40 43]
MrSerth added the bug and deployment labels on Sep 17, 2023
@MrSerth (Member, Author) commented Sep 18, 2023

We had another OOM kill this morning (Sep 18 06:38:45). See more details on the host with cat /var/log/syslog | grep -A 50 -B 50 OOM.

@mpass99 (Collaborator) commented Sep 19, 2023

Unfortunately, Poseidon's log does not really help identify the memory leak. During the OOM kill, three actions were happening:

  1. One running execution (next to 60338 others that day)
  2. One runner being stopped due to an inactivity timer (which also happened 5136 other times that day without problems)
  3. One allocation event being handled (23944 times that day)

Accordingly, I was also not able to reproduce the issue in the staging environment.
#457 might help identify the leak.
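For context, the kind of in-process instrumentation that could help here is a periodic memory snapshot. A minimal sketch (purely illustrative and hypothetical, not the actual change proposed in #457) could log the Go runtime's memory statistics so that the last log lines before an OOM kill show how the heap developed:

package main

import (
        "log"
        "runtime"
        "time"
)

// watchMemory periodically logs a snapshot of the Go runtime's memory
// statistics; the last entries before an OOM kill indicate how the heap grew.
func watchMemory(interval time.Duration) {
        ticker := time.NewTicker(interval)
        defer ticker.Stop()
        for range ticker.C {
                var m runtime.MemStats
                runtime.ReadMemStats(&m)
                log.Printf("heap_alloc=%d MiB, heap_objects=%d, goroutines=%d",
                        m.HeapAlloc/1024/1024, m.HeapObjects, runtime.NumGoroutine())
        }
}

func main() {
        go watchMemory(30 * time.Second)
        select {} // stand-in for the actual service loop
}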

@MrSerth (Member, Author) commented Sep 19, 2023

Unfortunately, Poseidon's Log does not really help identify the memory leak.

I feared that, but thanks for looking into it nevertheless. Besides the PR #457 you've just created, do we have a chance to analyze a core dump or a (performance) profile? I am still hoping it would be possible to get a memory dump just before the process is OOM-killed, which might further assist us in debugging the cause.
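For what it's worth, one way to get such a snapshot without a core dump is Go's built-in heap profiler. A minimal sketch, assuming we are fine with triggering the dump manually via a signal (the file path is arbitrary):

package main

import (
        "log"
        "os"
        "os/signal"
        "runtime/pprof"
        "syscall"
)

// writeHeapProfile dumps the current heap profile to path so that a snapshot
// can be taken shortly before memory usage becomes critical.
func writeHeapProfile(path string) {
        f, err := os.Create(path)
        if err != nil {
                log.Printf("cannot create heap profile: %v", err)
                return
        }
        defer f.Close()
        if err := pprof.Lookup("heap").WriteTo(f, 0); err != nil {
                log.Printf("cannot write heap profile: %v", err)
        }
}

func main() {
        // Dump a profile whenever the process receives SIGUSR1.
        signals := make(chan os.Signal, 1)
        signal.Notify(signals, syscall.SIGUSR1)
        go func() {
                for range signals {
                        writeHeapProfile("/tmp/poseidon-heap.pprof")
                }
        }()
        select {} // stand-in for the actual service
}

The written file can then be inspected with go tool pprof to see which allocation sites are holding the memory.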

@mpass99 (Collaborator) commented Sep 20, 2023

I haven't found a mechanism that automatically produces a core dump for OOM-killed processes [1].

Anyway, our custom memory dump (#457) identified the cause: Sentry Profiling [2] [3].
To be precise, it is the calculation of the sample buckets when finishing a span [4] (82a00ab2).

buckets = make([]*profileSamplesBucket, 0, int64((relativeEndNS-relativeStartNS)/uint64(profilerSamplingRate.Nanoseconds()))+1)
for start.Value != nil {
        var bucket = start.Value.(*profileSamplesBucket)
        // Stop once we reach samples that were collected before the span started.
        if bucket.relativeTimeNS < relativeStartNS {
                break
        }
        samplesCount += len(bucket.goIDs)
        buckets = append(buckets, bucket)
        // Walk the ring of sample buckets backwards.
        start = start.Prev()
}
My guess would be that the bug is the following: start is a ring structure covering the last 30s of samples. Our execution was a few milliseconds longer than those 30s, so every slot in the ring is filled. Since this loop does not clear the data it visits, neither break condition ever fires, and it may loop endlessly around the ring with start.Prev() 🤷
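To illustrate the suspicion with a self-contained sketch (standard container/ring, not Sentry's actual code): once every slot of the ring holds a value, a walk that only checks the stored values never stops on its own; remembering the element we started from is one way to bound it to a single revolution.

package main

import (
        "container/ring"
        "fmt"
)

func main() {
        // A fully populated ring: no slot is ever nil, so a condition like
        // "start.Value != nil" can never end the iteration by itself.
        r := ring.New(4)
        for i := 0; i < r.Len(); i++ {
                r.Value = i
                r = r.Next()
        }

        // Without an additional guard, walking backwards with Prev() would
        // cycle forever. Tracking the starting element bounds the walk to one
        // full revolution.
        start := r
        for node := r; ; node = node.Prev() {
                fmt.Println(node.Value)
                if node.Prev() == start {
                        break
                }
        }
}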

What could we do?

  1. Report this issue to Sentry
  2. (Decrease the maximum execution duration by a few seconds)
  3. Disable the profile sampling (profilessamplerate: 0.0)

@sentry-io bot commented Sep 20, 2023

Sentry issue: POSEIDON-4H

@MrSerth (Member, Author) commented Sep 20, 2023

Ah, thanks for investigating here! I just went ahead and disabled the profile sampling (by setting profilessamplerate: 0.0). With this, we should be able to confirm our assumption about the root cause. If this doesn't help, we could also consider downgrading the Sentry dependency to narrow down the cause.
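For reference, on the SDK side this corresponds roughly to the following sentry-go initialization (a sketch with a placeholder DSN; the assumption is that Poseidon's profilessamplerate setting maps to the ProfilesSampleRate client option):

package main

import (
        "log"

        "github.com/getsentry/sentry-go"
)

func main() {
        // Tracing stays enabled, but profiling is switched off by setting the
        // profiles sample rate to zero; 1.0 would profile every sampled
        // transaction again.
        err := sentry.Init(sentry.ClientOptions{
                Dsn:                "https://examplePublicKey@o0.ingest.sentry.io/0", // placeholder
                EnableTracing:      true,
                TracesSampleRate:   1.0,
                ProfilesSampleRate: 0.0,
        })
        if err != nil {
                log.Fatalf("sentry.Init: %v", err)
        }
}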

@MrSerth (Member, Author) commented Sep 20, 2023

It seems like the issue hasn't occurred again; that's good! 🥳

Nevertheless, I was able to confirm that it is indeed a reproducible problem. Using CodeOcean and our Python 3.8 exercise (with profilessamplerate: 1.0 in Poseidon), I observed the issue locally: running the attached drawing.py multiple times caused the bug to appear. It is still not easy to trigger; roughly 1 out of 10 (or 1 out of 7) runs caused the issue for me.

@mpass99 (Collaborator) commented Sep 23, 2023

You're right, thanks for drawing further attention to that. With the following configuration, I was able to reproduce the behavior (10/10):

  • profilessamplerate: 1.0
  1. Provide a runner {"inactivityTimeout": 600,"executionEnvironmentId": 1}
  2. Create an execution with a time limit of 35s and let it exceed that limit {"command": "sleep 40","environment": {},"timeLimit": 35}
  3. Connect to the execution.

This creates the described memory leak in Sentry's profiler.go. Furthermore, it even makes it easy to attach a debugger. As expected, neither break condition is triggered and the loop runs endlessly.
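For convenience, these steps can also be scripted. In the following sketch the base URL, the endpoint paths, and the Poseidon-Token header are assumptions about the API, while the JSON payloads are exactly the ones listed above:

package main

import (
        "fmt"
        "net/http"
        "strings"
)

// post sends a JSON body to the given URL; the Poseidon-Token header is an
// assumed authentication mechanism for this sketch.
func post(client *http.Client, url, body string) {
        req, err := http.NewRequest(http.MethodPost, url, strings.NewReader(body))
        if err != nil {
                panic(err)
        }
        req.Header.Set("Content-Type", "application/json")
        req.Header.Set("Poseidon-Token", "SECRET")
        resp, err := client.Do(req)
        if err != nil {
                panic(err)
        }
        defer resp.Body.Close()
        fmt.Println(url, resp.Status)
}

func main() {
        const base = "http://localhost:7200/api/v1" // assumed address and path prefix
        client := &http.Client{}

        // 1. Provide a runner.
        post(client, base+"/runners", `{"inactivityTimeout": 600,"executionEnvironmentId": 1}`)

        // 2. Create an execution that exceeds its 35s time limit. The runner ID
        // returned by the first request has to be filled in manually here.
        runnerID := "RUNNER_ID"
        post(client, base+"/runners/"+runnerID+"/execute", `{"command": "sleep 40","environment": {},"timeLimit": 35}`)

        // 3. Connect to the WebSocket URL returned by the execute call to start
        // the execution (omitted here; any WebSocket client will do).
}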

@MrSerth (Member, Author) commented Sep 23, 2023

Awesome, it's great to hear we have these simple reproduction steps, at least with Poseidon. Can we add a regression test for that, even when no DSN is set? This would at least allow us to identify the issue in the future.

And of course, we should probably now go one step further to avoid a memory leak even when the profiler is being used.

@MrSerth (Member, Author) commented Sep 25, 2023

Today, we discussed this issue again. For now, we won't add a regression test, since this would mainly cover code originally written in the Sentry dependency. Nevertheless, we agreed to create a bug report (ideally with a minimal reproduction example independent of Poseidon) for the Sentry team to take a look at this issue.
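A minimal reproduction along these lines might be as small as a single transaction that outlives the profiler's 30s sample window. A sketch, assuming the EnableTracing/TracesSampleRate/ProfilesSampleRate client options and a placeholder DSN:

package main

import (
        "context"
        "time"

        "github.com/getsentry/sentry-go"
)

func main() {
        // Fully sample traces and profiles so that every transaction is profiled.
        err := sentry.Init(sentry.ClientOptions{
                Dsn:                "https://examplePublicKey@o0.ingest.sentry.io/0", // placeholder
                EnableTracing:      true,
                TracesSampleRate:   1.0,
                ProfilesSampleRate: 1.0,
        })
        if err != nil {
                panic(err)
        }

        // A transaction that runs longer than the profiler's 30s ring of sample
        // buckets; on the affected sentry-go versions, Finish() then spins in the
        // bucket-collection loop quoted above.
        transaction := sentry.StartTransaction(context.Background(), "oom-reproduction")
        time.Sleep(35 * time.Second)
        transaction.Finish()
}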

@mpass99 (Collaborator) commented Oct 1, 2023

Submitted as getsentry/sentry-go#724

@mpass99 (Collaborator) commented Oct 4, 2023

The newly published release 0.25.0 resolves this issue (tested with both the minimal example and our Poseidon reproduction). Therefore, we can think about updating this dependency and enabling profile sampling again.
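For the update itself, something along these lines should suffice (assuming the usual module path), together with raising profilessamplerate again in the deployed configuration:

go get github.com/getsentry/sentry-go@v0.25.0
go mod tidy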

@MrSerth (Member, Author) commented Oct 4, 2023

Yes, please. Let's do so!

@mpass99 (Collaborator) commented Oct 6, 2023

Done with 3d87cfc and codeocean-terraform#8cbc9183
