
Page cache memory usage regression with gin-gonic after 0.21->0.22 upgrade #673

Closed
vitaliyf opened this issue Jul 24, 2023 · 12 comments

@vitaliyf

Summary

After upgrading sentry-go from 0.21.0 to 0.22.0 (verified by downgrading), we found that our small gin-gonic-based service is "leaking" page cache memory.

Steps To Reproduce

I don't have an easy way to provide a code snippet; let me know if this is enough to point in the direction of the problem. If not, I will try to come up with more details.

We have:

  • A small REST API service using gin-gonic 0.8.2 and Go 1.20.6, running inside an Ubuntu 22.04 container
  • It accepts POSTs of file uploads (zip files, up to hundreds of megabytes each)
  • The code then looks at c.Request.MultipartForm.File and processes each file (unzips it to see what's inside)
  • The contents of each file are then uploaded to object storage, and the file is deleted

After the sentry-go 0.22 upgrade (verified by reverting back to 0.21 with no other changes), we noticed that the metric for "page cache" (container_memory_cache in Kubernetes) keeps going up (into several gigabytes) as POST requests are processed. Before the upgrade, the usage would spike briefly for each request and then immediately drop back to almost nothing.

Our theory is that one of the new gin integration features added in 0.22 is somehow intercepting the request lifecycle in a way that changes what happens to the temporary file(s) gin creates to handle the large POST body.

Expected Behavior

Before the upgrade, the size of the page cache did not increase.

Screenshots

Sentry.io Event

Environment

SDK

  • sentry-go version: 0.22.0
  • Go version: 1.20.6
  • Using Go Modules? yes

Sentry

  • Using hosted Sentry in sentry.io? no
  • Using your own Sentry installation? Version: 23.3.1
  • Anything particular to your environment that could be related to this issue?

Additional context

@cleptric
Member

This is likely caused by https://github.com/getsentry/sentry-go/blob/master/gin/sentrygin.go#L73, which was added in #644. We'll take a look!
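
For reference, here's a minimal sketch of the kind of request modification such a tracing middleware performs. This is an illustration only, not the actual sentrygin source; the context key and placeholder value are made up.

package main

import (
	"context"

	"github.com/gin-gonic/gin"
)

// spanKey is a placeholder context key standing in for whatever the SDK
// actually stores; it exists only to make this sketch compile.
type spanKey struct{}

// tracingMiddleware shows the pattern: put tracing data into the request
// context, which requires replacing c.Request with a shallow copy via
// WithContext. That replacement is what matters for this issue.
func tracingMiddleware() gin.HandlerFunc {
	return func(c *gin.Context) {
		ctx := context.WithValue(c.Request.Context(), spanKey{}, "transaction-placeholder")
		c.Request = c.Request.WithContext(ctx)
		c.Next()
	}
}

func main() {
	router := gin.Default()
	router.Use(tracingMiddleware())
	router.Run("127.0.0.1:3000")
}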

@cleptric
Member

@zetaab Did you notice similar issues, assuming you're using gin performance tracing in one of your apps now?

@zetaab
Contributor

zetaab commented Jul 27, 2023

Our apps are small internal applications; I have not noticed any issues like this.

And we are not using any file uploads.

@zetaab
Contributor

zetaab commented Jul 27, 2023

I am thinking that in the case of file uploads it will try to read the body, and that should be closed after the read. Otherwise it will hold on to the memory, and the behaviour is as described in the first post. Now someone just needs to find the correct place to add the close.

@greywolve
Contributor

greywolve commented Oct 13, 2023

I've managed to replicate this behaviour in 0.22 by creating a small example which uploads a file. Indeed, the multipart temporary files are not cleaned up. When I try the same example with 0.21, the temporary files are cleaned up.

After a bunch of digging I've realised that this is not an issue caused by the Sentry SDK. It's really an issue with the way net/http deals with a modified request.

I can also replicate the issue like this:

package main

import (
	"fmt"
	"net/http"
	"os"
)

func main() {

	fmt.Println("OS temp dir path:", os.TempDir())

	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		// A maxMemory of 0 forces uploaded file parts to be written to
		// temporary files on disk rather than held in memory.
		if err := r.ParseMultipartForm(0); err != nil {
			fmt.Println(err)
			return
		}

		file, header, err := r.FormFile("file")
		if err != nil {
			fmt.Println(err)
			return
		}
		defer file.Close()

		fmt.Println("received:", header.Filename)
	})

	http.ListenAndServe("127.0.0.1:3000", testMiddleware(http.DefaultServeMux))
}

func testMiddleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		// Shallow-copy the request, as middleware commonly does when it needs
		// to attach values to the context.
		ctx := r.Context()
		r = r.WithContext(ctx)
		next.ServeHTTP(w, r)
	})
}

The primary culprit here is this line: r = r.WithContext(ctx)

This creates a shallow copy of the request, which is what gets passed on to the next handler (the docs say not to modify the original request). If a handler further along the chain then parses a multipart form, the parsed result ends up on the shallow copy, not on the original request. net/http cleans up the temp files when the request finishes, but it only does so on the original request, and since that request has no multipart result, no cleanup happens.

Version 0.22 added starting a transaction (for tracing) to the Gin integration, and that has to modify the request in order to inject the span into its context.

So, sadly, the solution here is to always make sure you clean up yourself, by calling the RemoveAll method on the form.

In the above example we'd defer:

r.MultipartForm.RemoveAll()
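
If you're using the Gin integration, the same cleanup in a handler would look roughly like the following. This is a minimal, made-up sketch (the route and handler are hypothetical), assuming the form is parsed via c.FormFile:

package main

import (
	"net/http"

	"github.com/gin-gonic/gin"
)

func main() {
	router := gin.Default()

	router.POST("/upload", func(c *gin.Context) {
		// c.FormFile parses the multipart form under the hood, which may write
		// large parts to temporary files on disk.
		fh, err := c.FormFile("file")
		if err != nil {
			c.String(http.StatusBadRequest, "bad upload: %v", err)
			return
		}

		// Middleware such as sentrygin's tracing replaces c.Request with a
		// shallow copy, so net/http's automatic cleanup never sees this form.
		// Remove the temp files explicitly.
		defer func() {
			if c.Request.MultipartForm != nil {
				c.Request.MultipartForm.RemoveAll()
			}
		}()

		// ... process the upload, e.g. c.SaveUploadedFile(fh, "/tmp/"+fh.Filename) ...
		c.String(http.StatusOK, "received %s", fh.Filename)
	})

	router.Run("127.0.0.1:3000")
}

The nil check matters because MultipartForm is only populated once the form has actually been parsed.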


@greywolve
Contributor

@cleptric I think we can close this issue, there isn't much we can do from our side.

@chawco

chawco commented Oct 13, 2023

> @cleptric I think we can close this issue, there isn't much we can do from our side.

I don't think this is true. The bug report is about page cache -- not the temp files. We already call RemoveAll to clean up multipart forms. That's normal, and documented in the API. To us it appears that some object captured by gin-gonic is leaking references or file handles or something. Were you able to reproduce that?

@greywolve
Contributor

@chawco thanks for the response. :)

I will try to reproduce more and see if I can spot any file handle or reference leaks.

@greywolve
Contributor

greywolve commented Oct 20, 2023

@chawco @vitaliyf

I've been trying to check for any leaks using strace on Ubuntu. So far I can't seem to see any indication that files are being leaked.

Here's the strace output when uploading a file for v0.21:

30860 16:20:31 openat(AT_FDCWD, "/tmp/multipart-1368728943", O_RDWR|O_CREAT|O_EXCL|O_CLOEXEC, 0600) = 9
30860 16:20:32 close(9)                 = 0
30860 16:20:32 openat(AT_FDCWD, "/tmp/multipart-1368728943", O_RDONLY|O_CLOEXEC) = 9
30860 16:20:32 close(9)                 = 0
30860 16:20:32 openat(AT_FDCWD, "/tmp/multipart-1368728943", O_RDONLY|O_CLOEXEC) = 9
30860 16:20:32 openat(AT_FDCWD, "e5535221-c915-4ee7-80de-013d6f6ea463.zip", O_RDWR|O_CREAT|O_TRUNC|O_CLOEXEC, 0666) = 10
30860 16:20:32 close(10)                = 0
30860 16:20:32 close(9)                 = 0
30830 16:20:32 close(7)                 = 0

And here's v0.22:

31347 16:22:22 openat(AT_FDCWD, "/tmp/multipart-3182529862", O_RDWR|O_CREAT|O_EXCL|O_CLOEXEC, 0600) = 9
31347 16:22:22 close(9)                 = 0
31347 16:22:22 openat(AT_FDCWD, "/tmp/multipart-3182529862", O_RDONLY|O_CLOEXEC) = 9
31347 16:22:22 close(9)                 = 0
31347 16:22:22 openat(AT_FDCWD, "/tmp/multipart-3182529862", O_RDONLY|O_CLOEXEC) = 9
31347 16:22:22 openat(AT_FDCWD, "8808efbf-e72a-4091-a723-6142c1fe2677.zip", O_RDWR|O_CREAT|O_TRUNC|O_CLOEXEC, 0666) = 10
31347 16:22:22 close(10)                = 0
31347 16:22:22 close(9)                 = 0
31327 16:22:22 close(7)                 = 0

It doesn't seem like anything has changed, and all the opens seem to have matching closes.

container_memory_cache includes active files, inactive files, and tmpfs.

From the Docker runtime docs:

The amount of memory used by the processes of this control group that can be associated precisely with a block on a block device. When you read from and write to files on disk, this amount increases. This is the case if you use "conventional" I/O (open, read, write syscalls) as well as mapped files (with mmap). It also accounts for the memory used by tmpfs mounts, though the reasons are unclear.

And

The exact formula is cache = active_file + inactive_file + tmpfs.

So temporary files created by multipart parsing can certainly push that metric up (and if they aren't cleaned up, it will stay high).

Was your container OOMing from this? What was happening with container_memory_working_set_bytes during that time?

@chawco

chawco commented Oct 23, 2023

I think we've found a place where multipart form parsing was being done but wasn't obvious, because it was done implicitly by net/http via FormFile, and you're still expected to call RemoveAll afterward. So I think the mystery is solved here -- once we added the cleanup in that last spot, memory usage remained low even with more recent Sentry releases.

It may be good to mention this in the docs, however: if someone wasn't cleaning up previously, they were just left with some temp files on disk rather than leaking memory, and it would be pretty unexpected for a minor Sentry version bump to turn a pretty benign pre-existing bug into a memory leak.
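
For anyone hitting the same thing: the parsing is easy to miss because FormFile calls ParseMultipartForm internally (with a 32 MB in-memory limit) the first time the form is used. A rough sketch of the pattern we ended up with (handler name and wiring are hypothetical, not our actual code):

package main

import (
	"fmt"
	"net/http"
)

func uploadHandler(w http.ResponseWriter, r *http.Request) {
	// No explicit ParseMultipartForm call here, yet FormFile still parses the
	// form and may spill large parts into temporary files on disk.
	file, header, err := r.FormFile("file")
	if err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}
	// With middleware that swaps in a shallow copy of the request, net/http
	// won't remove those temp files for us, so clean up explicitly.
	defer func() {
		if r.MultipartForm != nil {
			r.MultipartForm.RemoveAll()
		}
	}()
	defer file.Close()

	fmt.Fprintf(w, "received %s (%d bytes)\n", header.Filename, header.Size)
}

func main() {
	http.HandleFunc("/upload", uploadHandler)
	http.ListenAndServe("127.0.0.1:3000", nil)
}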

@greywolve
Contributor

@chawco good point, we'll definitely make a note of this gotcha in the docs.

greywolve added a commit to greywolve/sentry-docs that referenced this issue Oct 31, 2023
Add a section on avoiding temporary file leaks in file upload handlers when using Go SDK middleware.

See this issue too: getsentry/sentry-go#673

greywolve added a commit to greywolve/sentry-docs that referenced this issue Oct 31, 2023
Add a section on avoiding temporary file leaks when using Go SDK middleware with file upload handlers.

See this issue: getsentry/sentry-go#673

shanamatthews added a commit to getsentry/sentry-docs that referenced this issue Oct 31, 2023
* feat(go): add temp file leak section to troubleshooting

Add a section on avoiding temporary file leaks when using Go SDK middleware with file upload handlers.

See this issue: getsentry/sentry-go#673

* Update src/platforms/go/common/troubleshooting.mdx

---------

Co-authored-by: Shana Matthews <shana.l.matthews@gmail.com>
cleptric closed this as completed Nov 1, 2023