
Page cache memory usage regression with gin-gonic after 0.21->0.22 upgrade #673

Closed
vitaliyf opened this issue Jul 24, 2023 · 12 comments

@vitaliyf

Summary

After upgrading sentry-go from 0.21.0 to 0.22.0 (verified by downgrading), we found that our small gin-gonic-based service is "leaking" page cache memory.

Steps To Reproduce

I don't have an easy way to provide a code snippet; let me know if this is enough to point in the direction of the problem. If not, I will try to come up with more details.

We have:

  • A small REST API service using gin-gonic 0.8.2 and Go 1.20.6, running inside an Ubuntu 22.04 container
  • It accepts POSTs of file uploads (zip files, up to hundreds of megabytes each)
  • The code then looks at c.Request.MultipartForm.File and processes each file (unzips it to see what's inside)
  • The contents of each file are then uploaded to object storage, and the file is deleted

After the sentry-go 0.22 upgrade (verified by reverting back to 0.21 with no other changes), we noticed that the metric for "page cache" (container_memory_cache in Kubernetes) keeps going up (into several gigabytes) as POST requests are processed. Before the upgrade, the usage would spike briefly for each request and then immediately drop back to almost nothing.

Our theory is that one of the new gin integration features added in 0.22 is somehow intercepting the request lifecycle in a way that changes what happens to the temporary file(s) gin creates to handle the large POST body.

Expected Behavior

Before the upgrade, the size of the page cache did not increase.

Screenshots

Sentry.io Event

Environment

SDK

  • sentry-go version: 0.22.0
  • Go version: 1.20.6
  • Using Go Modules? yes

Sentry

  • Using hosted Sentry in sentry.io? no
  • Using your own Sentry installation? Version: 23.3.1
  • Anything particular to your environment that could be related to this issue?

Additional context

@cleptric
Member

This is likely caused by https://github.com/getsentry/sentry-go/blob/master/gin/sentrygin.go#L73, which was added in #644. We'll take a look!
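
For reference, here's a minimal sketch of the kind of request modification such a tracing middleware performs. This is an illustration only, not the actual sentrygin source; the context key and placeholder value are made up.

package main

import (
	"context"

	"github.com/gin-gonic/gin"
)

// spanKey is a placeholder context key standing in for whatever the SDK
// actually stores; it exists only to make this sketch compile.
type spanKey struct{}

// tracingMiddleware shows the pattern: put tracing data into the request
// context, which requires replacing c.Request with a shallow copy via
// WithContext. That replacement is what matters for this issue.
func tracingMiddleware() gin.HandlerFunc {
	return func(c *gin.Context) {
		ctx := context.WithValue(c.Request.Context(), spanKey{}, "transaction-placeholder")
		c.Request = c.Request.WithContext(ctx)
		c.Next()
	}
}

func main() {
	router := gin.Default()
	router.Use(tracingMiddleware())
	router.Run("127.0.0.1:3000")
}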

@cleptric
Member

@zetaab Did you notice similar issues, assuming you're using gin performance tracing in one of your apps now?

@zetaab
Contributor

zetaab commented Jul 27, 2023

Our apps are small internal applications; I have not noticed any issues like this.

And we are not using any file uploads.

@zetaab
Contributor

zetaab commented Jul 27, 2023

I am thinking that in the case of file uploads it will try to read the body, and that should be closed after the read. Otherwise it will hold on to the memory, and the behaviour is as described in the first post. Now someone just needs to find the correct place to add the close.

@greywolve
Contributor

greywolve commented Oct 13, 2023

I've managed to replicate this behaviour in 0.22 by creating a small example which uploads a file. Indeed, the multipart temporary files are not cleaned up. When I try the same example with 0.21, the temporary files are cleaned up.

After a bunch of digging I've realised that this is not an issue caused by the Sentry SDK. It's really an issue with the way net/http deals with a modified request.

I can also replicate the issue like this:

package main

import (
	"fmt"
	"net/http"
	"os"
)

func main() {

	fmt.Println("OS temp dir path:", os.TempDir())

	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		// A maxMemory of 0 forces uploaded file parts to be written to
		// temporary files on disk rather than held in memory.
		if err := r.ParseMultipartForm(0); err != nil {
			fmt.Println(err)
			return
		}

		file, header, err := r.FormFile("file")
		if err != nil {
			fmt.Println(err)
			return
		}
		defer file.Close()

		fmt.Println("received:", header.Filename)
	})

	http.ListenAndServe("127.0.0.1:3000", testMiddleware(http.DefaultServeMux))
}

func testMiddleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		// Shallow-copy the request, as middleware commonly does when it needs
		// to attach values to the context.
		ctx := r.Context()
		r = r.WithContext(ctx)
		next.ServeHTTP(w, r)
	})
}

The primary culprit here is this line: r = r.WithContext(ctx)

This creates a shallow copy of the request, which is what gets passed on to the next handler (the docs say not to modify the original request). If a handler further along the chain then parses a multipart form, the parsed result ends up on the shallow copy, not on the original request. net/http cleans up the temp files when the request finishes, but it only does so on the original request, and since that request has no multipart result, no cleanup happens.

Version 0.22 added starting a transaction (for tracing) to the Gin integration, and that has to modify the request in order to inject the span into its context.

So, sadly, the solution here is to always make sure you clean up yourself, by calling the RemoveAll method on the form.

In the above example we'd defer:

r.MultipartForm.RemoveAll()
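
If you're using the Gin integration, the same cleanup in a handler would look roughly like the following. This is a minimal, made-up sketch (the route and handler are hypothetical), assuming the form is parsed via c.FormFile:

package main

import (
	"net/http"

	"github.com/gin-gonic/gin"
)

func main() {
	router := gin.Default()

	router.POST("/upload", func(c *gin.Context) {
		// c.FormFile parses the multipart form under the hood, which may write
		// large parts to temporary files on disk.
		fh, err := c.FormFile("file")
		if err != nil {
			c.String(http.StatusBadRequest, "bad upload: %v", err)
			return
		}

		// Middleware such as sentrygin's tracing replaces c.Request with a
		// shallow copy, so net/http's automatic cleanup never sees this form.
		// Remove the temp files explicitly.
		defer func() {
			if c.Request.MultipartForm != nil {
				c.Request.MultipartForm.RemoveAll()
			}
		}()

		// ... process the upload, e.g. c.SaveUploadedFile(fh, "/tmp/"+fh.Filename) ...
		c.String(http.StatusOK, "received %s", fh.Filename)
	})

	router.Run("127.0.0.1:3000")
}

The nil check matters because MultipartForm is only populated once the form has actually been parsed.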


@greywolve
Contributor

@cleptric I think we can close this issue, there isn't much we can do from our side.

@chawco

chawco commented Oct 13, 2023

> @cleptric I think we can close this issue, there isn't much we can do from our side.

I don't think this is true. The bug report is about page cache -- not the temp files. We already call RemoveAll to clean up multipart forms. That's normal, and documented in the API. To us it appears that some object captured by gin-gonic is leaking references or file handles or something. Were you able to reproduce that?

@greywolve
Contributor

@chawco thanks for the response. :)

I will try to reproduce more and see if I can spot any file handle or reference leaks.

@greywolve
Contributor

greywolve commented Oct 20, 2023

@chawco @vitaliyf

I've been trying to check for any leaks using strace on Ubuntu. So far I can't seem to see any indication that files are being leaked.

Here's the strace output when uploading a file for v0.21:

30860 16:20:31 openat(AT_FDCWD, "/tmp/multipart-1368728943", O_RDWR|O_CREAT|O_EXCL|O_CLOEXEC, 0600) = 9
30860 16:20:32 close(9)                 = 0
30860 16:20:32 openat(AT_FDCWD, "/tmp/multipart-1368728943", O_RDONLY|O_CLOEXEC) = 9
30860 16:20:32 close(9)                 = 0
30860 16:20:32 openat(AT_FDCWD, "/tmp/multipart-1368728943", O_RDONLY|O_CLOEXEC) = 9
30860 16:20:32 openat(AT_FDCWD, "e5535221-c915-4ee7-80de-013d6f6ea463.zip", O_RDWR|O_CREAT|O_TRUNC|O_CLOEXEC, 0666) = 10
30860 16:20:32 close(10)                = 0
30860 16:20:32 close(9)                 = 0
30830 16:20:32 close(7)                 = 0

And here's v0.22:

31347 16:22:22 openat(AT_FDCWD, "/tmp/multipart-3182529862", O_RDWR|O_CREAT|O_EXCL|O_CLOEXEC, 0600) = 9
31347 16:22:22 close(9)                 = 0
31347 16:22:22 openat(AT_FDCWD, "/tmp/multipart-3182529862", O_RDONLY|O_CLOEXEC) = 9
31347 16:22:22 close(9)                 = 0
31347 16:22:22 openat(AT_FDCWD, "/tmp/multipart-3182529862", O_RDONLY|O_CLOEXEC) = 9
31347 16:22:22 openat(AT_FDCWD, "8808efbf-e72a-4091-a723-6142c1fe2677.zip", O_RDWR|O_CREAT|O_TRUNC|O_CLOEXEC, 0666) = 10
31347 16:22:22 close(10)                = 0
31347 16:22:22 close(9)                 = 0
31327 16:22:22 close(7)                 = 0

It doesn't seem like anything has changed, and all the opens seem to have matching closes.

container_memory_cache includes active files, inactive files, and tmpfs.

From the Docker runtime docs:

The amount of memory used by the processes of this control group that can be associated precisely with a block on a block device. When you read from and write to files on disk, this amount increases. This is the case if you use "conventional" I/O (open, read, write syscalls) as well as mapped files (with mmap). It also accounts for the memory used by tmpfs mounts, though the reasons are unclear.

And

The exact formula is cache = active_file + inactive_file + tmpfs.

So temporary files created by multipart parsing can certainly push that metric up (and if they aren't cleaned up, it will stay high).

Was your container OOMing from this? What was happening with container_memory_working_set_bytes during that time?

@chawco

chawco commented Oct 23, 2023

I think we've found a place where multipart form parsing was being done but wasn't obvious, because it was done implicitly by net/http via FormFile, and you're still expected to call RemoveAll afterward. So I think the mystery is solved here -- once we added the cleanup in that last spot, memory usage remained low even with more recent Sentry releases.

It may be good to mention this in the docs, however: if someone wasn't cleaning up previously, they were just left with some temp files on disk rather than leaking memory, and it would be pretty unexpected for a minor Sentry version bump to turn a pretty benign pre-existing bug into a memory leak.
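
For anyone hitting the same thing: the parsing is easy to miss because FormFile calls ParseMultipartForm internally (with a 32 MB in-memory limit) the first time the form is used. A rough sketch of the pattern we ended up with (handler name and wiring are hypothetical, not our actual code):

package main

import (
	"fmt"
	"net/http"
)

func uploadHandler(w http.ResponseWriter, r *http.Request) {
	// No explicit ParseMultipartForm call here, yet FormFile still parses the
	// form and may spill large parts into temporary files on disk.
	file, header, err := r.FormFile("file")
	if err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}
	// With middleware that swaps in a shallow copy of the request, net/http
	// won't remove those temp files for us, so clean up explicitly.
	defer func() {
		if r.MultipartForm != nil {
			r.MultipartForm.RemoveAll()
		}
	}()
	defer file.Close()

	fmt.Fprintf(w, "received %s (%d bytes)\n", header.Filename, header.Size)
}

func main() {
	http.HandleFunc("/upload", uploadHandler)
	http.ListenAndServe("127.0.0.1:3000", nil)
}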

@greywolve
Contributor

@chawco good point, we'll definitely make a note of this gotcha in the docs.

greywolve added a commit to greywolve/sentry-docs that referenced this issue Oct 31, 2023
Add a section on avoiding temporary file leaks in file upload handlers when using Go SDK middleware.

See this issue too: getsentry/sentry-go#673

greywolve added a commit to greywolve/sentry-docs that referenced this issue Oct 31, 2023
Add a section on avoiding temporary file leaks when using Go SDK middleware with file upload handlers.

See this issue: getsentry/sentry-go#673

shanamatthews added a commit to getsentry/sentry-docs that referenced this issue Oct 31, 2023
* feat(go): add temp file leak section to troubleshooting

Add a section on avoiding temporary file leaks when using Go SDK middleware with file upload handlers.

See this issue: getsentry/sentry-go#673

* Update src/platforms/go/common/troubleshooting.mdx

---------

Co-authored-by: Shana Matthews <shana.l.matthews@gmail.com>
cleptric closed this as completed Nov 1, 2023