"fatal error: concurrent map writes" when submitting transaction #570

Closed
SupaHam opened this issue Feb 6, 2023 · 6 comments · Fixed by #638

Comments

SupaHam commented Feb 6, 2023

Summary

When span.Finish is called, there is a chance of a fatal concurrent map write, because the environment integration forcefully sets the device parameters arch and cpu.

Steps To Reproduce

  1. Set "device" context on scope concurrently using scope.SetContext
  2. In parallel, create and Finish a span as is.
  3. fatal error: concurrent map writes
scope.SetContext("device", map[string]interface{}{
  "name": model,
})

Expected Behavior

Don't crash the Go application just because integrations have no concept of concurrency-safe access to contexts.

Screenshots

Sentry.io Event

Environment

SDK

  • sentry-go version: v0.16.0
  • Go version: 1.18
  • Using Go Modules? yes

Sentry

  • Using hosted Sentry in sentry.io? yes
  • Using your own Sentry installation? Version:
  • Anything particular to your environment that could be related to this issue?

Additional context


goroutine 140293796 [running]:
runtime.throw({0x2935f76?, 0x24b8a60?})
        /usr/local/go/src/runtime/panic.go:992 +0x71 fp=0xc01795ddc0 sp=0xc01795dd90 pc=0x43bcb1
runtime.mapassign_faststr(0x24b8640, 0xc0297caff0, {0x2909cb4, 0x4})
        /usr/local/go/src/runtime/map_faststr.go:295 +0x38b fp=0xc01795de28 sp=0xc01795ddc0 pc=0x416acb
github.com/getsentry/sentry-go.(*environmentIntegration).processor(0x0?, 0xc009852a00, 0xc01795dd38?)
        /build/.go/pkg/mod/github.com/getsentry/sentry-go@v0.16.0/integrations.go:89 +0x21a fp=0xc01795ded0 sp=0xc01795de28 pc=0x8d62ba
github.com/getsentry/sentry-go.(*environmentIntegration).processor-fm(0xc006d61560?, 0xc009852a00?)
        <autogenerated>:1 +0x2f fp=0xc01795def8 sp=0xc01795ded0 pc=0x8e5b6f
github.com/getsentry/sentry-go.(*Client).prepareEvent(0xc000d6e000, 0xc009852a00, 0x0?, {0x30983c0, 0xc006d61560})
        /build/.go/pkg/mod/github.com/getsentry/sentry-go@v0.16.0/client.go:642 +0x6a3 fp=0xc01795ecf0 sp=0xc01795def8 pc=0x8d2683
github.com/getsentry/sentry-go.(*Client).processEvent(0xc000d6e000, 0xc009852a00, 0x0, {0x30983c0, 0xc006d61560})
        /build/.go/pkg/mod/github.com/getsentry/sentry-go@v0.16.0/client.go:569 +0x185 fp=0xc01795eff8 sp=0xc01795ecf0 pc=0x8d1d85
github.com/getsentry/sentry-go.(*Client).CaptureEvent(...)
        /build/.go/pkg/mod/github.com/getsentry/sentry-go@v0.16.0/client.go:395
github.com/getsentry/sentry-go.(*Hub).CaptureEvent(0xc0255b39b0, 0xc009852a00)
        /build/.go/pkg/mod/github.com/getsentry/sentry-go@v0.16.0/hub.go:224 +0x66 fp=0xc01795f040 sp=0xc01795eff8 pc=0x8d5146
github.com/getsentry/sentry-go.(*Span).Finish(0xc000e4fb20)
        /build/.go/pkg/mod/github.com/getsentry/sentry-go@v0.16.0/tracing.go:185 +0x174 fp=0xc01795f0a0 sp=0xc01795f040 pc=0x8defb4
...

This issue has been around for at least a year. The environment is a game server that receives a lot of unique connections and has multiple goroutines running at connection initialisation that spawn spans while asynchronously manipulating the context as new data comes in.

tonyo (Member) commented Feb 6, 2023

Thanks for reporting, we'll look into this 👍

cleptric (Member) commented Feb 6, 2023

I think we need locks here https://github.com/getsentry/sentry-go/blob/master/integrations.go#L87 and in all other places.
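
For illustration, a minimal sketch of the kind of locking being suggested. This is a hypothetical wrapper, not the actual sentry-go code: every write to the shared contexts map goes through a mutex.

// lockedContexts is a hypothetical wrapper, not sentry-go code; it only shows
// the shape of the locking being suggested here.
type lockedContexts struct {
	mu       sync.Mutex
	contexts map[string]map[string]interface{}
}

// setDeviceKey writes a single key into the "device" context while holding the
// lock, so concurrent callers cannot trigger a concurrent map write.
func (l *lockedContexts) setDeviceKey(key string, value interface{}) {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.contexts == nil {
		l.contexts = map[string]map[string]interface{}{}
	}
	device, ok := l.contexts["device"]
	if !ok {
		device = map[string]interface{}{}
		l.contexts["device"] = device
	}
	device[key] = value
}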

tonyo (Member) commented Feb 23, 2023

Spent some time on this.
At the moment I don't see a clear scenario of how this specific error could be triggered, given the info provided in this issue 🤔
Like, scope.SetContext and event.Contexts here are essentially operating on different maps, so I suspect there might be something else in play.

Tried to reproduce it with the following test (runs for 10 minutes), but it passes:

// Running like:
// go test -run ^TestSetContextAndFinishTransaction$ -count 1 -race
//
func TestSetContextAndFinishTransaction(t *testing.T) {
	ctx := NewTestContext(ClientOptions{
		EnableTracing:    true,
		TracesSampleRate: 1.0,
	})
	hub := GetHubFromContext(ctx)
	scope := hub.Scope()

	transaction := StartTransaction(ctx, "op")

	go func() {
		for {
			transaction.Finish()
		}
	}()

	go func() {
		for {
			scope.SetContext("device", map[string]interface{}{
				"name": "value",
			})
		}
	}()

	time.Sleep(600 * time.Second)

	fmt.Println("Done!")
}

For comparison, the following test with concurrent map writes fails within 1 second:

func TestConcurrentMapWrite(t *testing.T) {
	m := map[string]string{}

	go func() {
		for {
			m["key1"] = "value1"
		}
	}()

	go func() {
		for {
			m["key2"] = "value2"
		}
	}()

	time.Sleep(1 * time.Second)

	fmt.Println("Done!")
}

For now I can think of the following alternative scenarios that may cause this:

  1. span.Finish (for a transaction span) is called multiple times (for whatever reason), which leads to a data race here. This is something that we generally want to prevent/fix, opened a separate issue for it: span.Finish() should be a noop when span is already finished #587
  2. scope.SetContext(key, newContext) takes newContext that is also modified elsewhere in parallel (see the sketch below).
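
For the second scenario, a hypothetical sketch (not taken from the report) of how a context map passed to SetContext can keep being written to elsewhere:

// Hypothetical illustration of scenario 2: the map handed to SetContext keeps
// being mutated by another goroutine, so anything that later reads or writes
// that same map instance can race with this write.
func setDeviceContext(scope *sentry.Scope, model string) {
	deviceCtx := map[string]interface{}{"name": model}
	scope.SetContext("device", deviceCtx)

	go func() {
		// Still writing into the exact map instance the scope now references.
		deviceCtx["battery_level"] = 42
	}()
}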

@SupaHam if at some point you have any additional information (e.g. more stack trace examples, ideally where the traces of both goroutines are present), that would help us a lot 🙏

SupaHam (Author) commented Mar 2, 2023

Hi @tonyo, sorry for the wait.

Your test will not trigger the issue because SetContext is locked, so it will always be safe to run asynchronously. The reported issue, specifically step 2, is multiple spans (transactions) modifying the same scope asynchronously. Apologies if the issue wasn't clearly described; here is a test that should trigger it for you locally - increase runs if the test passes, to increase the chances of hitting the fatal panic.

func TestAB(t *testing.T) {
	ctx := NewTestContext(sentry.ClientOptions{
		EnableTracing:    true,
		TracesSampleRate: 1,
	})
	hub := sentry.GetHubFromContext(ctx)
	c := make(chan *sentry.Span)
	const runs = 100
	var producers sync.WaitGroup
	for i := 0; i < runs; i++ {
		producers.Add(1)
		go func() {
			defer producers.Done()
			span := sentry.StartSpan(ctx, "test", sentry.TransactionName("test"))
			c <- span
			go hub.Scope().SetContext("device", map[string]interface{}{"name": "value"})
		}()
	}
	// Close the channel once every span has been sent, so the consumers below
	// can leave their range loops and wg.Wait() can return.
	go func() {
		producers.Wait()
		close(c)
	}()
	var wg sync.WaitGroup
	defer wg.Wait()
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for span := range c {
				go span.Finish()
			}
		}()
	}
}

cleptric (Member) commented Mar 6, 2023

Thanks for the follow-up.
I do acknowledge that we need to fix the race condition, however, I wanted to follow up on your initial example.

We normally recommend that each go routine uses its own hub, as described here: https://docs.sentry.io/platforms/go/concurrency/
As you're using the global hub, there might be further issues like hub/scope bleed happening. For error reporting, I would highly recommend you use a local hub, in case of performance monitoring it depends on your use case. Do you need to mutate the global scope in your case?
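
A minimal sketch of the per-goroutine hub pattern from those docs (illustrative only; handleConnection and the context values are hypothetical):

// Each goroutine clones the current hub, so scope changes stay local to that
// goroutine instead of mutating shared state.
func handleConnection(model string) {
	go func() {
		localHub := sentry.CurrentHub().Clone()
		localHub.ConfigureScope(func(scope *sentry.Scope) {
			scope.SetContext("device", map[string]interface{}{"name": model})
		})
		localHub.CaptureMessage("connection initialised")
	}()
}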

SupaHam (Author) commented Mar 6, 2023

Hi @cleptric. Firstly, I'd like to apologise for not being able to expose our internal proprietary code; the test above was only meant to demonstrate how it fails. Just to clarify, there is one hub per player in our game, but the loading of player data is done asynchronously. Changing that to be synchronous would add serious wait times, and we cannot afford that solution. The problem is that there is still a sliver of a chance of a panic, and it disconnects players from their gaming experience, which is very abrupt and ugly for user retention.

I'm not quite sure I see an issue with the use of asynchronous loading; I think the locking can be justified.

EDIT: I would like to update you on this issue. We have prevented the panics by forcefully setting the properties when creating the context, so this specific faulty map write never happens again. But that doesn't seem user friendly, and I can imagine more users coming across this issue.
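
For reference, a sketch of that kind of workaround (hypothetical code; it assumes the environment integration only fills in device keys such as arch and num_cpu when they are missing, so pre-setting them means the racy write never happens):

// Hypothetical sketch of the workaround described above: pre-populate the
// device keys that the environment integration would otherwise fill in, so
// its "set if missing" writes never touch the shared map. Assumes the keys
// in question are "arch" and "num_cpu".
func setDeviceContextWorkaround(scope *sentry.Scope, model string) {
	scope.SetContext("device", map[string]interface{}{
		"name":    model,
		"arch":    runtime.GOARCH,
		"num_cpu": runtime.NumCPU(),
	})
}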
