Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

firestore: Deadlock when using emulator and parallel read in the transaction #2604

Closed
roblaszczak opened this issue Jul 14, 2020 · 14 comments
Closed
Assignees
Labels
api: firestore Issues related to the Firestore API. external This issue is blocked on a bug with the actual product. priority: p2 Moderately-important priority. Fix may not be included in next release. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns.

Comments

@roblaszczak
Copy link

roblaszczak commented Jul 14, 2020

Client

Firestore

Environment

Emulator running in the way described here: https://firebase.google.com/docs/rules/emulator-setup

firebase emulators:start --only firestore

Go Environment

$ go version
go version go1.14.4 linux/amd64

$ go env
O111MODULE=""
GOARCH="amd64"
GOBIN="/home/robert/go/bin"
GOCACHE="/home/robert/.cache/go-build"
GOENV="/home/robert/.config/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOINSECURE=""
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/home/robert/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/usr/lib/go"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/usr/lib/go/pkg/tool/linux_amd64"
GCCGO="gccgo"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build613925690=/tmp/go-build -gno-record-gcc-switches"
FIRESTORE_EMULATOR_HOST=localhost:8080

Code

e.g.

package main

import (
	"context"
	"fmt"
	"sync"
	"time"

	"cloud.google.com/go/firestore"
)

func main() {
	// it can be even 1 minute, it's still working in the same way
	timeout := time.Second * 1

	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	workersCount := 2 // two workers are enough to reproduce
	workersDone := sync.WaitGroup{}
	workersDone.Add(workersCount)

	docID := fmt.Sprintf("%d", time.Now().UnixNano())

	startWorkers := make(chan struct{})

	// it doesn't matter if I'm re-using the client and using separate one
	client := newFirestoreClient()

	for worker := 0; worker < workersCount; worker++ {
		workerNum := worker

		go func() {
			defer workersDone.Done()
			<-startWorkers

			err := client.RunTransaction(ctx, func(ctx context.Context, transaction *firestore.Transaction) error {
				docRef := client.Collection("test-coll").Doc(docID)

				// without that function, it's working fine
				_, _ = transaction.Get(docRef)

				if err := transaction.Set(docRef, map[string]string{"foo": "bar"}); err != nil {
					return err
				}

				return nil
			})
			fmt.Printf("err from worker %d: %+v\n", workerNum, err)
		}()
	}

	close(startWorkers)
	workersDone.Wait()
}


func newFirestoreClient() *firestore.Client {
	firebaseClient, err := firestore.NewClient(context.Background(), "project-id")
	if err != nil {
		panic(err)
	}
	return firebaseClient
}

Expected behavior

No deadlock in RunTransaction.

Actual behavior

After running the code, RunTransaction is blocked until the context is canceled because of timeout.

err from worker 0: rpc error: code = DeadlineExceeded desc = context deadline exceeded
err from worker 1: rpc error: code = DeadlineExceeded desc = context deadline exceeded

When we will get rid of _, _ = transaction.Get(docRef) everything is working fine.

Additional context

I'm not sure that it's not just a problem with the newest version of the emulator or client - we have this issue for a couple of months.
It's also hard to say if it's the emulator problem or Go client. Anyway, I tried to find any similar issues in other clients and I didn't found anything useful.

If you think that it may be useful, I can also create issue for the emulator and link it here :)

@roblaszczak roblaszczak added the triage me I really want to be triaged. label Jul 14, 2020
@product-auto-label product-auto-label bot added the api: firestore Issues related to the Firestore API. label Jul 14, 2020
@yoshi-automation yoshi-automation added the 🚨 This issue needs some love. label Jul 19, 2020
@tritone
Copy link
Contributor

tritone commented Jul 22, 2020

Thanks for filing the issue. I don't see where in your code or environment you are enabling the emulator-- could you clarify on this?

@BenWhitehead do we support this functionality through the emulator?

@tritone tritone added needs more info This issue needs more information from the customer to proceed. priority: p2 Moderately-important priority. Fix may not be included in next release. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. and removed 🚨 This issue needs some love. triage me I really want to be triaged. labels Jul 22, 2020
@roblaszczak
Copy link
Author

Hello @tritone, I added the info :)

@tritone
Copy link
Contributor

tritone commented Jul 22, 2020

Sounds good-- just wanted to confirm that you are also using FIRESTORE_EMULATOR_HOST to indicate that the client library should call the emulator?

@roblaszczak
Copy link
Author

Yes, I also added that for clarification.

As I said, commenting _, _ = transaction.Get(docRef) out is fixing the issue, so I'm sure that the problem is not with connection timeout, but with transaction.

@tritone
Copy link
Contributor

tritone commented Jul 22, 2020

Great; thanks very much for the quick response. I'll make sure this use case is supported and we can look into filing a bug in the emulator if the problem is there.

@tritone tritone removed the needs more info This issue needs more information from the customer to proceed. label Jul 22, 2020
@roblaszczak
Copy link
Author

Do you have any news about that @tritone?

I'm asking, because I'm writing an article about testing repositories which includes Firestore, MySQL, and in-memory implementation.

Unfortunately, in Firestore implementation I need to explicitly skip some tests because they are failing because of that bug :(

@tritone
Copy link
Contributor

tritone commented Jul 27, 2020

I didn't see any obvious reason that this shouldn't be supported in the emulator. @BenWhitehead is taking a look-- he's going to see if he can repro using the Java client as well. We will keep you posted.

@BenWhitehead
Copy link

I've spent some time attempting to create a repro in Java. Some mixed news I'm afraid. I was able to create what appeared to be a hang when attempting to have multiple transactions modify the same document at the same time, however, the hang resulted in a transaction timeout and each of the failed transactions were retried and then succeed. The noticeable hang I was able to observe in my testing was ~30 seconds while the transaction was timing out.

A few more questions to try and help narrow things down:

  1. When you are observing the deadlock, are you able to determine if it is longer than 30 seconds?
  2. In the stdout of your emulator do you ever see the message INFO: operation failed: transaction timeout?
  3. Are you using a new emulator process each time? (The reason I ask this, is I observed that transaction timeouts became less likely the longer I had the emulator running and was running my tests. My suspicion here is that the emulators JVM is able to JIT some optomizations to reduce the window of possible transaction contention.)

@roblaszczak
Copy link
Author

roblaszczak commented Jul 30, 2020

hello @BenWhitehead :)

ad 1

I increased timeout to 60s on the client level, and it's correct - it's timeouts after 30s

2020/07/30 08:45:12 create firestore
2020/07/30 08:45:12 starting workers
2020/07/30 08:45:12 running tx
2020/07/30 08:45:12 running tx
2020/07/30 08:45:42 err from worker 0: rpc error: caode = Unknown desc =
2020/07/30 08:45:42 err from worker 1: rpc error: code = Unknown desc =

ad 2

When I'm setting 60s timeout on the client level

Jul 30, 2020 8:45:12 AM io.gapi.emulators.netty.HttpVersionRoutingHandler channelRead
INFO: Detected HTTP/2 connection.
Jul 30, 2020 8:45:31 AM com.google.cloud.datastore.emulator.impl.util.WrappedStreamObserver onError
INFO: operation failed: transaction timeout
Jul 30, 2020 8:45:31 AM com.google.cloud.datastore.emulator.impl.util.WrappedStreamObserver onError
INFO: operation failed: transaction timeout
Jul 30, 2020 8:45:42 AM com.google.cloud.datastore.emulator.impl.util.WrappedStreamObserver onError
INFO: operation failed: transaction timeout
Jul 30, 2020 8:45:42 AM com.google.cloud.datastore.emulator.impl.util.WrappedStreamObserver onError
INFO: operation failed: transaction timeout

With smaller 1s timeout on the consumer level, it actually also return an error. But what is interesting it's after the client is already dead (I assume that it's not handling the situation when the client connection is closed?)

Jul 30, 2020 8:48:22 AM io.gapi.emulators.netty.HttpVersionRoutingHandler channelRead
INFO: Detected HTTP/2 connection.
Jul 30, 2020 8:48:52 AM com.google.cloud.datastore.emulator.impl.util.WrappedStreamObserver onError
INFO: operation failed: transaction timeout
Jul 30, 2020 8:48:52 AM com.google.cloud.datastore.emulator.impl.util.WrappedStreamObserver onError
INFO: operation failed: transaction timeout

ad 3

It's interesting, but on the first run the error is different. But anyway, the problem is the same (deadlock and error after client or server timeout, in that case it's on the client side):

# first run
2020/07/30 08:51:21 err from worker 0: rpc error: code = Canceled desc = stream terminated by RST_STREAM with error code: CANCEL
2020/07/30 08:51:21 err from worker 1: rpc error: code = Canceled desc = stream terminated by RST_STREAM with error code: CANCEL

# second run
2020/07/30 08:51:37 err from worker 0: rpc error: code = DeadlineExceeded desc = context deadline exceeded
2020/07/30 08:51:37 err from worker 1: rpc error: code = DeadlineExceeded desc = context deadline exceeded

# third run
2020/07/30 08:51:46 err from worker 1: rpc error: code = DeadlineExceeded desc = context deadline exceeded
2020/07/30 08:51:46 err from worker 0: rpc error: code = DeadlineExceeded desc = context deadline exceeded

UPDATE:
I also tried to run fresh emulator and 60s timeout, in that case server didn't timeout after 30s as usual, but client timeout worked after 60s

2020/07/30 08:53:23 create firestore
2020/07/30 08:53:23 starting workers
2020/07/30 08:53:23 running tx
2020/07/30 08:53:23 running tx
2020/07/30 08:54:23 err from worker 1: context deadline exceeded
2020/07/30 08:54:23 err from worker 0: context deadline exceeded

@BenWhitehead
Copy link

Thanks for the additional info @roblaszczak. This sounds like it is something that will need some deeper investigation into the internals for the Go Firestore client to determine the best way to address the issue.

There is an integration tests in the Java Firestore client that specifically tests this type of scenario and ensure that after the contention, there is a re-attempt. We probably need to ensure that something similar is added to the test suite here.

@mad-it
Copy link

mad-it commented Aug 31, 2020

95% certain this is related to firebase/firebase-tools#2452

@yoshi-automation yoshi-automation added 🚨 This issue needs some love. and removed 🚨 This issue needs some love. labels Nov 29, 2020
@yoshi-automation yoshi-automation added the 🚨 This issue needs some love. label Jan 10, 2021
@tritone tritone assigned crwilcox and unassigned tritone and crwilcox Jan 22, 2021
@crwilcox crwilcox added the external This issue is blocked on a bug with the actual product. label Jan 26, 2021
@yoshi-automation yoshi-automation removed the 🚨 This issue needs some love. label Jan 26, 2021
@lutogin
Copy link

lutogin commented Feb 1, 2021

Any progress on this issue?

@crwilcox
Copy link
Contributor

Follow for updates at firebase/firebase-tools#2452

@crwilcox crwilcox assigned dmahugh and unassigned crwilcox Nov 1, 2021
@meredithslota
Copy link
Contributor

Linked issue has now been closed, see firebase/firebase-tools#4400

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api: firestore Issues related to the Firestore API. external This issue is blocked on a bug with the actual product. priority: p2 Moderately-important priority. Fix may not be included in next release. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns.
Projects
None yet
Development

No branches or pull requests

9 participants