
FindProvidersAsync misses a provider #750

Open
nikosft opened this issue Oct 17, 2021 · 6 comments

Comments

@nikosft

nikosft commented Oct 17, 2021

TL;DR: There is a race condition that causes FindProvidersAsync to miss a provider.

I was experimenting with testground using a lossless network and TCP. My experiment uses 200 nodes, each providing a record; each node then executes FindProvidersAsync for all 200 records. In many runs, 3-4 out of the 40,000 requests were not successful, which shouldn't happen.

Through trial and error I figured out that this is caused here:

if ps.TryAdd(prov.ID) {

ps.TryAdd(prov.ID) increases the size of ps, which is a sufficient condition for this function to return true:

func() bool {

However, this causes followUpCtx to be cancelled, hence the code jumps to

case <-ctx.Done():

If this happens fast enough, the provider may never be returned, since the send happens in the previous line:

case peerOut <- *prov:
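
To see why the Done branch can win, here is a minimal, self-contained sketch (this is not the DHT code; it only demonstrates Go's select semantics): when several cases of a select are ready, Go picks one pseudo-randomly, so an already-cancelled context can beat a send that would otherwise succeed.

package main

import (
	"context"
	"fmt"
)

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	cancel() // already cancelled, like followUpCtx after the stop condition fires

	out := make(chan int, 1) // the send below could always succeed...
	dropped := 0
	for i := 0; i < 1000; i++ {
		select {
		case out <- i: // ...but select gives it no priority over the Done case
			<-out // drain so the next send is ready again
		case <-ctx.Done():
			dropped++
		}
	}
	fmt.Printf("value dropped in %d of 1000 iterations\n", dropped)
}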

I (probably) solved this problem by using a lock, so lines 533-541 in my code look like this:

peersetMutex.Lock()
if ps.TryAdd(prov.ID) {
	logger.Debugf("using provider: %s", prov)
	select {
	case peerOut <- *prov:
	case <-ctx.Done():
		logger.Debug("context timed out sending more providers")
		return nil, ctx.Err()
	}
}
peersetMutex.Unlock()

Then I modified the function at line 559 as follows:

func() bool {
	peersetMutex.Lock()
	output := !findAll && ps.Size() >= count
	peersetMutex.Unlock()
	return output
}
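
Note that peersetMutex is a new variable (it does not exist in the current code); for the two snippets above to compile it would need to be declared somewhere both call sites can see it, for example:

var peersetMutex sync.Mutex // guards ps.TryAdd / ps.Size across the two snippets above (needs the sync import)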
@vyzo
Contributor

vyzo commented Oct 17, 2021 via email

@nikosft
Author

nikosft commented Oct 17, 2021

To determine whether the bug is there, I used a variable that counts the records sent through peerOut, so the validation code looked similar to the following:

if ps.TryAdd(prov.ID) {
	logger.Debugf("using provider: %s", prov)
	select {
	case peerOut <- *prov:
		returnedItems++
	case <-ctx.Done():
		logger.Debug("context timed out sending more providers")
		if returnedItems < ps.Size() {
			logger.Debug("!!!Something strange happened!!!")
		}
		return nil, ctx.Err()
	}
}

Maybe try something like that, i.e., add a condition somewhere that checks how many providers we have sent back before closing the channel.

@vyzo
Contributor

vyzo commented Oct 18, 2021

@aschmahmann thoughts on the matter?

@aschmahmann
Contributor

@vyzo without looking too deeply at this, I'd believe that a context cancellation race could occur here. There's a tradeoff around whether the channel is allowed to back up and hold a goroutine open due to backpressure, or whether there's a maximal amount of waiting time. For example, with locking, if findProvidersAsync returned a channel that didn't close until the caller had read everything off it, the goroutine could be waiting for a while. That's not necessarily a bad thing, but it's a tradeoff that doesn't seem very worthwhile here.

This doesn't seem like a high-priority item, but a fix that doesn't involve an excessive amount of locking seems reasonable. One fix that comes to mind is adding a condition in the cancel case that checks why it was cancelled and, if it was due to a Stop condition, decides how to abort. We'd have to decide what the abort semantics are, though, such as:

  • immediately try sending down the channel and then exit, using a select with a send case and a default case (probably what I'd recommend; see the sketch after this list)
  • wait a small amount of time (seems ok)
  • hang the goroutine until the consumer fully processes the channel (I understand this, but would need some convincing since this seems likely to cause problems for consumers)
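
For concreteness, a rough sketch of the first option applied to the existing select (just a sketch of the idea; it skips checking why the context was cancelled):

if ps.TryAdd(prov.ID) {
	logger.Debugf("using provider: %s", prov)
	select {
	case peerOut <- *prov:
	case <-ctx.Done():
		// cancelled before the send went through; make one last
		// non-blocking attempt before giving up
		select {
		case peerOut <- *prov:
		default:
			logger.Debug("context cancelled, dropping provider")
		}
		return nil, ctx.Err()
	}
}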

WDYT?

@nikosft
Author

nikosft commented Oct 18, 2021

I ran go test . -run TestProvidesMany -count=100 and three tests failed (with the error reported in issue #726). I modified

if ps.TryAdd(prov.ID) {

as follows:

if ps.TryAdd(prov.ID) {
   peerOut <- *prov
}

This removes the race condition. I then ran go test . -run TestProvidesMany -count=100 three times and got no errors. I did not test it extensively, but there is a chance that issue #726 is related to this one.

@aschmahmann WRT

One fix that comes to mind is adding a condition in the cancel case that checks why it was cancelled

The context used in the select statement is created here:

followUpCtx, cancelFollowUp := context.WithCancel(ctx)
and it is always cancelled when the requested number of providers is found, so this is not an exceptional case. This context is cancelled on every FindProvidersAsync call and, at least in our experiments, the race condition appears approximately once every 1000 requests.

@vyzo
Copy link
Contributor

vyzo commented Oct 19, 2021

I don't think we should hang the goroutine -- immediately sending down the channel seems like the sanest option.
