
FindProvidersAsync misses a provider #750

Open
nikosft opened this issue Oct 17, 2021 · 6 comments

Comments

@nikosft

nikosft commented Oct 17, 2021

TL;DR: There is a race condition that causes FindProvidersAsync to miss a provider.

I was experimenting with testground using a lossless network and TCP. My experiment uses 200 nodes, each providing a record; each node then executes FindProvidersAsync for all 200 records. In many runs, 3-4 out of the 40,000 requests were not successful, which shouldn't happen.

Through trial and error I figured out that this is caused here:

if ps.TryAdd(prov.ID) {

ps.TryAdd(prov.ID) increases the size of ps, which is a sufficient condition for this function to return true:

func() bool {

However, this causes followUpCtx to be cancelled, hence the code jumps to

case <-ctx.Done():

If this happens fast enough, the provider may never be returned, since the send happens in the previous line:

case peerOut <- *prov:
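
To see why the Done branch can win, here is a minimal, self-contained sketch (this is not the DHT code; it only demonstrates Go's select semantics): when several cases of a select are ready, Go picks one pseudo-randomly, so an already-cancelled context can beat a send that would otherwise succeed.

package main

import (
	"context"
	"fmt"
)

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	cancel() // already cancelled, like followUpCtx after the stop condition fires

	out := make(chan int, 1) // the send below could always succeed...
	dropped := 0
	for i := 0; i < 1000; i++ {
		select {
		case out <- i: // ...but select gives it no priority over the Done case
			<-out // drain so the next send is ready again
		case <-ctx.Done():
			dropped++
		}
	}
	fmt.Printf("value dropped in %d of 1000 iterations\n", dropped)
}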

I (probably) solved this problem by using a lock, so lines 533-541 in my code look like this:

peersetMutex.Lock()
if ps.TryAdd(prov.ID) {
	logger.Debugf("using provider: %s", prov)
	select {
	case peerOut <- *prov:
	case <-ctx.Done():
		logger.Debug("context timed out sending more providers")
		return nil, ctx.Err()
	}
}
peersetMutex.Unlock()

Then I modified the function at line 559 as follows:

func() bool {
	peersetMutex.Lock()
	output := !findAll && ps.Size() >= count
	peersetMutex.Unlock()
	return output
}
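
Note that peersetMutex is a new variable (it does not exist in the current code); for the two snippets above to compile it would need to be declared somewhere both call sites can see it, for example:

var peersetMutex sync.Mutex // guards ps.TryAdd / ps.Size across the two snippets above (needs the sync import)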
@vyzo
Contributor

vyzo commented Oct 17, 2021 via email

@nikosft
Author

nikosft commented Oct 17, 2021

To determine whether the bug is there, I used a variable that counts the records sent through peerOut, so the validation code looked similar to the following:

if ps.TryAdd(prov.ID) {
	logger.Debugf("using provider: %s", prov)
	select {
	case peerOut <- *prov:
		returnedItems++
	case <-ctx.Done():
		logger.Debug("context timed out sending more providers")
		if returnedItems < ps.Size() {
			logger.Debug("!!!Something strange happened!!!")
		}
		return nil, ctx.Err()
	}
}

Maybe try something like that, i.e., add a condition somewhere that checks how many providers we have sent back before closing the channel.

@vyzo
Contributor

vyzo commented Oct 18, 2021

@aschmahmann thoughts on the matter?

@aschmahmann
Contributor

@vyzo without looking too deeply at this, I'd believe that a context cancellation race could occur here. There's a tradeoff around whether the channel is allowed to back up and hold a goroutine open due to backpressure, or whether there's a maximal amount of waiting time. For example, with locking, if findProvidersAsync returned a channel that didn't close until the caller had read everything off it, the goroutine could be waiting for a while. That's not necessarily a bad thing, but it's a tradeoff that doesn't seem very worthwhile here.

This doesn't seem like a high-priority item, but a fix that doesn't involve an excessive amount of locking seems reasonable. One fix that comes to mind is adding a condition in the cancel case that checks why it was cancelled and, if it was due to a Stop condition, decides how to abort. We'd have to decide what the abort semantics are, though, such as:

  • immediately try sending down the channel and then exit, using a select with a send case and a default case (probably what I'd recommend; see the sketch after this list)
  • wait a small amount of time (seems ok)
  • hang the goroutine until the consumer fully processes the channel (I understand this, but would need some convincing since this seems likely to cause problems for consumers)
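
For concreteness, a rough sketch of the first option applied to the existing select (just a sketch of the idea; it skips checking why the context was cancelled):

if ps.TryAdd(prov.ID) {
	logger.Debugf("using provider: %s", prov)
	select {
	case peerOut <- *prov:
	case <-ctx.Done():
		// cancelled before the send went through; make one last
		// non-blocking attempt before giving up
		select {
		case peerOut <- *prov:
		default:
			logger.Debug("context cancelled, dropping provider")
		}
		return nil, ctx.Err()
	}
}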

WDYT?

@nikosft
Author

nikosft commented Oct 18, 2021

I ran go test . -run TestProvidesMany -count=100 and three tests failed (with the error reported in issue #726). I modified

if ps.TryAdd(prov.ID) {

as follows:

if ps.TryAdd(prov.ID) {
   peerOut <- *prov
}

This removes the race condition. I then ran go test . -run TestProvidesMany -count=100 three times and got no errors. I did not test it extensively, but there is a chance that issue #726 is related to this one.

@aschmahmann WRT

One fix that comes to mind is adding a condition in the cancel case that checks why it was cancelled

The context used in the select statement is created here:

followUpCtx, cancelFollowUp := context.WithCancel(ctx)
and it is always cancelled when the requested number of providers is found, so this is not an exceptional case. This context is cancelled on every FindProvidersAsync call and, at least in our experiments, the race condition appears approximately once every 1000 requests.

@vyzo
Copy link
Contributor

vyzo commented Oct 19, 2021

I don't think we should hang the goroutine -- immediately sending down the channel seems like the sanest option.
