Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BoundedFrequencyRunner Loop maybe into an infinite loop #103286

Closed
magic-hk opened this issue Jun 29, 2021 · 23 comments · Fixed by #104014
Closed

BoundedFrequencyRunner Loop maybe into an infinite loop #103286

magic-hk opened this issue Jun 29, 2021 · 23 comments · Fixed by #104014
Assignees
Labels
kind/bug Categorizes issue or PR as related to a bug. sig/network Categorizes an issue or PR as relevant to SIG Network. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@magic-hk
Copy link

magic-hk commented Jun 29, 2021

What happened:

func main() {
	runner := NewBoundedFrequencyRunner("syncFunc", syncFunc, 3*time.Second, 10*time.Second, 1)
	go runner.Loop(wait.NeverStop)
	select {}
}
func syncFunc() {
        fmt.Println("do syncFunc")
}

bfr.limiter.TryAccept() will always false in tryRun() and runner.Loop will into an infinite loop after 11 days and 14 hours ( 1000800s=1000800s/10s/cycle= 100080cycle)

What you expected to happen:

No matter how long runner.Loop runs, syncFunc() can be called every 10 seconds

How to reproduce it (as minimally and precisely as possible):

1、modify tryRun()

func (bfr *BoundedFrequencyRunner) tryRun() {
	bfr.mu.Lock()
	defer bfr.mu.Unlock()
        
        //---------------Modify----------------------
	//if bfr.limiter.TryAccept() {
	//	// We're allowed to run the function right now.
	//	bfr.fn()
	//	bfr.lastRun = bfr.timer.Now()
	//	bfr.timer.Stop()
	//	bfr.timer.Reset(bfr.maxInterval)
	//	klog.V(3).Infof("%s: ran, next possible in %v, periodic in %v", bfr.name, bfr.minInterval, bfr.maxInterval)
	//	return
	//}
	for i := 1; i < 100000000; i++ {
		if !bfr.limiter.TryAccept() {
                       // will print 100080
			klog.Infof("----un accept %v", i)
			break
		}
	}
       //------------------------------------------
      
	// It can't run right now, figure out when it can run next.
	elapsed := bfr.timer.Since(bfr.lastRun)   // how long since last run
	nextPossible := bfr.minInterval - elapsed // time to next possible run
	nextScheduled := bfr.timer.Remaining()    // time to next scheduled run
	klog.V(4).Infof("%s: %v since last run, possible in %v, scheduled in %v", bfr.name, elapsed, nextPossible, nextScheduled)
	
       //---------------Modify----------------------
        if nextPossible < 0 {
		klog.Infof("----nextScheduled %v < 0", nextScheduled)
		return
	}
        //------------------------------------------

	// It's hard to avoid race conditions in the unit tests unless we always reset
	// the timer here, even when it's unchanged
	if nextPossible < nextScheduled {
		nextScheduled = nextPossible
	}
	bfr.timer.Stop()
	bfr.timer.Reset(nextScheduled)
}

2、limiter.advance

last := lim.last
	if now.Before(last) {
		last = now
	}

	// Avoid making delta overflow below when last is very old.
	maxElapsed := lim.limit.durationFromTokens(float64(lim.burst) - lim.tokens)

        //---------------Modify----------------------
	//elapsed := now.Sub(last)
	//if elapsed > maxElapsed {
	//	elapsed = maxElapsed
	//}
       // if no event, elapsed always = maxInterval > minInterval >=maxElapsed 
	elapsed := maxElapsed
        //------------------------------------------

	// Calculate the new number of tokens, due to time that passed.
	delta := lim.limit.tokensFromDuration(elapsed)
	tokens := lim.tokens + delta
	if burst := float64(lim.burst); tokens > burst {
		tokens = burst
	}

	return now, last, tokens

if TryAccept() is false, and elapsed = bfr.maxInterval, nextScheduled=nextPossible < 0, and runner.Loop into an infinite loop(TryAccept() will always false, and syncFunc() can not be invoked )

Anything else we need to know?:

if miniInterval is not 2^*, TokenBucketRateLimiter.limit will loss of accuracy, and tokens have a deviation value of delta-1 every maxInterval duration。

if nextScheduled less than zero should set tokens to zero as a fix

Environment:

  • Kubernetes version (use kubectl version):
  • Cloud provider or hardware configuration:
  • OS (e.g: cat /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Network plugin and version (if this is a network-related bug):
  • Others:
@magic-hk magic-hk added the kind/bug Categorizes issue or PR as related to a bug. label Jun 29, 2021
@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jun 29, 2021
@pacoxu
Copy link
Member

pacoxu commented Jun 29, 2021

/sig network

@k8s-ci-robot k8s-ci-robot added sig/network Categorizes an issue or PR as relevant to SIG Network. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jun 29, 2021
@aojea
Copy link
Member

aojea commented Jul 7, 2021

@danwinship @thockin you are the ones that knows better the NewBoundedFrequencyRunner

@thockin thockin self-assigned this Jul 8, 2021
@thockin
Copy link
Member

thockin commented Jul 8, 2021

Where is 1000800s coming from?

@thockin
Copy link
Member

thockin commented Jul 8, 2021

I'm having a little trouble following the repro here. It's clear there's some bound-checking to be done to prevent negative values, but I am not clear what's causing it to go negative or what your repro is trying to prove. Help me understand?

@magic-hk
Copy link
Author

magic-hk commented Jul 9, 2021

Once bfr.limiter.TryAccept() is false and nextScheduled is negative value, bfr.limiter.TryAccept() will always be false which lead to bfr.fn() whill never be invoked any more and BoundedFrequencyRunner Loop into an infinite loop which lead to cpu 100%. I'm having this problem in a production environment. The same programs running in different clusters filled the CPU after 11 days and 14 hours and pprof indicates that the problem is with Loop().

@magic-hk
Copy link
Author

magic-hk commented Jul 9, 2021

I think if bfr.limiter.TryAccept() is false and nextScheduled is negative value, should reset tokens to zero and nextScheduled to minInterval or maxInterval

@thockin
Copy link
Member

thockin commented Jul 9, 2021 via email

@magic-hk
Copy link
Author

magic-hk commented Jul 9, 2021

Because tokens is a cumulative value,once the deviation is generated, it cannot be corrected automatically in the case of periodic execution, waitDuration will always more than zero.

@magic-hk
Copy link
Author

magic-hk commented Jul 9, 2021

I'm having this problem in a production environment, and the log information proves this fact

@magic-hk
Copy link
Author

magic-hk commented Jul 9, 2021

The time to cause the problem depends on the minInterval.

@thockin
Copy link
Member

thockin commented Jul 9, 2021

hi, I'm feeling rather dumb. Which tokens value do you mean Deep in golang.org/x/time/rate?

I'm not disputing that there's something amiss, but frankly I don't understand the repro. Your changes break the functionality.
How do I get INTO this situation, without modifying all of the libs in question?

@magic-hk
Copy link
Author

magic-hk commented Jul 9, 2021

Run this and wait 11 days and 14 hours(i think i maybe need to find a more min minInterval, which can lead to this problem)
func main() {
runner := NewBoundedFrequencyRunner("syncFunc", syncFunc, 3time.Second, 10time.Second, 1)
go runner.Loop(wait.NeverStop)
select {}
}
func syncFunc() {
fmt.Println("do syncFunc")
}

@thockin
Copy link
Member

thockin commented Jul 9, 2021 via email

@magic-hk
Copy link
Author

magic-hk commented Jul 9, 2021

Haha, 11 days and 14 hours magic number decide by minInterval and maxInterval. MinInterval decides cycle value(100080).You can change minInterval to 4s, and repro will need 100080*4s.
The root cause is the token deviation caused by the loss of limit accuracy.
If i find one more min minInterval, i wll tell you. It need to try.

@magic-hk
Copy link
Author

magic-hk commented Jul 9, 2021

runner := NewBoundedFrequencyRunner("syncFunc", syncFunc, 1031425*time.Microsecond, 2*time.Second, 1)

Cycle:8
it will repro after 16s

@thockin
Copy link
Member

thockin commented Jul 9, 2021

I see it. Changing all the libs to float64 makes THIS repro go away, but I'll dig into it so that I can plausibly explain it.

@thockin
Copy link
Member

thockin commented Jul 9, 2021

Here's a simpler repro, cutting BFR out of the loop:

package main

import (
    "fmt"
    "time"

    "golang.org/x/time/rate"
)

func main() {
    minInterval := 1031425 * time.Microsecond
    qps := float32(time.Second) / float32(minInterval)
    limiter := rate.NewLimiter(rate.Limit(qps), 1)
    for {
        if limiter.Allow() {
            syncFunc()
            time.Sleep(2 * time.Second)
        } else {
            fmt.Printf("limiter said no, retry in 1s\n")                        
            time.Sleep(time.Second)
        }   
    }   
}

func syncFunc() {
    fmt.Printf("------------------------------- do syncFunc %v --------------------------------\n\n\n\n", time.Now())
}

@thockin
Copy link
Member

thockin commented Jul 9, 2021

golang/go#46579

@thockin
Copy link
Member

thockin commented Jul 9, 2021

Fix proposed in https://golang.org/cl/325289

@magic-hk
Copy link
Author

OK, thanks

@dcbw
Copy link
Member

dcbw commented Jul 22, 2021

/triage accepted
Let's keep this open for now to track when/if that upstream PR lands.

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jul 22, 2021
@thockin
Copy link
Member

thockin commented Jul 29, 2021

The upstream fix merged. We need to update our dep on golang.org/x/time/rate to pick it up.

@thockin
Copy link
Member

thockin commented Jul 29, 2021

#104014 offered

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/bug Categorizes issue or PR as related to a bug. sig/network Categorizes an issue or PR as relevant to SIG Network. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants