Panic in route53 plugin fails to release lock and stops responding. #6664
Comments
Interestingly enough, I think #6669 is actually the root cause of the panic. Following the stack trace, the nil value seems to stem from
Notice the second argument is nil (0x0) with 26 (0x1a) in length. Looking at the call to
func Less(a *Elem, name string) int { return less(name, a.Name()) }
it seems
The implementation of
func (e *Elem) Name() string {
	if e.name != "" {
		return e.name
	}
	for _, rrs := range e.m {
		e.name = rrs[0].Header().Name
		return e.name
	}
	return ""
}
I'm not familiar with concurrency in Go well enough to know if it is possible, but
This

package main

import (
	"sync"
	"time"
)

type S struct {
	F    string
	lock sync.RWMutex
}

func (s *S) GetSet(value string) string {
	if s.F != "" {
		return s.F
	}
	s.F = value // unsynchronized write: callers hold only the read lock
	return s.F
}

func Check(a string) {
	if a == "" {
		return
	}
	if a[0] == '1' { // line 25
		println(a[0])
	}
}

func main() {
	s := &S{}
	// Writer: periodically resets F under the write lock.
	go func() {
		for {
			s.lock.Lock()
			s.F = ""
			s.lock.Unlock()
			time.Sleep(10) // 10 nanoseconds
		}
	}()
	// Readers: many goroutines call GetSet concurrently under the read lock,
	// so their writes to F race with each other.
	for i := 0; i < 100; i++ {
		go func() {
			for {
				s.lock.RLock()
				Check(s.GetSet("asdf"))
				s.lock.RUnlock()
			}
		}()
	}
	time.Sleep(1 * time.Hour)
}

is a very simplified toy reproduction of what logically causes the panic. It outputs
In the example
@dankilman thanks for the report (and for digging into this)! It definitely looks like an unprotected access; I'll get that fixed.
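For readers following along, one way to protect a lazily cached name like this is to guard both the read and the write with a mutex. The following is only a minimal sketch under stated assumptions: `elem`, its `records` field, and the string-based record list are hypothetical stand-ins, not the CoreDNS types, and this is not necessarily how the actual fix was implemented.

```go
package main

import (
	"fmt"
	"sync"
)

// elem is a hypothetical stand-in for a tree element that caches its name.
// It is NOT the CoreDNS Elem type.
type elem struct {
	mu      sync.Mutex
	name    string
	records map[uint16][]string // record type -> owner names (simplified)
}

// Name returns the cached name, filling the cache under the mutex so that
// no goroutine can observe a partially written string.
func (e *elem) Name() string {
	e.mu.Lock()
	defer e.mu.Unlock()
	if e.name != "" {
		return e.name
	}
	for _, rrs := range e.records {
		if len(rrs) > 0 {
			e.name = rrs[0]
			return e.name
		}
	}
	return ""
}

func main() {
	e := &elem{records: map[uint16][]string{1: {"example.org."}}}
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			if got := e.Name(); got != "example.org." {
				panic("unexpected name: " + got)
			}
		}()
	}
	wg.Wait()
	fmt.Println(e.Name()) // prints "example.org."
}
```

An alternative with the same effect would be to compute the name once when the element is created, so that readers never write to it at all.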
@dilyevsky, sorry to bother you, but any chance you have an ETA on this fix?
What happened:
We use the route53 plugin with a basic config.
This works well for almost all requests. However, at least once a month (but not much more often), some DNS query causes this panic (from the errors plugin, which is enabled):
Such a rare occurrence would not be an issue if it only affected the "bad query" (which we have yet to identify, even with the log plugin configured with class error).
However, it seems that this panic occurs here
so nothing ever releases this lock. The observed behavior afterwards is that CoreDNS stops responding to any request made to our internal domain (timing out) and starts leaking memory until being terminated after breaching its memory limit.
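To illustrate the failure mode described above: in Go, a lock released with `defer` is still released when the protected code panics and the panic is later recovered, whereas an explicit unlock placed after the panicking code is never reached. The sketch below assumes a hypothetical `store` type with `lookup` and `safeLookup` helpers; none of these names come from the route53 plugin.

```go
package main

import (
	"fmt"
	"sync"
)

// store is a hypothetical stand-in for a plugin's shared state.
type store struct {
	mu    sync.RWMutex
	zones map[string]string
}

// lookup takes the read lock and releases it with defer, so the lock is
// released even if the body panics.
func (s *store) lookup(name string) string {
	s.mu.RLock()
	defer s.mu.RUnlock()
	if name == "" {
		panic("empty name") // simulated "bad query"
	}
	return s.zones[name]
}

// safeLookup turns a panic into an error, roughly what a recovering
// handler higher up the call chain would do.
func (s *store) safeLookup(name string) (v string, err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("recovered: %v", r)
		}
	}()
	return s.lookup(name), nil
}

func main() {
	s := &store{zones: map[string]string{"a.example.": "10.0.0.1"}}

	_, err := s.safeLookup("")
	fmt.Println("error:", err) // prints "error: recovered: empty name"

	// The write lock can still be acquired because the deferred RUnlock ran.
	s.mu.Lock()
	s.zones["b.example."] = "10.0.0.2"
	s.mu.Unlock()

	v, _ := s.safeLookup("b.example.")
	fmt.Println(v) // prints "10.0.0.2"
}
```

If `lookup` instead called `s.mu.RUnlock()` explicitly after the panicking line, the later `s.mu.Lock()` would block forever, which matches the observed behavior of requests timing out.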
What you expected to happen:
For the lock to always be released regardless (and, of less importance in our case, for the panic not to occur in the first place).
How to reproduce it (as minimally and precisely as possible):
Sadly, we are unable to reproduce this issue as of yet.
Environment:
We are running on EKS with K8s v1.29