Panic in route53 plugin fails to release lock and stops responding. #6664

dankilman · 2024-05-06T10:28:31Z

What happened:
We use the route53 plugin with a basic config

route53 <SOME_INTERNAL_DOMAIN>.:<SOME_HOSTED_ZONE> {
  fallthrough
}

This works well for almost all requests. However, at least once a month (but not much more), some DNS query causes this panic (from the errors plugin enabled):

[INFO] SIGTERM: Shutting down servers then terminating
2024-05-05 21:14:46.210	
	/codebuild/output/src802558257/src/pkg/mod/github.com/miekg/dns@v1.1.50/server.go:533 +0x485
2024-05-05 21:14:46.210	
created by github.com/miekg/dns.(*Server).serveUDP
2024-05-05 21:14:46.210	
	/codebuild/output/src802558257/src/pkg/mod/github.com/miekg/dns@v1.1.50/server.go:603 +0x1dc
2024-05-05 21:14:46.210	
github.com/miekg/dns.(*Server).serveUDPPacket(0xc000812000, 0xc000e9ea50?, {0xc000f64400, 0x24, 0x200}, {0x269edc0?, 0xc000486028}, 0xc0009f41c0, {0x0, 0x0})
2024-05-05 21:14:46.210	
	/codebuild/output/src802558257/src/pkg/mod/github.com/miekg/dns@v1.1.50/server.go:659 +0x462
2024-05-05 21:14:46.210	
github.com/miekg/dns.(*Server).serveDNS(0xc000812000, {0xc000f64400, 0x24, 0x200}, 0xc000ae3880)
2024-05-05 21:14:46.210	
	/codebuild/output/src802558257/src/pkg/mod/github.com/miekg/dns@v1.1.50/server.go:37 +0x2f
2024-05-05 21:14:46.210	
github.com/miekg/dns.HandlerFunc.ServeDNS(0xc000f64400?, {0x26a1e10?, 0xc000ae3880?}, 0xc0010d74d0?)
2024-05-05 21:14:46.210	
	/codebuild/output/src802558257/src/github.com/coredns/coredns/core/dnsserver/server.go:178 +0x9b
2024-05-05 21:14:46.210	
github.com/coredns/coredns/core/dnsserver.(*Server).ServePacket.func1({0x26a1e10, 0xc000ae3880}, 0xc0010d7401?)
2024-05-05 21:14:46.210	
	/codebuild/output/src802558257/src/github.com/coredns/coredns/core/dnsserver/server.go:364 +0x867
2024-05-05 21:14:46.210	
github.com/coredns/coredns/core/dnsserver.(*Server).ServeDNS(0xc0001a6a80, {0x2698cd8, 0xc000c8f2f0}, {0x26a0340, 0xc000b00498}, 0xc0010d74d0)
2024-05-05 21:14:46.210	
	/codebuild/output/src802558257/src/github.com/coredns/coredns/plugin/metrics/handler.go:27 +0x255
2024-05-05 21:14:46.210	
github.com/coredns/coredns/plugin/metrics.(*Metrics).ServeDNS(0xc00063dc20, {0x2698cd8, 0xc000c8f2f0}, {0x26a0340?, 0xc000b00498}, 0xc0010d74d0)
2024-05-05 21:14:46.210	
	/codebuild/output/src802558257/src/github.com/coredns/coredns/plugin/plugin.go:80 +0x27e
2024-05-05 21:14:46.210	
github.com/coredns/coredns/plugin.NextOrFailure({0x21dab12?, 0x1?}, {0x26832a0, 0xc000488b40}, {0x2698cd8, 0xc000c8f2f0}, {0x26a0130, 0xc000b2dd00}, 0x0?)
2024-05-05 21:14:46.210	
	/codebuild/output/src802558257/src/github.com/coredns/coredns/plugin/errors/errors.go:84 +0x87
2024-05-05 21:14:46.210	
github.com/coredns/coredns/plugin/errors.(*errorHandler).ServeDNS(0xc000488b40, {0x2698cd8?, 0xc000c8f2f0?}, {0x26a0130, 0xc000b2dd00}, 0xc0010d74d0)
2024-05-05 21:14:46.210	
	/codebuild/output/src802558257/src/github.com/coredns/coredns/plugin/plugin.go:80 +0x27e
2024-05-05 21:14:46.210	
github.com/coredns/coredns/plugin.NextOrFailure({0x21d3f18?, 0x2?}, {0x2688fe8, 0xc00047ff50}, {0x2698cd8, 0xc000c8f2f0}, {0x26a0130, 0xc000b2dd00}, 0xc0007242c0?)
2024-05-05 21:14:46.210	
	/codebuild/output/src802558257/src/github.com/coredns/coredns/plugin/log/log.go:36 +0x2d3
2024-05-05 21:14:46.210	
github.com/coredns/coredns/plugin/log.Logger.ServeDNS({{0x2688f98, 0xc0004822e8}, {0xc000488ba0, 0x1, 0x1}, {}}, {0x2698cd8, 0xc000c8f2f0}, {0x26a0130?, 0xc000b2dd00?}, ...)
2024-05-05 21:14:46.210	
	/codebuild/output/src802558257/src/github.com/coredns/coredns/plugin/plugin.go:80 +0x27e
2024-05-05 21:14:46.210	
github.com/coredns/coredns/plugin.NextOrFailure({0x21d12f1?, 0x1?}, {0x2688f98, 0xc0004822e8}, {0x2698cd8, 0xc000c8f2f0}, {0x26a01e0, 0xc000b2dd40}, 0x561c45?)
2024-05-05 21:14:46.210	
	/codebuild/output/src802558257/src/github.com/coredns/coredns/plugin/loadbalance/handler.go:21 +0xf6
2024-05-05 21:14:46.210	
github.com/coredns/coredns/plugin/loadbalance.LoadBalance.ServeDNS({{0x26831d8?, 0xc0001c4a00?}, 0x22f2ec8?}, {0x2698cd8, 0xc000c8f2f0}, {0x26a01e0?, 0xc000b2dd40}, 0xc000e9eab0?)
2024-05-05 21:14:46.210	
	/codebuild/output/src802558257/src/github.com/coredns/coredns/plugin/plugin.go:80 +0x27e
2024-05-05 21:14:46.210	
github.com/coredns/coredns/plugin.NextOrFailure({0x21dc21e?, 0x40e007?}, {0x26831d8, 0xc0001c4a00}, {0x2698cd8, 0xc000c8f2f0}, {0x26a00d8, 0xc000b004b0}, 0xc00005e0d8?)
2024-05-05 21:14:46.210	
	/codebuild/output/src802558257/src/github.com/coredns/coredns/plugin/cache/handler.go:41 +0xd8d
2024-05-05 21:14:46.210	
github.com/coredns/coredns/plugin/cache.(*Cache).ServeDNS(0xc0001c4a00, {0x2698cd8?, 0xc000c8f2f0}, {0x26a00d8, 0xc000b004b0}, 0xc0010d74d0)
2024-05-05 21:14:46.210	
	/codebuild/output/src802558257/src/github.com/coredns/coredns/plugin/cache/handler.go:107
2024-05-05 21:14:46.210	
github.com/coredns/coredns/plugin/cache.(*Cache).doRefresh(...)
2024-05-05 21:14:46.210	
	/codebuild/output/src802558257/src/github.com/coredns/coredns/plugin/plugin.go:80 +0x27e
2024-05-05 21:14:46.210	
github.com/coredns/coredns/plugin.NextOrFailure({0x21d28f1?, 0xc000c8f2f0?}, {0x2683890, 0xc0001a6180}, {0x2698cd8, 0xc000c8f2f0}, {0x269ff20, 0xc0006ca000}, 0x0?)
2024-05-05 21:14:46.210	
	/codebuild/output/src802558257/src/github.com/coredns/coredns/plugin/route53/route53.go:125 +0x37f
2024-05-05 21:14:46.210	
github.com/coredns/coredns/plugin/route53.(*Route53).ServeDNS(0xc0001a6180, {0x2698cd8, 0xc000c8f2f0}, {0x269ff20?, 0xc0006ca000?}, 0xc0010d7560)
2024-05-05 21:14:46.210	
	/codebuild/output/src802558257/src/github.com/coredns/coredns/plugin/file/lookup.go:96 +0x851
2024-05-05 21:14:46.210	
github.com/coredns/coredns/plugin/file.(*Zone).Lookup(0xc0010b69a0, {0x2698cd8, 0xc000c8f2f0}, {0xc0010d7560, {0x269ff20, 0xc0006ca000}, {0x0, 0x0}, 0x200, 0x0, ...}, ...)
2024-05-05 21:14:46.210	
	/codebuild/output/src802558257/src/github.com/coredns/coredns/plugin/file/tree/tree.go:145 +0x2a
2024-05-05 21:14:46.210	
github.com/coredns/coredns/plugin/file/tree.(*Tree).Search(0xc0010b69a0?, {0xc000dfb350?, 0x13?})
2024-05-05 21:14:46.210	
	/codebuild/output/src802558257/src/github.com/coredns/coredns/plugin/file/tree/tree.go:155 +0x45
2024-05-05 21:14:46.210	
github.com/coredns/coredns/plugin/file/tree.(*Node).search(0x757448?, {0xc000dfb350, 0x13})
2024-05-05 21:14:46.210	
	/codebuild/output/src802558257/src/github.com/coredns/coredns/plugin/file/tree/elem.go:101 +0x3a
2024-05-05 21:14:46.210	
github.com/coredns/coredns/plugin/file/tree.Less(0xc0003fa637?, {0xc000dfb350, 0x13})
2024-05-05 21:14:46.210	
	/codebuild/output/src802558257/src/github.com/coredns/coredns/plugin/file/tree/less.go:23 +0xaf
2024-05-05 21:14:46.210	
github.com/coredns/coredns/plugin/file/tree.less({0xc000dfb350, 0x13}, {0x0, 0x1a})
2024-05-05 21:14:46.210	
	/codebuild/output/src802558257/src/pkg/mod/github.com/miekg/dns@v1.1.50/labels.go:163 +0x21
2024-05-05 21:14:46.210	
github.com/miekg/dns.PrevLabel({0x0?, 0xc000b46708?}, 0x7f347d2adf18?)
2024-05-05 21:14:46.210	
	/usr/local/go/src/runtime/panic.go:884 +0x213
2024-05-05 21:14:46.210	
panic({0x1ed81c0, 0x36eec00})
2024-05-05 21:14:46.210	
	/codebuild/output/src802558257/src/github.com/coredns/coredns/core/dnsserver/server.go:266 +0x75
2024-05-05 21:14:46.210	
github.com/coredns/coredns/core/dnsserver.(*Server).ServeDNS.func1()
2024-05-05 21:14:46.210	
	/usr/local/go/src/runtime/debug/stack.go:24 +0x65
2024-05-05 21:14:46.210	
runtime/debug.Stack()
2024-05-05 21:14:46.210	
goroutine 17552119 [running]:
2024-05-05 21:14:46.209	
[ERROR] Recovered from panic in server: "dns://:53" runtime error: invalid memory address or nil pointer dereference

Such rare occurrence would not be an issue if it only affected the "bad query" (which we have yet to identify even with the log plugin class error configured).

However, it seems that this panic occurs here

h.zMu.RLock()
m.Answer, m.Ns, m.Extra, result = hostedZone.z.Lookup(ctx, state, qname)  // <--- panic down this stack
h.zMu.RUnlock()

so nothing ever releases this lock. The observed behavior afterwards is that CoreDNS stops responding to any request made to our internal domain (timing out) and starts leaking memory until being terminated after breaching its memory limit.

What you expected to happen:
For the lock to always release regardless (and of less importance in our case, for the panic to not occur in the first place)

How to reproduce it (as minimally and precisely as possible):
Sadly, we are unable to reproduce this issue as of yet.

Environment:
We are running on EKS with K8s v1.29

the version of CoreDNS: 1.10.1
Corefile:

    .:53 {
      # Enable error logging (with stacktraces)
      errors {
        stacktrace
      }
      # Enable request logging of erroneous requests
      log . {combined} {
        class error
      }

      # Expose http://:8080/health for liveness probes
      # Note, `health` is process wide, so can only be enabled once
      # https://coredns.io/plugins/health/
      health {
        # Respond as healthy during first 5 seconds of shutdown
        # Ensures that existing requests are handled successfully
        lameduck 5s
      }

      # Expose http://:8181/ready for readiness probe
      ready

      # Resolve cluster.local domains via the Kubernetes Plugin (e.g., `{svc}.{ns}.cluster.local`)
      kubernetes cluster.local in-addr.arpa ip6.arpa {
        pods insecure
        fallthrough in-addr.arpa ip6.arpa
        ttl 30
      }

      # Resolve ... via Route53, so we can resolve private domains
      route53 <redacted>.:%K8S_HZ_ID% {
        fallthrough
      }

      # Otherwise, fallback to the VPC's Route 53 Resolver
      # https://docs.aws.amazon.com/vpc/latest/userguide/vpc-dns.html
      forward . 169.254.169.253

      # Enable Prometheus metrics
      prometheus :9153
      
      # Enable caching with a TTL of 30s
      cache 30

      # Enable loop detection
      loop

      # Hot-load changes to this file
      reload

      # Randomize order of resolved records in A/AAAA requests
      loadbalance
    }

The text was updated successfully, but these errors were encountered:

dankilman · 2024-05-08T13:02:17Z

Interestingly enough I think #6669 is actually the root cause for the panic.

Following the stack trace, the null value seems to stem from

2024-05-05 21:14:46.210	
	/codebuild/output/src802558257/src/github.com/coredns/coredns/plugin/file/tree/less.go:23 +0xaf
2024-05-05 21:14:46.210	
github.com/coredns/coredns/plugin/file/tree.less({0xc000dfb350, 0x13}, {0x0, 0x1a})

Notice the second argument is nil (0x0) with 26 (0x1a) in length.

Looking at the call to less:

func Less(a *Elem, name string) int { return less(name, a.Name()) }

it seems a.Name() returns a partially constructed value.

The implementation of Name():

func (e *Elem) Name() string {
	if e.name != "" {
		return e.name
	}
	for _, rrs := range e.m {
		e.name = rrs[0].Header().Name
		return e.name
	}
	return ""
}

I'm not familiar with concurrency in Go well enough to know if it is possible, but Name() is being called by readers holding the RLock so the assumption is that the state will not be changed under their feet but in this case, the side effect will assign a value during read which may cause other readers to hold a corrupted value.

dankilman · 2024-05-09T00:51:40Z

This

package main

import (
	"sync"
	"time"
)

type S struct {
	F string
	lock sync.RWMutex
}

func (s *S) GetSet(value string) string {
	if s.F != "" {
		return s.F
	}
	s.F = value
	return s.F
}

func Check(a string) {
	if a == "" {
		return
	}
	if a[0] == '1' {  // line 25
		println(a[0])
	}
}

func main() {
	s := &S{}
	go func() {
		for {
			s.lock.Lock()
			s.F = ""
			s.lock.Unlock()
			time.Sleep(10)
		}
	}()
	for i := 0; i < 100; i++ {
		go func() {
			for {
				s.lock.RLock()
				Check(s.GetSet("asdf"))
				s.lock.RUnlock()
			}
		}()
	}
	time.Sleep(1 * time.Hour)
}

is a very simplified toy reproduction of what logically causes the panic. it outputs

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x2 addr=0x0 pc=0x1004b18c8]

goroutine 79 [running]:
main.Check(...)
        .../scratch_1.go:25
main.main.func2()
       .../scratch_1.go:44 +0xc8
created by main.main in goroutine 1
       .../scratch_1.go:41 +0x78

In the example GetSet corresponds with a.Name() of Elem and Check corresponds with PrevLabel (of dns) (which is called by less with input coming from a.Name())

dilyevsky · 2024-05-09T23:22:25Z

@dankilman thanks for report! (and digging into this). It def looks like an unprotected access, I'll get that fixed.

mcouthon · 2024-05-21T11:31:31Z

@dilyevsky, sorry to bother you, but any chance you have an ETA on this fix?

dankilman added the bug label May 6, 2024

dilyevsky self-assigned this May 9, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Panic in route53 plugin fails to release lock and stops responding. #6664

Panic in route53 plugin fails to release lock and stops responding. #6664

dankilman commented May 6, 2024 •

edited

dankilman commented May 8, 2024 •

edited

dankilman commented May 9, 2024 •

edited

dilyevsky commented May 9, 2024

mcouthon commented May 21, 2024

Panic in route53 plugin fails to release lock and stops responding. #6664

Panic in route53 plugin fails to release lock and stops responding. #6664

Comments

dankilman commented May 6, 2024 • edited

dankilman commented May 8, 2024 • edited

dankilman commented May 9, 2024 • edited

dilyevsky commented May 9, 2024

mcouthon commented May 21, 2024

dankilman commented May 6, 2024 •

edited

dankilman commented May 8, 2024 •

edited

dankilman commented May 9, 2024 •

edited