Intermittent NXDOMAIN when running an MPI job with hundreds of pods #6658

Open · bala19 opened this issue Apr 30, 2024 · 0 comments

bala19 commented Apr 30, 2024

What happened:
We are observing CoreDNS intermittently returning NXDOMAIN for the cluster-local FQDNs of pods running as part of a Kubeflow MPIJob. The issue is observed only at scale, when hundreds of pods run as part of the same job. If we make thousands of DNS queries within a few seconds for the FQDNs of all pods in a job, NXDOMAIN is returned 2 to 15 times (< 1% of queries). The NXDOMAIN responses are for random pods in the job each time, and they come from random CoreDNS pods. For the same domain name of an active pod, CoreDNS returns NOERROR, then NXDOMAIN, then NOERROR again, all within a few seconds.
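
A rough sketch of the query loop we use, written here with dnspython (the FQDN pattern matches the logs below; the worker count, namespace, and number of rounds are placeholders, not our exact values):

    #!/usr/bin/env python3
    # Burst of A queries against the cluster DNS; counts NXDOMAIN responses.
    import dns.resolver

    NUM_WORKERS = 400  # placeholder job size
    FQDN = "worker-{i}.worker.default.svc.cluster.local."

    resolver = dns.resolver.Resolver()  # reads /etc/resolv.conf by default

    nxdomain = 0
    for round_no in range(10):          # repeat the burst a few times
        for i in range(NUM_WORKERS):
            name = FQDN.format(i=i)
            try:
                resolver.resolve(name, "A")
            except dns.resolver.NXDOMAIN:
                nxdomain += 1
                print(f"NXDOMAIN for {name} (round {round_no})")
    print(f"{nxdomain} NXDOMAIN responses in total")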

We saw this issue on 10 nodes (EC2 m5.xlarge) in an EKS cluster running Kubernetes 1.24. The MPI job's pods set a subdomain pointing to a headless service, roughly as in the sketch below.
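
A hypothetical minimal shape of that naming setup (not our exact manifests; names and image are placeholders):

    apiVersion: v1
    kind: Service
    metadata:
      name: worker            # matches the "worker" subdomain in the FQDNs
      namespace: default
    spec:
      clusterIP: None         # headless: per-pod A records, no service VIP
      selector:
        app: mpi-worker
    ---
    apiVersion: v1
    kind: Pod
    metadata:
      name: worker-0
      namespace: default
      labels:
        app: mpi-worker
    spec:
      hostname: worker-0      # yields worker-0.worker.default.svc.cluster.local
      subdomain: worker       # must match the headless Service name
      containers:
        - name: main
          image: example.com/mpi-worker:latest  # placeholder image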

What you expected to happen:
CoreDNS should not intermittently return NXDOMAIN for active pods.

How to reproduce it (as minimally and precisely as possible):
We will add a minimal MPIJob that reproduces the issue.

Anything else we need to know?:
We want to understand how CoreDNS and its kubernetes plugin handle caching for cluster-local domains, and whether a cache sync issue could be causing this.
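
To see which CoreDNS instance returns the NXDOMAIN (and rule out client-side effects), we query each CoreDNS pod directly instead of the service VIP. A dnspython sketch (the pod IPs are placeholders; on our clusters we list them with kubectl -n kube-system get pods -l k8s-app=kube-dns -o wide):

    #!/usr/bin/env python3
    # Query each CoreDNS pod directly to see which instance returns NXDOMAIN.
    import dns.resolver

    COREDNS_POD_IPS = ["10.3.1.10", "10.3.2.11"]  # placeholders
    NAME = "worker-107.worker.default.svc.cluster.local."

    for ip in COREDNS_POD_IPS:
        r = dns.resolver.Resolver(configure=False)  # ignore /etc/resolv.conf
        r.nameservers = [ip]
        try:
            answer = r.resolve(NAME, "A")
            print(ip, "NOERROR", [a.to_text() for a in answer])
        except dns.resolver.NXDOMAIN:
            print(ip, "NXDOMAIN")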

Environment:

  • the version of CoreDNS: v1.9.3-eksbuild.11 (also reproducible in v1.10.1-eksbuild.7)
  • Corefile:
.:53 {
        log
        errors
        health {
          lameduck 5s
        }
        kubernetes cluster.local in-addr.arpa ip6.arpa {
          pods insecure
          fallthrough in-addr.arpa ip6.arpa
          endpoint_pod_names
        }
        cache 30
        prometheus :9153
        forward . /etc/resolv.conf
        loop
        reload
        loadbalance
}
  • logs, if applicable: CoreDNS logs
[pod/coredns-57c45885c5-9gxlg/coredns] [INFO] 10.3.245.19:59714 - 59112 "A IN worker-107.worker.default.svc.cluster.local. udp 142 false 4096" NXDOMAIN qr,aa,rd 224 0.000116152s
[pod/coredns-57c45885c5-bkr9d/coredns] [INFO] 10.3.245.19:33152 - 5311 "A IN worker-109.worker.default.svc.cluster.local. udp 142 false 4096" NXDOMAIN qr,aa,rd 224 0.000143503s
[pod/coredns-57c45885c5-9gxlg/coredns] [INFO] 10.3.245.19:49892 - 14415 "A IN worker-306.worker.default.svc.cluster.local. udp 142 false 4096" NXDOMAIN qr,aa,rd 224 0.000120772s
[pod/coredns-57c45885c5-bkr9d/coredns] [INFO] 10.3.245.19:44248 - 53721 "A IN worker-198.worker.default.svc.cluster.local. udp 142 false 4096" NXDOMAIN qr,aa,rd 224 0.000123284s
[pod/coredns-57c45885c5-bkr9d/coredns] [INFO] 10.3.245.19:34874 - 15278 "A IN worker-199.worker.default.svc.cluster.local. udp 142 false 4096" NXDOMAIN qr,aa,rd 224 0.000116685s
[pod/coredns-57c45885c5-bkr9d/coredns] [INFO] 10.3.245.19:48759 - 7303 "A IN worker-200.worker.default.svc.cluster.local. udp 142 false 4096" NXDOMAIN qr,aa,rd 224 0.000343288s
  • OS (e.g: cat /etc/os-release):

  • Others:

bala19 added the bug label on Apr 30, 2024