Intermittent NXDOMAIN when running an MPI job with hundreds of pods #6658

Open · bala19 opened this issue Apr 30, 2024 · 0 comments

bala19 commented Apr 30, 2024

What happened:
We are observing CoreDNS intermittently returning NXDOMAIN for the cluster-local FQDNs of pods running as part of a Kubeflow MPIJob. The issue is observed only at scale, when hundreds of pods run as part of the same job. If we make thousands of DNS queries within a few seconds for the FQDNs of all pods in a job, NXDOMAIN is returned 2 to 15 times (< 1% of queries). The NXDOMAIN responses are for random pods in the job each time, and they come from random CoreDNS pods. For the same domain name of an active pod, CoreDNS returns NOERROR, then NXDOMAIN, then NOERROR again, all within a few seconds.
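
A rough sketch of the query loop we use, written here with dnspython (the FQDN pattern matches the logs below; the worker count, namespace, and number of rounds are placeholders, not our exact values):

    #!/usr/bin/env python3
    # Burst of A queries against the cluster DNS; counts NXDOMAIN responses.
    import dns.resolver

    NUM_WORKERS = 400  # placeholder job size
    FQDN = "worker-{i}.worker.default.svc.cluster.local."

    resolver = dns.resolver.Resolver()  # reads /etc/resolv.conf by default

    nxdomain = 0
    for round_no in range(10):          # repeat the burst a few times
        for i in range(NUM_WORKERS):
            name = FQDN.format(i=i)
            try:
                resolver.resolve(name, "A")
            except dns.resolver.NXDOMAIN:
                nxdomain += 1
                print(f"NXDOMAIN for {name} (round {round_no})")
    print(f"{nxdomain} NXDOMAIN responses in total")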

We saw this issue on 10 nodes (EC2 m5.xlarge) in an EKS cluster running Kubernetes 1.24. The MPI job's pods set a subdomain pointing to a headless service, roughly as in the sketch below.
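
A hypothetical minimal shape of that naming setup (not our exact manifests; names and image are placeholders):

    apiVersion: v1
    kind: Service
    metadata:
      name: worker            # matches the "worker" subdomain in the FQDNs
      namespace: default
    spec:
      clusterIP: None         # headless: per-pod A records, no service VIP
      selector:
        app: mpi-worker
    ---
    apiVersion: v1
    kind: Pod
    metadata:
      name: worker-0
      namespace: default
      labels:
        app: mpi-worker
    spec:
      hostname: worker-0      # yields worker-0.worker.default.svc.cluster.local
      subdomain: worker       # must match the headless Service name
      containers:
        - name: main
          image: example.com/mpi-worker:latest  # placeholder image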

What you expected to happen:
CoreDNS should not intermittently return NXDOMAIN for active pods.

How to reproduce it (as minimally and precisely as possible):
We will add a minimal MPIJob that reproduces the issue.

Anything else we need to know?:
We want to understand how CoreDNS and its kubernetes plugin handle caching for cluster-local domains, and whether a cache sync issue could be causing this.
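
To see which CoreDNS instance returns the NXDOMAIN (and rule out client-side effects), we query each CoreDNS pod directly instead of the service VIP. A dnspython sketch (the pod IPs are placeholders; on our clusters we list them with kubectl -n kube-system get pods -l k8s-app=kube-dns -o wide):

    #!/usr/bin/env python3
    # Query each CoreDNS pod directly to see which instance returns NXDOMAIN.
    import dns.resolver

    COREDNS_POD_IPS = ["10.3.1.10", "10.3.2.11"]  # placeholders
    NAME = "worker-107.worker.default.svc.cluster.local."

    for ip in COREDNS_POD_IPS:
        r = dns.resolver.Resolver(configure=False)  # ignore /etc/resolv.conf
        r.nameservers = [ip]
        try:
            answer = r.resolve(NAME, "A")
            print(ip, "NOERROR", [a.to_text() for a in answer])
        except dns.resolver.NXDOMAIN:
            print(ip, "NXDOMAIN")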

Environment:

  • the version of CoreDNS: v1.9.3-eksbuild.11 (also reproducible in v1.10.1-eksbuild.7)
  • Corefile:
.:53 {
        log
        errors
        health {
          lameduck 5s
        }
        kubernetes cluster.local in-addr.arpa ip6.arpa {
          pods insecure
          fallthrough in-addr.arpa ip6.arpa
          endpoint_pod_names
        }
        cache 30
        prometheus :9153
        forward . /etc/resolv.conf
        loop
        reload
        loadbalance
}
  • logs, if applicable: CoreDNS logs
[pod/coredns-57c45885c5-9gxlg/coredns] [INFO] 10.3.245.19:59714 - 59112 "A IN worker-107.worker.default.svc.cluster.local. udp 142 false 4096" NXDOMAIN qr,aa,rd 224 0.000116152s
[pod/coredns-57c45885c5-bkr9d/coredns] [INFO] 10.3.245.19:33152 - 5311 "A IN worker-109.worker.default.svc.cluster.local. udp 142 false 4096" NXDOMAIN qr,aa,rd 224 0.000143503s
[pod/coredns-57c45885c5-9gxlg/coredns] [INFO] 10.3.245.19:49892 - 14415 "A IN worker-306.worker.default.svc.cluster.local. udp 142 false 4096" NXDOMAIN qr,aa,rd 224 0.000120772s
[pod/coredns-57c45885c5-bkr9d/coredns] [INFO] 10.3.245.19:44248 - 53721 "A IN worker-198.worker.default.svc.cluster.local. udp 142 false 4096" NXDOMAIN qr,aa,rd 224 0.000123284s
[pod/coredns-57c45885c5-bkr9d/coredns] [INFO] 10.3.245.19:34874 - 15278 "A IN worker-199.worker.default.svc.cluster.local. udp 142 false 4096" NXDOMAIN qr,aa,rd 224 0.000116685s
[pod/coredns-57c45885c5-bkr9d/coredns] [INFO] 10.3.245.19:48759 - 7303 "A IN worker-200.worker.default.svc.cluster.local. udp 142 false 4096" NXDOMAIN qr,aa,rd 224 0.000343288s
  • OS (e.g: cat /etc/os-release):

  • Others:

bala19 added the bug label on Apr 30, 2024