What happened:
We are observing CoreDNS intermittently returning NXDOMAIN for the cluster-local FQDNs of pods that are running as part of a Kubeflow MPIJob. The issue is observed only at scale, when hundreds of pods are running as part of the same job. If we make thousands of DNS queries within a few seconds for the FQDNs of all the pods in a job, NXDOMAIN is returned 2 to 15 times (< 1% of queries). The NXDOMAIN responses are for random pods in the job each time, and they come from random CoreDNS pods. For the same domain name of an active pod, CoreDNS returns NOERROR, then NXDOMAIN, and then NOERROR again, all within a few seconds.
We saw this issue on 10 nodes (EC2 m5.xlarge) running as part of an EKS cluster with Kubernetes version 1.24. The MPI job uses pods with a subdomain pointing to a headless service.
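To make the query pattern above concrete, here is a rough sketch of the kind of burst we mean. It is not the exact tool we used, and the naming is an assumption: the job name, service name, namespace and the `<job>-worker-<i>` pod naming are placeholders, and `pod_fqdns`, `system_resolve` and `hammer` are hypothetical helpers. The default resolver goes through the system resolver, so it can only report that a lookup failed, not that the response code was specifically NXDOMAIN.

```python
import socket
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def pod_fqdns(job, service, namespace, count, zone="cluster.local"):
    """Build names of the form <hostname>.<subdomain>.<namespace>.svc.<zone>."""
    # Assumption: worker pods are named "<job>-worker-<i>"; adjust to your job.
    # The trailing dot makes the name fully qualified, so search domains are skipped.
    return [f"{job}-worker-{i}.{service}.{namespace}.svc.{zone}." for i in range(count)]


def system_resolve(name):
    """Return "NOERROR" if the name resolves, otherwise "FAILED"."""
    try:
        socket.getaddrinfo(name, None)
        return "NOERROR"
    except socket.gaierror:
        # Covers NXDOMAIN and any other resolution failure alike.
        return "FAILED"


def hammer(names, rounds=10, workers=64, resolve=system_resolve):
    """Query every name `rounds` times concurrently and tally the outcomes."""
    queries = [name for _ in range(rounds) for name in names]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(resolve, queries))
    totals = Counter(results)
    failed = Counter(n for n, r in zip(queries, results) if r != "NOERROR")
    return totals, failed
```

Run from a pod inside the cluster, something like `hammer(pod_fqdns("my-job", "my-job-svc", "default", 300))` issues a few thousand lookups in a short window and returns the overall counts plus the names that failed at least once.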
What you expected to happen:
CoreDNS should not return intermittent NXDOMAIN for active pods.
How to reproduce it (as minimally and precisely as possible):
Will add a minimal repro MPIJob
Anything else we need to know?:
We want to understand how CoreDNS and the Kubernetes plugin handle caching for cluster-local domains, and whether a cache sync issue is causing this.
Environment:
the version of CoreDNS: v1.9.3-eksbuild.11 (also reproducible in v1.10.1-eksbuild.7)
OS (e.g: cat /etc/os-release):
Others: