Services report errors after adding a GPU to the cluster #153

Open
zzzzzzyzz opened this issue Dec 7, 2022 · 1 comment

Comments

@zzzzzzyzz

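I have a question. After I added a GPU to the cluster, there were no errors at first, but a few hours later services such as cattle-cluster-agent, canal, coredns, and kubeflow-prometheus-adapter started restarting and updating continuously. From what I can see, the errors fall roughly into the following three kinds. How can this be resolved?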
Readiness probe failed: Get http://10.42.0.14:9090/-/ready: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)

Liveness probe failed: Get http://10.42.0.2:8080/health: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)

Nameserver limits were exceeded, some nameservers have been omitted, the applied nameserver line is: 210.28.18.30 210.28.16.26 210.28.18.26
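
For the third message: kubelet applies at most 3 nameservers from the node's resolv.conf and logs this warning when more are listed. A minimal sketch for checking which entries would be dropped, assuming Python 3 is available on the node. The fourth address in the sample (192.0.2.1) is a hypothetical placeholder, not taken from my node; the real input would be the contents of /etc/resolv.conf.

```python
# Limit that kubelet applies to nameserver entries (same as glibc's resolver limit).
MAX_NAMESERVERS = 3


def parse_nameservers(resolv_conf_text):
    """Return the nameserver addresses in the order they appear."""
    servers = []
    for line in resolv_conf_text.splitlines():
        parts = line.split()
        # Only lines of the form "nameserver <address>" are counted;
        # comments, "search" and "options" lines are skipped.
        if len(parts) >= 2 and parts[0] == "nameserver":
            servers.append(parts[1])
    return servers


def omitted_nameservers(resolv_conf_text, limit=MAX_NAMESERVERS):
    """Return the nameservers beyond the limit, i.e. the ones that get dropped."""
    return parse_nameservers(resolv_conf_text)[limit:]


# Sample input: the first three addresses come from the warning above,
# the fourth is a hypothetical extra entry used only for illustration.
sample = """\
# example resolv.conf
nameserver 210.28.18.30
nameserver 210.28.16.26
nameserver 210.28.18.26
nameserver 192.0.2.1
"""

if __name__ == "__main__":
    print("applied:", parse_nameservers(sample)[:MAX_NAMESERVERS])
    print("omitted:", omitted_nameservers(sample))
```

On a real node the text can be read with open("/etc/resolv.conf").read(). This warning on its own may be unrelated to the probe timeouts.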

@zzzzzzyzz
Author

To add detail: once the single-node cluster is fully configured, after some time many services in Rancher show "Deployment does not have minimum availability." The logs then show Readiness probe failed and Liveness probe failed errors such as:
Readiness probe failed: Get http://localhost:9099/readiness: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
Is this caused by insufficient resources, or is there another cause?
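
One way to tell a slow endpoint from a dead one is to time the same GET from the node. A minimal sketch, assuming Python 3 on the node and a 1 second timeout (the Kubernetes default for timeoutSeconds; I have not checked what the probes of these pods are actually set to). The helper name probe is made up for illustration.

```python
import http.client
import time
import urllib.request

# Opener that ignores proxy environment variables, since pod and
# localhost addresses should be reached directly.
_opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def probe(url, timeout=1.0):
    """Send one GET and return (ok, elapsed_seconds, detail).

    Note: urllib follows redirects and raises on 4xx/5xx, so this only
    approximates kubelet's rule that status codes 200-399 count as success.
    """
    start = time.monotonic()
    try:
        with _opener.open(url, timeout=timeout) as resp:
            ok = True
            detail = f"HTTP {resp.status}"
    except (OSError, http.client.HTTPException) as exc:
        # Covers timeouts, refused connections, and HTTP error statuses.
        ok = False
        detail = f"{type(exc).__name__}: {exc}"
    return ok, time.monotonic() - start, detail


if __name__ == "__main__":
    # URL taken from the error message above.
    for _ in range(5):
        print(probe("http://localhost:9099/readiness"))
        time.sleep(1)
```

If the elapsed time is often close to the timeout while the endpoint does eventually answer with a larger timeout, that would point toward the node being overloaded rather than the service being down, but this is only a rough check.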
