Add Sagemaker instances to node-affinity #36

Merged
merged 4 commits into aws-observability:main on May 14, 2024

sam6134
Collaborator

@sam6134 sam6134 commented May 9, 2024

Issue #, if available:

Description of changes:

This change adds SageMaker instance types to the node affinity in the Helm chart so that monitoring can be enabled on SageMaker instances.
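
For readers unfamiliar with node affinity, the kind of rule involved can be sketched as below. This is an illustrative fragment only, not the chart's actual template: it assumes the affinity matches on the standard `node.kubernetes.io/instance-type` label, and it lists only `ml.g5.xlarge`, the instance type visible on the test nodes in the Testing section. The real list of instance types and where it lives in the chart's values are not shown in this description.

```yaml
# Hypothetical sketch of a node-affinity rule that admits a SageMaker instance type.
# The surrounding structure in the real Helm chart may differ.
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            # Standard Kubernetes label carrying the node's instance type
            - key: node.kubernetes.io/instance-type
              operator: In
              values:
                # Instance type reported by the HyperPod test nodes;
                # other ml.* types would be listed alongside it
                - ml.g5.xlarge
```

With a rule of this shape, a DaemonSet pod is scheduled only on nodes whose instance-type label matches one of the listed values, which is why SageMaker `ml.*` types have to be listed explicitly.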

Testing
Manually updated the config on a SageMaker cluster:

miconeil@80a997366eb0 EpsilonDataStoreReaderService % kubectl get nodes --show-labels=true
NAME                           STATUS   ROLES    AGE   VERSION               LABELS
hyperpod-i-01b0c5ad8bcc02027   Ready    <none>   13h   v1.29.0-eks-5e0fdde   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=hyperpod-i-01b0c5ad8bcc02027,kubernetes.io/os=linux,node.kubernetes.io/instance-type=ml.g5.xlarge,sagemaker.amazonaws.com/cluster-name=jenna-test-gpu,sagemaker.amazonaws.com/instance-group-name=group1
hyperpod-i-09d28431dfd94e184   Ready    <none>   13h   v1.29.0-eks-5e0fdde   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=hyperpod-i-09d28431dfd94e184,kubernetes.io/os=linux,node.kubernetes.io/instance-type=ml.g5.xlarge,sagemaker.amazonaws.com/cluster-name=jenna-test-gpu,sagemaker.amazonaws.com/instance-group-name=group1


miconeil@80a997366eb0 EpsilonDataStoreReaderService % kubectl get all -n amazon-cloudwatch
NAME                                                                  READY   STATUS              RESTARTS   AGE
pod/amazon-cloudwatch-observability-controller-manager-65bcd4bxp28r   1/1     Running             0          11m
pod/cloudwatch-agent-4zcrg                                            1/1     Running             0          10h
pod/cloudwatch-agent-cb8sb                                            1/1     Running             0          10h
pod/dcgm-exporter-kmqc6                                               0/1     ContainerCreating   0          6s
pod/dcgm-exporter-r2nmh                                               0/1     ContainerCreating   0          6s
pod/fluent-bit-gssdn                                                  1/1     Running             0          10h
pod/fluent-bit-r6xng                                                  1/1     Running             0          10h

NAME                                                      TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)                      AGE
service/amazon-cloudwatch-observability-webhook-service   ClusterIP   172.20.143.125   <none>        443/TCP                      10h
service/cloudwatch-agent                                  ClusterIP   172.20.37.233    <none>        4315/TCP,4316/TCP,2000/TCP   10h
service/cloudwatch-agent-headless                         ClusterIP   None             <none>        4315/TCP,4316/TCP,2000/TCP   10h
service/cloudwatch-agent-monitoring                       ClusterIP   172.20.11.163    <none>        8888/TCP                     10h
service/cloudwatch-agent-windows                          ClusterIP   172.20.190.75    <none>        4315/TCP,4316/TCP,2000/TCP   10h
service/cloudwatch-agent-windows-headless                 ClusterIP   None             <none>        4315/TCP,4316/TCP,2000/TCP   10h
service/cloudwatch-agent-windows-monitoring               ClusterIP   172.20.120.137   <none>        8888/TCP                     10h
service/dcgm-exporter-service                             ClusterIP   172.20.196.216   <none>        9400/TCP                     10h
service/neuron-monitor-service                            ClusterIP   172.20.48.129    <none>        8000/TCP                     10h

NAME                                      DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR              AGE
daemonset.apps/cloudwatch-agent           2         2         2       2            2           kubernetes.io/os=linux     10h
daemonset.apps/cloudwatch-agent-windows   0         0         0       0            0           kubernetes.io/os=windows   10h
daemonset.apps/dcgm-exporter              2         2         2       2            2           kubernetes.io/os=linux     10h
daemonset.apps/fluent-bit                 2         2         2       2            2           kubernetes.io/os=linux     10h
daemonset.apps/fluent-bit-windows         0         0         0       0            0           kubernetes.io/os=windows   10h
daemonset.apps/neuron-monitor             0         0         0       0            0           <none>                     10h

Metrics are flowing:
Screenshot 2024-05-09 at 15 45 17

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@sam6134 sam6134 requested a review from movence May 10, 2024 16:33
@sam6134 sam6134 requested a review from sky333999 May 13, 2024 17:12
@movence movence removed the request for review from sky333999 May 14, 2024 12:52
@sam6134 sam6134 merged commit 4be61c3 into aws-observability:main May 14, 2024
0 of 3 checks passed

3 participants