add Usage example for Demo
googs1025 committed Apr 26, 2024
1 parent 34c0f34 commit 007edaa
Showing 3 changed files with 49 additions and 1 deletion.
2 changes: 1 addition & 1 deletion README.md
@@ -17,7 +17,7 @@ Take a look at the [concepts](https://jobset.sigs.k8s.io/docs/concepts/) page fo

- **Support for multi-template jobs**: JobSet models a distributed training workload as a group of K8s Jobs. This allows a user to easily specify different pod templates for different distinct groups of pods (e.g. a leader, workers, parameter servers, etc.), something which cannot be done by a single Job.

- **Automatic headless service configuration and lifecycle management**: ML and HPC frameworks require a stable network endpoint for each worker in the distributed workload, and since pod IPs are dynamically assigned and can change between restarts, stable pod hostnames are required for distributed training on k8s, By default, JobSet uses [IndexedJobs](https://kubernetes.io/blog/2021/04/19/introducing-indexed-jobs/) to establish stable pod hostnames, and does automatic configuration and lifecycle management of the headless service to trigger DNS record creations and establish network connectivity via pod hostnames.You can run this [example](./examples/simple/jobset-with-network.yaml) yourself to see how exclusive placement works.
- **Automatic headless service configuration and lifecycle management**: ML and HPC frameworks require a stable network endpoint for each worker in the distributed workload, and since pod IPs are dynamically assigned and can change between restarts, stable pod hostnames are required for distributed training on k8s, By default, JobSet uses [IndexedJobs](https://kubernetes.io/blog/2021/04/19/introducing-indexed-jobs/) to establish stable pod hostnames, and does automatic configuration and lifecycle management of the headless service to trigger DNS record creations and establish network connectivity via pod hostnames.You can run this [example](examples/network-policy/jobset-with-network.yaml) yourself to see how exclusive placement works.

- **Configurable success policies**: JobSet has [configurable success policies](https://github.com/kubernetes-sigs/jobset/blob/v0.5.0/examples/simple/success-policy.yaml) which target specific ReplicatedJobs, with operators to target `Any` or `All` of their child jobs. For example, you can configure the JobSet to be marked complete if and only if all pods that are part of the “worker” ReplicatedJob are completed. This enables users to use their compute resources more efficiently, allowing a workload to be declared successful and release the resources for the next workload more quickly.

48 changes: 48 additions & 0 deletions examples/network-policy/README.md
@@ -0,0 +1,48 @@
## Usage example
### Purpose
This example shows how the pods of a JobSet can communicate with each other through a headless service.

- Apply the manifest and check that the JobSet's pods and service start normally:
```bash
root@VM-0-4-ubuntu:/home/ubuntu# vi jobset-network.yaml
root@VM-0-4-ubuntu:/home/ubuntu# kubectl apply -f jobset-network.yaml
jobset.jobset.x-k8s.io/network-jobset created
root@VM-0-4-ubuntu:/home/ubuntu# kubectl get pods -o wide
NAME                               READY   STATUS    RESTARTS   AGE   IP          NODE               NOMINATED NODE   READINESS GATES
network-jobset-leader-0-0-5xnzz    1/1     Running   0          17m   10.6.2.27   cluster1-worker    <none>           <none>
network-jobset-workers-0-0-78k9j   1/1     Running   0          17m   10.6.1.16   cluster1-worker2   <none>           <none>
network-jobset-workers-0-1-rmw42   1/1     Running   0          17m   10.6.2.28   cluster1-worker    <none>           <none>
root@VM-0-4-ubuntu:/home/ubuntu# kubectl get svc
NAME         TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)   AGE
example      ClusterIP   None         <none>        <none>    19s
kubernetes   ClusterIP   10.96.0.1    <none>        443/TCP   2d1h
```
- Use `kubectl exec` to open a shell in the leader container.
- Check the `/etc/hosts` file in the container. It contains the pod's own domain name, for example `network-jobset-leader-0-0.example.default.svc.cluster.local`.
- Other pods can reach this pod through that domain name. In the same way, this pod can reach other pods through their domain names.
- For example, from the leader we can ping the pods `network-jobset-workers-0-0-78k9j` and `network-jobset-workers-0-1-rmw42` by their domain names:
```bash
root@VM-0-4-ubuntu:/home/ubuntu# kubectl exec -it network-jobset-leader-0-0-5xnzz -- sh
/ # cat /etc/hosts
# Kubernetes-managed hosts file.
127.0.0.1 localhost
...
10.6.2.27 network-jobset-leader-0-0.example.default.svc.cluster.local network-jobset-leader-0-0
/ # ping network-jobset-workers-0-.example.default.svc.cluster.local
ping: bad address 'network-jobset-workers-0-.example.default.svc.cluster.local'
/ # ping network-jobset-workers-0-0.example.default.svc.cluster.local
PING network-jobset-workers-0-0.example.default.svc.cluster.local (10.6.1.16): 56 data bytes
64 bytes from 10.6.1.16: seq=0 ttl=62 time=0.121 ms
64 bytes from 10.6.1.16: seq=1 ttl=62 time=0.093 ms
64 bytes from 10.6.1.16: seq=2 ttl=62 time=0.094 ms
64 bytes from 10.6.1.16: seq=3 ttl=62 time=0.103 ms
--- network-jobset-workers-0-0.example.default.svc.cluster.local ping statistics ---
4 packets transmitted, 4 packets received, 0% packet loss
round-trip min/avg/max = 0.093/0.102/0.121 ms
/ # ping network-jobset-workers-0-1.example.default.svc.cluster.local
PING network-jobset-workers-0-1.example.default.svc.cluster.local (10.6.2.28): 56 data bytes
64 bytes from 10.6.2.28: seq=0 ttl=63 time=0.068 ms
64 bytes from 10.6.2.28: seq=1 ttl=63 time=0.072 ms
64 bytes from 10.6.2.28: seq=2 ttl=63 time=0.079 ms
--- network-jobset-workers-0-1.example.default.svc.cluster.local ping statistics ---
```
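The domain names used in the ping commands above appear to follow a regular pattern: JobSet name, ReplicatedJob name, job index, and pod index joined with hyphens, followed by the headless service name, the namespace, and the cluster domain. The sketch below builds such a name from its parts. The function name `pod_fqdn` is hypothetical, and the cluster domain `cluster.local` is assumed to be the default; adjust it if your cluster uses a different one.

```bash
#!/usr/bin/env bash
# Sketch only: build the DNS name of a JobSet pod from its parts.
# The pattern is inferred from the hostnames shown in this example.
# pod_fqdn is a hypothetical helper, not part of JobSet or kubectl.
pod_fqdn() {
  local jobset="$1" rjob="$2" job_index="$3" pod_index="$4"
  local service="$5"
  local namespace="${6:-default}"      # assumed default namespace
  local domain="${7:-cluster.local}"   # assumed default cluster domain
  printf '%s-%s-%s-%s.%s.%s.svc.%s\n' \
    "$jobset" "$rjob" "$job_index" "$pod_index" \
    "$service" "$namespace" "$domain"
}

# Print the names of the two worker pods from this example.
for i in 0 1; do
  pod_fqdn network-jobset workers 0 "$i" example default
done
```

Inside the leader container, each printed name can be passed to `ping` as shown in the transcript above.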
File renamed without changes.
