Support Elastic JobSets #463

Open
kannon92 opened this issue Mar 21, 2024 · 6 comments · May be fixed by #529

Comments

@kannon92
Contributor

What would you like to be added:

With elastic Indexed Jobs, it is possible to change completions/parallelism to scale your jobs down or up.

It would be nice to have something similar for JobSet.

Why is this needed:

Elastic jobs are an important use case for autoscaling and other scenarios.

Implementation:

At a quick glance at the API, this may already be possible: the replicas of a ReplicatedJob are not immutable, so I think someone could patch the replicas of a ReplicatedJob to scale down or up.

But then I wonder what we should do with the existing replicated job?

And should we support ElasticIndexedJob with JobSet (so someone could patch the JobTemplate in a single replicated job)?
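To make the patch idea above concrete, here is a minimal sketch of what such a request body might look like as a JSON Patch (RFC 6902). The helper name `replicas_patch` is hypothetical, and the sketch assumes the field lives at `spec.replicatedJobs[i].replicas` and is already set on the object; it says nothing about whether the admission webhook accepts the change.

```python
import json


def replicas_patch(replicated_jobs, name, new_replicas):
    """Build a JSON Patch that sets replicas on one ReplicatedJob.

    replicated_jobs is the current spec.replicatedJobs list (as dicts).
    JSON Patch addresses list entries by position, so the entry is
    looked up by name first.
    """
    if new_replicas < 0:
        raise ValueError("replicas must be non-negative")
    for index, rjob in enumerate(replicated_jobs):
        if rjob["name"] == name:
            return [{
                "op": "replace",
                "path": f"/spec/replicatedJobs/{index}/replicas",
                "value": new_replicas,
            }]
    raise KeyError(f"no replicatedJob named {name!r}")


# Hypothetical JobSet with a leader and a workers ReplicatedJob.
current = [{"name": "leader", "replicas": 4}, {"name": "workers", "replicas": 1}]
patch = replicas_patch(current, "workers", 3)
print(json.dumps(patch))
```

The printed document could then be passed to `kubectl patch jobset <name> --type=json -p '<patch>'`.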

@kannon92 changed the title from "[RFC]: Elastic JobSet" to "Discussion: Elastic JobSet" on Mar 21, 2024
@ahg-g
Contributor

ahg-g commented Mar 21, 2024

Yes, I think we should think about allowing the number of replicas in a replicatedJob to be autoscaled! For example, the number of TPU slices (or, more generally, accelerator islands) supporting a large-scale training job could scale down in case of failures.

But then I wonder what should we do with the existing replicated job?

The jobs are indexed, so scaling down means removing the ones with the highest indexes.
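A minimal sketch of that rule, assuming child Jobs are named `<jobset>-<replicatedJob>-<index>` with indexes starting at 0. The function names are hypothetical and this is not the controller's actual code; it only illustrates which child Jobs a replicas change would add or remove.

```python
def child_job_names(jobset_name, rjob_name, replicas):
    """Names of the child Jobs for one ReplicatedJob, ordered by job index."""
    return [f"{jobset_name}-{rjob_name}-{i}" for i in range(replicas)]


def plan_resize(jobset_name, rjob_name, old_replicas, new_replicas):
    """Return (to_create, to_delete) child Job names for a replicas change.

    Scaling down drops the highest indexes; scaling up appends new
    indexes after the existing ones. Jobs whose index is below both
    counts are left untouched.
    """
    old = child_job_names(jobset_name, rjob_name, old_replicas)
    new = child_job_names(jobset_name, rjob_name, new_replicas)
    to_create = new[old_replicas:]
    to_delete = old[new_replicas:]
    return to_create, to_delete


# Scaling a hypothetical "workers" ReplicatedJob from 4 replicas down to 2.
created, deleted = plan_resize("demo", "workers", 4, 2)
print(created, deleted)
```

Under this rule the surviving Jobs keep their indexes, so anything keyed on the job index stays stable across a resize.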

And should we support ElasticIndexedJob with JobSet (so someone could patch the JobTemplate in a single replicated job)?

It is possible.

If the child jobs themselves should be elastic, then the operator could change the individual jobs directly. I guess we could also allow changing that in bulk for all job replicas, but I need to hear a use case first.

@kannon92
Contributor Author

So @ahg-g, it sounds like this is already supported, and you are correct: both replicas in a ReplicatedJob and the JobTemplate are mutable.

Maybe we should consider a task for this to at least document that this is possible?

I think Kueue and other users would be interested in this, but I'm not sure what we need in this repo.

@ahg-g
Contributor

ahg-g commented Mar 21, 2024

We need to have tests for that to verify the behavior though.

@kannon92
Contributor Author

Well good thing I tried it haha.

I think this code is blocking us from doing this.

https://github.com/kubernetes-sigs/jobset/blob/main/api/jobset/v1alpha2/jobset_webhook.go#L172

I took a simple example, ran kubectl edit jobset, and tried changing the replicas of a ReplicatedJob.

Got:

error: jobsets.jobset.x-k8s.io "simple-no-ttl" could not be patched: admission webhook "vjobset.kb.io" denied the request: spec.replicatedJobs: Invalid value: []v1alpha2.ReplicatedJob{v1alpha2.ReplicatedJob{Name:"leader", Template:v1.JobTemplateSpec{ObjectMeta:v1.ObjectMeta{Name:"", GenerateName:"", Namespace:"", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:time.Date(1, time.January, 1, 0, 0, 0, 0, time.UTC), DeletionTimestamp:<nil>, DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Finalizers:[]string(nil), ManagedFields:[]v1.ManagedFieldsEntry(nil)}, Spec:v1.JobSpec{Parallelism:(*int32)(0xc0001fe320), Completions:(*int32)(0xc0001fe324), ActiveDeadlineSeconds:(*int64)(nil), PodFailurePolicy:(*v1.PodFailurePolicy)(nil), BackoffLimit:(*int32)(0xc0001fe328), BackoffLimitPerIndex:(*int32)(nil), MaxFailedIndexes:(*int32)(nil), Selector:(*v1.LabelSelector)(nil), ManualSelector:(*bool)(nil), Template:v1.PodTemplateSpec{ObjectMeta:v1.ObjectMeta{Name:"", GenerateName:"", Namespace:"", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:time.Date(1, time.January, 1, 0, 0, 0, 0, time.UTC), DeletionTimestamp:<nil>, DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Finalizers:[]string(nil), ManagedFields:[]v1.ManagedFieldsEntry(nil)}, Spec:v1.PodSpec{Volumes:[]v1.Volume(nil), InitContainers:[]v1.Container(nil), Containers:[]v1.Container{v1.Container{Name:"leader", Image:"bash:latest", Command:[]string{"bash", "-xc", "sleep 10000\n"}, Args:[]string(nil), WorkingDir:"", Ports:[]v1.ContainerPort(nil), EnvFrom:[]v1.EnvFromSource(nil), Env:[]v1.EnvVar(nil), Resources:v1.ResourceRequirements{Limits:v1.ResourceList(nil), Requests:v1.ResourceList(nil), Claims:[]v1.ResourceClaim(nil)}, ResizePolicy:[]v1.ContainerResizePolicy(nil), 
RestartPolicy:(*v1.ContainerRestartPolicy)(nil), VolumeMounts:[]v1.VolumeMount(nil), VolumeDevices:[]v1.VolumeDevice(nil), LivenessProbe:(*v1.Probe)(nil), ReadinessProbe:(*v1.Probe)(nil), StartupProbe:(*v1.Probe)(nil), Lifecycle:(*v1.Lifecycle)(nil), TerminationMessagePath:"", TerminationMessagePolicy:"", ImagePullPolicy:"", SecurityContext:(*v1.SecurityContext)(nil), Stdin:false, StdinOnce:false, TTY:false}}, EphemeralContainers:[]v1.EphemeralContainer(nil), RestartPolicy:"OnFailure", TerminationGracePeriodSeconds:(*int64)(nil), ActiveDeadlineSeconds:(*int64)(nil), DNSPolicy:"", NodeSelector:map[string]string(nil), ServiceAccountName:"", DeprecatedServiceAccount:"", AutomountServiceAccountToken:(*bool)(nil), NodeName:"", HostNetwork:false, HostPID:false, HostIPC:false, ShareProcessNamespace:(*bool)(nil), SecurityContext:(*v1.PodSecurityContext)(nil), ImagePullSecrets:[]v1.LocalObjectReference(nil), Hostname:"", Subdomain:"", Affinity:(*v1.Affinity)(nil), SchedulerName:"", Tolerations:[]v1.Toleration(nil), HostAliases:[]v1.HostAlias(nil), PriorityClassName:"", Priority:(*int32)(nil), DNSConfig:(*v1.PodDNSConfig)(nil), ReadinessGates:[]v1.PodReadinessGate(nil), RuntimeClassName:(*string)(nil), EnableServiceLinks:(*bool)(nil), PreemptionPolicy:(*v1.PreemptionPolicy)(nil), Overhead:v1.ResourceList(nil), TopologySpreadConstraints:[]v1.TopologySpreadConstraint(nil), SetHostnameAsFQDN:(*bool)(nil), OS:(*v1.PodOS)(nil), HostUsers:(*bool)(nil), SchedulingGates:[]v1.PodSchedulingGate(nil), ResourceClaims:[]v1.PodResourceClaim(nil)}}, TTLSecondsAfterFinished:(*int32)(nil), CompletionMode:(*v1.CompletionMode)(0xc00099fa60), Suspend:(*bool)(nil), PodReplacementPolicy:(*v1.PodReplacementPolicy)(nil)}}, Replicas:4}, v1alpha2.ReplicatedJob{Name:"workers", Template:v1.JobTemplateSpec{ObjectMeta:v1.ObjectMeta{Name:"", GenerateName:"", Namespace:"", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:time.Date(1, time.January, 1, 0, 0, 0, 0, time.UTC), 
DeletionTimestamp:<nil>, DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Finalizers:[]string(nil), ManagedFields:[]v1.ManagedFieldsEntry(nil)}, Spec:v1.JobSpec{Parallelism:(*int32)(0xc0001fe32c), Completions:(*int32)(0xc0001fe330), ActiveDeadlineSeconds:(*int64)(nil), PodFailurePolicy:(*v1.PodFailurePolicy)(nil), BackoffLimit:(*int32)(0xc0001fe334), BackoffLimitPerIndex:(*int32)(nil), MaxFailedIndexes:(*int32)(nil), Selector:(*v1.LabelSelector)(nil), ManualSelector:(*bool)(nil), Template:v1.PodTemplateSpec{ObjectMeta:v1.ObjectMeta{Name:"", GenerateName:"", Namespace:"", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:time.Date(1, time.January, 1, 0, 0, 0, 0, time.UTC), DeletionTimestamp:<nil>, DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Finalizers:[]string(nil), ManagedFields:[]v1.ManagedFieldsEntry(nil)}, Spec:v1.PodSpec{Volumes:[]v1.Volume(nil), InitContainers:[]v1.Container(nil), Containers:[]v1.Container{v1.Container{Name:"worker", Image:"bash:latest", Command:[]string{"bash", "-xc", "sleep 100000\n"}, Args:[]string(nil), WorkingDir:"", Ports:[]v1.ContainerPort(nil), EnvFrom:[]v1.EnvFromSource(nil), Env:[]v1.EnvVar(nil), Resources:v1.ResourceRequirements{Limits:v1.ResourceList(nil), Requests:v1.ResourceList(nil), Claims:[]v1.ResourceClaim(nil)}, ResizePolicy:[]v1.ContainerResizePolicy(nil), RestartPolicy:(*v1.ContainerRestartPolicy)(nil), VolumeMounts:[]v1.VolumeMount(nil), VolumeDevices:[]v1.VolumeDevice(nil), LivenessProbe:(*v1.Probe)(nil), ReadinessProbe:(*v1.Probe)(nil), StartupProbe:(*v1.Probe)(nil), Lifecycle:(*v1.Lifecycle)(nil), TerminationMessagePath:"", TerminationMessagePolicy:"", ImagePullPolicy:"", SecurityContext:(*v1.SecurityContext)(nil), Stdin:false, StdinOnce:false, TTY:false}}, 
EphemeralContainers:[]v1.EphemeralContainer(nil), RestartPolicy:"OnFailure", TerminationGracePeriodSeconds:(*int64)(nil), ActiveDeadlineSeconds:(*int64)(nil), DNSPolicy:"", NodeSelector:map[string]string(nil), ServiceAccountName:"", DeprecatedServiceAccount:"", AutomountServiceAccountToken:(*bool)(nil), NodeName:"", HostNetwork:false, HostPID:false, HostIPC:false, ShareProcessNamespace:(*bool)(nil), SecurityContext:(*v1.PodSecurityContext)(nil), ImagePullSecrets:[]v1.LocalObjectReference(nil), Hostname:"", Subdomain:"", Affinity:(*v1.Affinity)(nil), SchedulerName:"", Tolerations:[]v1.Toleration(nil), HostAliases:[]v1.HostAlias(nil), PriorityClassName:"", Priority:(*int32)(nil), DNSConfig:(*v1.PodDNSConfig)(nil), ReadinessGates:[]v1.PodReadinessGate(nil), RuntimeClassName:(*string)(nil), EnableServiceLinks:(*bool)(nil), PreemptionPolicy:(*v1.PreemptionPolicy)(nil), Overhead:v1.ResourceList(nil), TopologySpreadConstraints:[]v1.TopologySpreadConstraint(nil), SetHostnameAsFQDN:(*bool)(nil), OS:(*v1.PodOS)(nil), HostUsers:(*bool)(nil), SchedulingGates:[]v1.PodSchedulingGate(nil), ResourceClaims:[]v1.PodResourceClaim(nil)}}, TTLSecondsAfterFinished:(*int32)(nil), CompletionMode:(*v1.CompletionMode)(0xc00099fa70), Suspend:(*bool)(nil), PodReplacementPolicy:(*v1.PodReplacementPolicy)(nil)}}, Replicas:1}}: field is immutable

@kannon92
Contributor Author

/retitle Support Elastic JobSets

@k8s-ci-robot changed the title from "Discussion: Elastic JobSet" to "Support Elastic JobSets" on Mar 21, 2024
@kannon92
Contributor Author

I opened up #465 for discussion. We were treating the entire replicated job as immutable. It isn't clear to me what validation logic we want to have for a replicated job. We could allow changing only replicas, keeping name and JobTemplate immutable.
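As a rough sketch of that last option (only replicas mutable), the update check could compare the old and new lists entry by entry. This is illustrative Python with hypothetical names, not the webhook's Go code, and the rule that entries cannot be added or removed is an assumption that the thread does not settle.

```python
def validate_replicated_jobs_update(old, new):
    """Return error strings for an update to spec.replicatedJobs.

    Only replicas may differ between old and new entries; name and
    template are treated as immutable. An empty list means the update
    would be allowed.
    """
    # Assumption: the set of ReplicatedJobs itself is fixed after creation.
    if len(old) != len(new):
        return ["spec.replicatedJobs: entries cannot be added or removed"]
    errors = []
    for i, (before, after) in enumerate(zip(old, new)):
        path = f"spec.replicatedJobs[{i}]"
        if before["name"] != after["name"]:
            errors.append(f"{path}.name: field is immutable")
        if before["template"] != after["template"]:
            errors.append(f"{path}.template: field is immutable")
        if after["replicas"] < 0:
            errors.append(f"{path}.replicas: must be non-negative")
    return errors


# Hypothetical specs: same name and template, replicas changed from 1 to 3.
old_spec = [{"name": "workers", "template": {"image": "bash:latest"}, "replicas": 1}]
new_spec = [{"name": "workers", "template": {"image": "bash:latest"}, "replicas": 3}]
print(validate_replicated_jobs_update(old_spec, new_spec))
```

Reporting one error per field, rather than one for the whole list, would also avoid dumping the full spec into the message the way the current rejection does.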
