Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TTL for cleaning up Jobs after they finish #66840

Merged
merged 7 commits into from Sep 5, 2018

Conversation

janetkuo
Copy link
Member

@janetkuo janetkuo commented Jul 31, 2018

What this PR does / why we need it: kubernetes/enhancements#592

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #64470
For kubernetes/enhancements#592

Special notes for your reviewer: @kubernetes/sig-apps-pr-reviews

Release note:

Add a TTL machenism to clean up Jobs after they finish.

@janetkuo janetkuo requested a review from kow3ns July 31, 2018 21:50
@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. sig/apps Categorizes an issue or PR as relevant to SIG Apps. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Jul 31, 2018
@k8s-github-robot k8s-github-robot added the kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API label Jul 31, 2018
@janetkuo janetkuo changed the title Job ttl TTL for cleaning up Jobs after they finish Jul 31, 2018
@david-mcmahon david-mcmahon removed their request for review July 31, 2018 21:53
@janetkuo janetkuo removed the request for review from dchen1107 July 31, 2018 22:53
@janetkuo janetkuo force-pushed the job-ttl branch 3 times, most recently from a1ff331 to 94efaae Compare August 1, 2018 01:16
@tnozicka
Copy link
Contributor

tnozicka commented Aug 1, 2018

@janetkuo could we use dynamic client/informer to handle TTL for both finished Jobs and Pods in 1 controller? Seems like this would be fairly similar logic.

Copy link
Member

@kow3ns kow3ns left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally approve. Needs e2e as well.

// finishes. This field is alpha-level and is only honored by servers that
// enable the TTLAfterFinished feature.
// +optional
TTLSecondsAfterFinished *int32
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might want to consider using TTLSecondsAfterCompletion to be consistent with the terminology in JobStatus.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Jobs finishes once its status is either complete or failed. Combined complete + failed into finished for simplicity, based on user feedback from kubeflow.

// alpha: v1.12
//
// Allow TTL controller to clean up Pods and Jobs after they finish.
TTLAfterFinished utilfeature.Feature = "TTLAfterFinished"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as previous comment with respect to Completion instead of Finished.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto

clock clock.Clock
}

func New(jobInformer batchinformers.JobInformer, client clientset.Interface) *TTLAfterFinishedController {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add comment for publicly exposed method.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

"k8s.io/kubernetes/pkg/util/metrics"
)

type TTLAfterFinishedController struct {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should describe the type. You should add a comment explaining what the controller does and the design choice to have the contorller live outside of JobController.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

return tc
}

func (tc *TTLAfterFinishedController) Run(workers int, stopCh <-chan struct{}) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needs a comment.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

if err != nil {
return nil, err
}
// TODO: handle finish time that happens in the future
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should come up with a well defined behavior for when this occurs. Consider simply deferring destroying the victim after the finish time. It's conservative, but it can't do harm.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Because it happens in the future it'll only delay destorying the Job.

@kow3ns
Copy link
Member

kow3ns commented Aug 2, 2018

/approve

@janetkuo
Copy link
Member Author

janetkuo commented Aug 3, 2018

@tnozicka this TTL controller only needs to watch Jobs and Pods for now; Jobs and Pods have different ways of indicating that they "finished". I'd like to switch to dynamic client/informer when we extend this to custom resources.

@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Aug 6, 2018
@janetkuo janetkuo force-pushed the job-ttl branch 4 times, most recently from a439b11 to 322c881 Compare August 6, 2018 22:18
@kow3ns
Copy link
Member

kow3ns commented Aug 7, 2018

/approve

@kow3ns
Copy link
Member

kow3ns commented Aug 10, 2018

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 10, 2018
@janetkuo
Copy link
Member Author

/assign @deads2k @liggitt

Need your approval. Thanks!

@janetkuo
Copy link
Member Author

janetkuo commented Sep 4, 2018

Just fixing verify failure. Re-applying tag.

@janetkuo janetkuo added lgtm "Looks good to me", indicates that a PR is ready to be merged. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. kind/feature Categorizes issue or PR as related to a new feature. labels Sep 4, 2018
)

var (
ttlEnabled = utilfeature.DefaultFeatureGate.Enabled(features.TTLAfterFinished)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

declare this in TestJobsStrategy since it's the only place you are using it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@@ -172,6 +173,9 @@ func NewKubeControllerManagerOptions() (*KubeControllerManagerOptions, error) {
ServiceController: &cmoptions.ServiceControllerOptions{
ConcurrentServiceSyncs: componentConfig.ServiceController.ConcurrentServiceSyncs,
},
TTLAfterFinishedController: &TTLAfterFinishedControllerOptions{
ConcurrentTTLSyncs: componentConfig.TTLAfterFinishedController.ConcurrentTTLSyncs,
Copy link
Member

@mikedanese mikedanese Sep 4, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a proponent of YAGNI when it comes to configuration like parallel syncs. Is there a reason we want it explicitly for this feature as compared to other loops? An operator guessed value might be slighter better than a developer guessed value in an edge situation but the cost is extra code, maintainablity and suportability complexity. This was @cheftako's request

Copy link
Member

@mikedanese mikedanese Sep 4, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer adding workqueue latency and handletime metrics and explore adaptive worker counts then hand knobs to operators as a last resort.

…abled

1. If TTLAfterFinished feature is enabled, the value should be non-negative.
2. If TTLAfterFinished feature is disabled, the field value should not
be kept.
make clean && make generated_files
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 4, 2018
job := cur.(*batch.Job)
glog.V(4).Infof("Updating job %s/%s", job.Namespace, job.Name)

if job.DeletionTimestamp == nil && needsCleanup(job) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not move job.DeletionTimestamp == nil into needsCleanup?

}

func getFinishAndExpireTime(j *batch.Job) (*time.Time, *time.Time, error) {
if !needsCleanup(j) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Call this once at the begining of processJob, then don't check it again in all these helpers?

@mikedanese
Copy link
Member

/approve

I'd like to see whether we can remove the flag before beta, but I won't block this PR.

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 4, 2018
@mikedanese
Copy link
Member

per previous lgtms
/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 4, 2018
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cheftako, dixudx, janetkuo, kow3ns, liggitt, mikedanese, tnozicka

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@janetkuo
Copy link
Member Author

janetkuo commented Sep 4, 2018

/test pull-kubernetes-e2e-kops-aws

1 similar comment
@janetkuo
Copy link
Member Author

janetkuo commented Sep 5, 2018

/test pull-kubernetes-e2e-kops-aws

@k8s-github-robot
Copy link

/test all [submit-queue is verifying that this PR is safe to merge]

@janetkuo
Copy link
Member Author

janetkuo commented Sep 5, 2018

/retest

@k8s-github-robot
Copy link

Automatic merge from submit-queue (batch tested with PRs 66840, 68159). If you want to cherry-pick this change to another branch, please follow the instructions here: https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md.

@k8s-github-robot k8s-github-robot merged commit c50a347 into kubernetes:master Sep 5, 2018
@k8s-ci-robot
Copy link
Contributor

@janetkuo: The following test failed, say /retest to rerun them all:

Test name Commit Details Rerun command
pull-kubernetes-e2e-gce-100-performance 13b76d5 link /test pull-kubernetes-e2e-gce-100-performance

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API kind/feature Categorizes issue or PR as related to a new feature. lgtm "Looks good to me", indicates that a PR is ready to be merged. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/apps Categorizes an issue or PR as relevant to SIG Apps. sig/architecture Categorizes an issue or PR as relevant to SIG Architecture. sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Job objects are not cleaned up
10 participants