Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reduce Logging noise #678

Open
wants to merge 6 commits into
base: main
Choose a base branch
from

Conversation

Bobbins228
Copy link

@Bobbins228 Bobbins228 commented Nov 2, 2023

Issue link

Closes #612

What changes have been made

Removed occurrences where an entire appwrapper is logged
Bumped large informational logs up to level 4 and kept warning logs as level independent.
Added new log which logs additional resources required for an appwrapper to be dispatched

Verification steps

Set the logging level for MCAD to be level 3 and check that level 3 related logs are accurate.
Repeat with level 4 logs.

To verify the required resources log

Create an AppWrapper that has a large number of resources e.g. 8 GPUs you should receive a log like this:

I1102 16:51:22.195583  119917 queuejob_controller_ex.go:1030] [ScheduleNext] Appwrapper 'default/raytest' requires additional resources CPU: 0.000000, Memory: 0.000000, GPU: 8.000000

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • Testing is not required for this change

Copy link
Contributor

@ChristianZaccaria ChristianZaccaria left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey Mark, this is awesome job! Left a few comments and thoughts

pkg/controller/queuejob/queuejob_controller_ex.go Outdated Show resolved Hide resolved
pkg/controller/queuejob/queuejob_controller_ex.go Outdated Show resolved Hide resolved
@Bobbins228 Bobbins228 changed the title [WIP] Reduce Logging noise Reduce Logging noise Nov 6, 2023
@Bobbins228
Copy link
Author

@ChristianZaccaria Tested this out and it seems no matter the log level it will always print logs below its level. e.g. set log level to 6 you will get all logs below level 6 too.

I have opted to make warning logs like Failed to dispatch app wrapper due to insufficient resources to have no log level so they are always printed.

More informational logs that are noisy have been bumped to log level 4
For example:
available resource successful check for appwrapper has an increased log level because we would know if we did not have enough resources anyway.

@ChristianZaccaria
Copy link
Contributor

ChristianZaccaria commented Nov 6, 2023

/retest

Seems the CI is acting up. I'm pretty sure it has nothing to do with your changes.

@ChristianZaccaria
Copy link
Contributor

Hey Mark, could you try change i.e., a comment or add a comment and push changes, just to trigger the CI again and see if it works. The /retest option doesn't retrigger it for some reason

@Bobbins228
Copy link
Author

/retest

1 similar comment
@dimakis
Copy link
Contributor

dimakis commented Nov 9, 2023

/retest

@ChristianZaccaria
Copy link
Contributor

Hi @Bobbins228, I've ran MCAD locally with the debugger, and when I attempt to bring up a cluster object (cluster.up), it initially says there is insufficient resources to dispatch appwrapper. Then, I get the Raw of a GenericTemplate which are lots of numbers such as below:
image
Would you know what could this be? I'll run MCAD on the cluster as opposed via the debugger but I assume same result.

@Bobbins228
Copy link
Author

@ChristianZaccaria What log level do you have your MCAD set to?

@ChristianZaccaria
Copy link
Contributor

@ChristianZaccaria What log level do you have your MCAD set to?

On the debugger it's set to 15 by default, I thought that could be it but, update on my previous comment: With this PR I don't see the raw GenericTemplate on MCAD logs when ran in the cluster. However, when I run mcad (main branch) via the debugger, I don't get the Raw GenericTemplate. Something in this PR is causing the debugger to display that for some reason.

Copy link
Contributor

@ChristianZaccaria ChristianZaccaria left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm /approve logs are displayed as expected. Verified with log level set to 6. Great job Mark!

@openshift-ci openshift-ci bot added the lgtm label Nov 27, 2023
Copy link
Member

@asm582 asm582 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, can you please address the review

pkg/controller/queuejob/queuejob_controller_ex.go Outdated Show resolved Hide resolved
Copy link
Contributor

@ChristianZaccaria ChristianZaccaria left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great job! Thanks for addressing all the requested changes. /lgtm

@openshift-ci openshift-ci bot added the lgtm label Dec 5, 2023
Copy link

openshift-ci bot commented Dec 5, 2023

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: ChristianZaccaria
Once this PR has been reviewed and has the lgtm label, please assign anishasthana for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[logging] Umbrella issue to review logging in MCAD
6 participants