MCAD controller logs #690

Open
ordavidov opened this issue Nov 14, 2023 · 0 comments

Describe the Bug

  • The logs show multiple delete attempts on the same job ID.
  • The logs contain many deleteJob events (58423 in one hour) and very few events of any other type (318 or fewer each).

Steps to Reproduce the Bug

The MCAD log statistics come from the log file year=2023/month=11/day=13/97741eb604.2023-11-13.2300.json.gz in dipc-prod-logs. The file covers the one-hour period from 2023-11-13 23:00:00 to 2023-11-13 23:59:59. The statistics are summarized below.

Here is the summary of log events by event type:
MCAD Log Event Type | # Log Events
deleteJob | 58423
processCleanupJob | 318
Unknown | 293
UpdatePod | 251
AddPod | 67

Here are the top 5 job IDs by number of repeated log events:
Job ID | # Log Events
66d95bbd-e9ca-40ed-966e-863a5f60a8d1 | 2807
1a839594-a273-46ff-b83c-824e11645ba0 | 2740
a03b1fbb-0116-42d6-a822-1f09bd2b0238 | 2160
e73488e7-8a41-49f3-94a3-5a4f51d03f93 | 2160
4e6dc4ba-41b6-49b9-bd50-a6fc3e818349 | 2160
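
For anyone who wants to reproduce counts like these, here is a minimal sketch of how they could be computed. It rests on several assumptions that are not confirmed by the log file itself: that the .json.gz file holds one JSON object per line, that the log text lives in a field named "message" (hypothetical), that the event type appears as a plain substring of that text, and that any UUID in the text is a job ID. Adjust these to match the real log schema.

```python
import gzip
import json
import re
from collections import Counter

# Event names taken from the summary table above; matching them as
# substrings of the log text is an assumption, not the real schema.
EVENT_TYPES = ("deleteJob", "processCleanupJob", "UpdatePod", "AddPod")

# Assumption: job IDs appear in the log text as lowercase UUIDs.
UUID_RE = re.compile(
    r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}"
)


def classify(message):
    """Return the first known event name found in message, else 'Unknown'."""
    for name in EVENT_TYPES:
        if name in message:
            return name
    return "Unknown"


def summarize(lines, message_field="message"):
    """Count log events by type and by job ID over an iterable of JSON lines.

    message_field is a hypothetical field name for the log text.
    """
    by_type = Counter()
    by_job = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        message = str(record.get(message_field, ""))
        by_type[classify(message)] += 1
        # Count each job ID at most once per log event.
        for job_id in set(UUID_RE.findall(message)):
            by_job[job_id] += 1
    return by_type, by_job


def summarize_file(path, message_field="message"):
    """Run summarize() over a gzip-compressed JSON Lines file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return summarize(handle, message_field)
```

With a local copy of the log file, `by_type, by_job = summarize_file(path)` followed by `by_type.most_common()` and `by_job.most_common(5)` would give the two tables in the same shape as above.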

What Have You Already Tried to Debug the Issue?

My understanding is that MCAD reports repeated attempts to delete a job, even though it has already been deleted.

Expected Behavior

The MCAD controller logs should accurately reflect job handling on the Vela cluster.

Additional Context

Add as applicable and when known:

  • Cloud: IBM COS dipc-prod-logs. See here for access.