New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PESDLC0-1319 Log all kubectl command output #18271
Conversation
new failures in https://buildkite.com/redpanda/redpanda/builds/48755#018f4ff2-f5fa-46f1-9492-a005f1173645:
new failures in https://buildkite.com/redpanda/redpanda/builds/48755#018f5004-45f2-430a-8b5b-18e00bef4722:
new failures in https://buildkite.com/redpanda/redpanda/builds/48755#018f5004-45f5-477f-8e3f-949de62d0601:
new failures in https://buildkite.com/redpanda/redpanda/builds/48755#018f4ff2-f5f8-4b96-963c-9898efff9a61:
new failures in https://buildkite.com/redpanda/redpanda/builds/48755#018f537c-b5d0-4035-9c68-08673b6ae111:
new failures in https://buildkite.com/redpanda/redpanda/builds/48755#018f537c-aba0-4f10-b9d7-920b14eb2865:
new failures in https://buildkite.com/redpanda/redpanda/builds/48755#018f539a-3d23-4dcc-a4ed-a3ff1b96754b:
new failures in https://buildkite.com/redpanda/redpanda/builds/48755#018f539a-3d25-4d99-b2f1-84c0b0f74ed0:
new failures in https://buildkite.com/redpanda/redpanda/builds/48755#018f53ce-1eb3-4b63-8d29-2c24c09b438d:
new failures in https://buildkite.com/redpanda/redpanda/builds/48755#018f53ce-1e93-470b-b264-b630585c0952:
new failures in https://buildkite.com/redpanda/redpanda/builds/48755#018f5408-b361-4f2a-a343-5a1fb8065b19:
new failures in https://buildkite.com/redpanda/redpanda/builds/48755#018f5408-b363-423d-afc1-b42eb9a6c9cc:
new failures in https://buildkite.com/redpanda/redpanda/builds/48833#018f5920-e6de-4530-90ad-93d8e569d8a5:
new failures in https://buildkite.com/redpanda/redpanda/builds/48833#018f5920-e6e0-4063-b761-4ce64c1a03f2:
new failures in https://buildkite.com/redpanda/redpanda/builds/48833#018f5929-2d39-4f31-ba72-b6cbcfa3024b:
new failures in https://buildkite.com/redpanda/redpanda/builds/48833#018f5929-2d35-42e3-a932-5f4323721b7a:
|
ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/48755#018f537c-aba0-4f10-b9d7-920b14eb2865 ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/48755#018f539a-3d27-4529-b7f0-f25c548d7ded ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/48755#018f5408-b361-4f2a-a343-5a1fb8065b19 ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/48851#018f5df8-5759-4d50-9412-7ba7aff53f53 ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/48901#018f5ee3-b416-4742-a79a-16e472f6e370 ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/48901#018f5ee3-b413-4818-92d6-c294ef501ce4 |
self._redpanda.logger.info(cmd) | ||
try: | ||
return subprocess.check_output(cmd, stderr=subprocess.PIPE) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand why we changed away from check_output
. It seems like we manually re-implement almost the same semantics as before, which is to log both stderr and stdout if an error occurs (non-zero return code). The main difference seems to be that if stdout is empty we return stderr instead? Is that the main reason for this change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. I do not want to mix stderr vs stdout and rely only on return code. In the future, once we understand why we have agent fails, we'll add the stderr checks as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
check_output is the equivalent to run(..., check=True, stdout=PIPE).stdout
which not includes stderr which holds valuable strings. Also, this makes code clearer and more understandable.
tests/rptest/clients/kubectl.py
Outdated
# It will be hard to detect errors as a lot of apps and utils | ||
# just uses stderr as normal output. For example, tsh | ||
|
||
process.kill() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You don't need kill
here, I don't think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The process class in memory will not be freed in full otherwise. Also, there might be rare situations when RC exists and the process in still alive. Instead of p.join() I use kill to make sure that there will be no process hang of any kind.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, this probably needs update to handle timeouts and it will justify use of kill(). As process is not killed on timeout.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Timeout is a good reason to kill the process, but as you point out we won't kill
it here because that's an exception and will bypass this.
I don't think there is any need to call kill() here. Can you provide any reference? The docs certainly don't agree: the require communciate
to wait for process termination before returning (unless timeout). I checked the CPython code and kill does not clean up the process class, it simply sends a kill signal to the process, and if the return code is available (which it will be after communicate by construction) it doesn't do anything at all.
tests/rptest/clients/kubectl.py
Outdated
if process.returncode != 0: | ||
self.logger.info(f'Command failed (rc={process.returncode}).\n' | ||
f'{_prepare_output(s_out, s_err)}') | ||
raise RuntimeError("Last command returned non-zero exit code") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the key exception that will be surfaced in the results, please try to make it as good as the old one, which included the command string, etc. I think you should probably also throw CalledProcessError
for consistency as that's what we were returning before.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed. Will do.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The commit says log all output, but AFAICT we only log on error, same as before?
In line 159, here, there is debug log for the output |
@travisdowns, here is some examples of
and
|
Test run using integration and tier-1 config
|
Thanks missed this. |
tests/rptest/clients/kubectl.py
Outdated
stderr=subprocess.PIPE, | ||
text=True) | ||
s_out, s_err = process.communicate() | ||
self.logger.debug(_prepare_output(s_out, s_err)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the failure case, doesn't this mean we log the output 2x to the debug log?
You could just log once and choose between "debug" and "info" based on whether the process failed or not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. Updated.
I'm not following. This looks like the normal The second output looks totally expected to me, it's |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't have access to PESDLC0-1319, so I may be missing some key context on the purpose of this change.
Having all output is critical for debug purposes. There might be something that will help to detect errors. |
There is nothing critical there. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is nothing critical there.
Then please include in the cover letter the context and specific goals for this change, so I can evaluate on those merits.
Done. |
PESDLC-1319 Use exceptions and add timeout PESDLC-1319 Updated debug logging based on RC
New kubectl _localcmd method uses text mode. So calls to decode() should be removed
d617faa
to
ab4b1f8
Compare
Resolved conflicts after log copy merging and squashed multiple changes in the same place to one |
Test run after resolving conflicts
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the added clarity.
Just for a note, I haven't actually seen any problems with stderr/stdout going to the "wrong" place with nested commands like tsh ssh kubectl foo
- that is, it seemed to me that the output from foo
would end up on the same channel it was output all the way back to the consuming code in Python. Have you seen different?
Yes, when running kubectl exec commands. Sometimes output streams get mixed up. Not sure why. |
/backport v24.1.x |
Failed to create a backport PR to v24.1.x branch. I tried:
|
In order to better understand what is happening when commands are running and have full control in the future over output, this PR updates local cmd run command to log everything when running command and collect and handle all output from command being ran in subprocess.
Due to various scenarios with running multiple nested commands, output could env up in stderr or stdout. This is why we are using communicate instead of check_output or similar. This offers more flexibility on handling errors and getting correct Exception handling, etc (Timeouts, RC)
Major change is that running command should log all output (stdout and stderr) as debug when running normally (RC=0) and log output to info when command failed for some reason (RC!=0)
Also, there must be control over how long command should be running by adding timeout handling.
Backports Required
Release Notes