Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PESDLC0-1319 Log all kubectl command output #18271

Merged
merged 3 commits into from May 13, 2024

Conversation

savex
Copy link
Contributor

@savex savex commented May 6, 2024

In order to better understand what is happening when commands are running and have full control in the future over output, this PR updates local cmd run command to log everything when running command and collect and handle all output from command being ran in subprocess.

Due to various scenarios with running multiple nested commands, output could env up in stderr or stdout. This is why we are using communicate instead of check_output or similar. This offers more flexibility on handling errors and getting correct Exception handling, etc (Timeouts, RC)

Major change is that running command should log all output (stdout and stderr) as debug when running normally (RC=0) and log output to info when command failed for some reason (RC!=0)

Also, there must be control over how long command should be running by adding timeout handling.

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v24.1.x
  • v23.3.x
  • v23.2.x

Release Notes

  • none

@savex savex requested review from travisdowns and rpdevmp May 6, 2024 20:33
@savex savex self-assigned this May 6, 2024
@vbotbuildovich
Copy link
Collaborator

vbotbuildovich commented May 6, 2024

new failures in https://buildkite.com/redpanda/redpanda/builds/48755#018f4ff2-f5fa-46f1-9492-a005f1173645:

"rptest.tests.e2e_shadow_indexing_test.EndToEndThrottlingTest.test_throttling.cloud_storage_type=CloudStorageType.S3"

new failures in https://buildkite.com/redpanda/redpanda/builds/48755#018f5004-45f2-430a-8b5b-18e00bef4722:

"rptest.tests.e2e_shadow_indexing_test.EndToEndThrottlingTest.test_throttling.cloud_storage_type=CloudStorageType.ABS"

new failures in https://buildkite.com/redpanda/redpanda/builds/48755#018f5004-45f5-477f-8e3f-949de62d0601:

"rptest.tests.e2e_shadow_indexing_test.EndToEndThrottlingTest.test_throttling.cloud_storage_type=CloudStorageType.S3"

new failures in https://buildkite.com/redpanda/redpanda/builds/48755#018f4ff2-f5f8-4b96-963c-9898efff9a61:

"rptest.tests.e2e_shadow_indexing_test.EndToEndThrottlingTest.test_throttling.cloud_storage_type=CloudStorageType.ABS"

new failures in https://buildkite.com/redpanda/redpanda/builds/48755#018f537c-b5d0-4035-9c68-08673b6ae111:

"rptest.tests.e2e_shadow_indexing_test.EndToEndThrottlingTest.test_throttling.cloud_storage_type=CloudStorageType.S3"

new failures in https://buildkite.com/redpanda/redpanda/builds/48755#018f537c-aba0-4f10-b9d7-920b14eb2865:

"rptest.tests.consumer_group_test.ConsumerGroupTest.test_consumer_rejoin.static_members=True"
"rptest.tests.e2e_shadow_indexing_test.EndToEndThrottlingTest.test_throttling.cloud_storage_type=CloudStorageType.ABS"
"rptest.tests.archival_test.ArchivalTest.test_isolate.cloud_storage_type=CloudStorageType.ABS"
"rptest.tests.full_disk_test.LocalDiskReportTest.test_target_min_capacity"
"rptest.tests.rpk_start_test.RpkRedpandaStartTest.test_bootstrap_then_start"

new failures in https://buildkite.com/redpanda/redpanda/builds/48755#018f539a-3d23-4dcc-a4ed-a3ff1b96754b:

"rptest.tests.e2e_shadow_indexing_test.EndToEndThrottlingTest.test_throttling.cloud_storage_type=CloudStorageType.ABS"

new failures in https://buildkite.com/redpanda/redpanda/builds/48755#018f539a-3d25-4d99-b2f1-84c0b0f74ed0:

"rptest.tests.e2e_shadow_indexing_test.EndToEndThrottlingTest.test_throttling.cloud_storage_type=CloudStorageType.S3"

new failures in https://buildkite.com/redpanda/redpanda/builds/48755#018f53ce-1eb3-4b63-8d29-2c24c09b438d:

"rptest.tests.e2e_shadow_indexing_test.EndToEndThrottlingTest.test_throttling.cloud_storage_type=CloudStorageType.S3"

new failures in https://buildkite.com/redpanda/redpanda/builds/48755#018f53ce-1e93-470b-b264-b630585c0952:

"rptest.tests.e2e_shadow_indexing_test.EndToEndThrottlingTest.test_throttling.cloud_storage_type=CloudStorageType.ABS"

new failures in https://buildkite.com/redpanda/redpanda/builds/48755#018f5408-b361-4f2a-a343-5a1fb8065b19:

"rptest.tests.e2e_shadow_indexing_test.EndToEndThrottlingTest.test_throttling.cloud_storage_type=CloudStorageType.ABS"

new failures in https://buildkite.com/redpanda/redpanda/builds/48755#018f5408-b363-423d-afc1-b42eb9a6c9cc:

"rptest.tests.e2e_shadow_indexing_test.EndToEndThrottlingTest.test_throttling.cloud_storage_type=CloudStorageType.S3"

new failures in https://buildkite.com/redpanda/redpanda/builds/48833#018f5920-e6de-4530-90ad-93d8e569d8a5:

"rptest.tests.e2e_shadow_indexing_test.EndToEndThrottlingTest.test_throttling.cloud_storage_type=CloudStorageType.ABS"

new failures in https://buildkite.com/redpanda/redpanda/builds/48833#018f5920-e6e0-4063-b761-4ce64c1a03f2:

"rptest.tests.e2e_shadow_indexing_test.EndToEndThrottlingTest.test_throttling.cloud_storage_type=CloudStorageType.S3"

new failures in https://buildkite.com/redpanda/redpanda/builds/48833#018f5929-2d39-4f31-ba72-b6cbcfa3024b:

"rptest.tests.e2e_shadow_indexing_test.EndToEndThrottlingTest.test_throttling.cloud_storage_type=CloudStorageType.S3"

new failures in https://buildkite.com/redpanda/redpanda/builds/48833#018f5929-2d35-42e3-a932-5f4323721b7a:

"rptest.tests.e2e_shadow_indexing_test.EndToEndThrottlingTest.test_throttling.cloud_storage_type=CloudStorageType.ABS"

self._redpanda.logger.info(cmd)
try:
return subprocess.check_output(cmd, stderr=subprocess.PIPE)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand why we changed away from check_output. It seems like we manually re-implement almost the same semantics as before, which is to log both stderr and stdout if an error occurs (non-zero return code). The main difference seems to be that if stdout is empty we return stderr instead? Is that the main reason for this change?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. I do not want to mix stderr vs stdout and rely only on return code. In the future, once we understand why we have agent fails, we'll add the stderr checks as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

check_output is the equivalent to run(..., check=True, stdout=PIPE).stdout which not includes stderr which holds valuable strings. Also, this makes code clearer and more understandable.

# It will be hard to detect errors as a lot of apps and utils
# just uses stderr as normal output. For example, tsh

process.kill()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You don't need kill here, I don't think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The process class in memory will not be freed in full otherwise. Also, there might be rare situations when RC exists and the process in still alive. Instead of p.join() I use kill to make sure that there will be no process hang of any kind.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, this probably needs update to handle timeouts and it will justify use of kill(). As process is not killed on timeout.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Timeout is a good reason to kill the process, but as you point out we won't kill it here because that's an exception and will bypass this.

I don't think there is any need to call kill() here. Can you provide any reference? The docs certainly don't agree: the require communciate to wait for process termination before returning (unless timeout). I checked the CPython code and kill does not clean up the process class, it simply sends a kill signal to the process, and if the return code is available (which it will be after communicate by construction) it doesn't do anything at all.

if process.returncode != 0:
self.logger.info(f'Command failed (rc={process.returncode}).\n'
f'{_prepare_output(s_out, s_err)}')
raise RuntimeError("Last command returned non-zero exit code")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the key exception that will be surfaced in the results, please try to make it as good as the old one, which included the command string, etc. I think you should probably also throw CalledProcessError for consistency as that's what we were returning before.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. Will do.

Copy link
Member

@travisdowns travisdowns left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The commit says log all output, but AFAICT we only log on error, same as before?

@savex
Copy link
Contributor Author

savex commented May 7, 2024

The commit says log all output, but AFAICT we only log on error, same as before?

In line 159, here, there is debug log for the output

@savex
Copy link
Contributor Author

savex commented May 8, 2024

@travisdowns, here is some examples of tsh output with this change.
Crazy enough, tsh do not respect normal/err channels at all and writes whereever it wants :)

[INFO  - 2024-05-08 15:41:08,427 - kubectl - _local_cmd - lineno:157]: ['tbot', 'start', '--data-dir=/tmp/tbot-data', '--destination-dir=/tmp/machine-id', '--auth-server=proxy.tp.redpanda.com:443', '--join-met
hod=iam', '--token=buildkite-tp', '--certificate-ttl=6h', '--renewal-interval=6h', '--oneshot']
[DEBUG - 2024-05-08 15:41:25,792 - kubectl - _local_cmd - lineno:178]:
--------- stdout -----------

--------- stderr -----------
WARN [TBOT]      CLI parameters are overriding onboarding config from  config/config.go:472
INFO [TBOT]      Created directory "/tmp/tbot-data" config/destination_directory.go:132
INFO [TBOT]      Created directory "/tmp/machine-id" config/destination_directory.go:132
INFO [TBOT]      Anonymous telemetry is not enabled. Find out more about Machine ID's anonymous telemetry at https://goteleport.com/docs/machine-id/reference/telemetry/ tbot/anonymous_telemetry.go:83
INFO [TBOT]      Attempting to generate new identity from token tbot/renew.go:510
INFO [AUTH]      Attempting registration via proxy server. auth/register.go:283
INFO [AUTH]      Attempting to register Bot with IAM method using regional STS endpoint auth/register.go:629
INFO [AUTH]      Successfully registered Bot with IAM method using regional STS endpoint auth/register.go:662
INFO [AUTH]      Successfully registered via proxy server. auth/register.go:290
INFO [TBOT]      Successfully generated new bot identity, valid: after=2024-05-08T15:40:20Z, before=2024-05-08T21:41:19Z, duration=6h0m59s | kind=tls, renewable=false, disallow-reissue=false, roles=[bot-buildk
ite-robot], principals=[-teleport-internal-join], generation=0 tbot/tbot.go:446
INFO [TBOT]      Beginning renewal loop: ttl=6h0m0s interval=6h0m0s tbot/renew.go:726
INFO [TBOT]      Started watching for CA rotations tbot/ca_rotation.go:173
INFO [TBOT]      Attempting to generate new identity from token tbot/renew.go:510
INFO [AUTH]      Attempting registration via proxy server. auth/register.go:283
INFO [AUTH]      Attempting to register Bot with IAM method using regional STS endpoint auth/register.go:629
INFO [AUTH]      Successfully registered Bot with IAM method using regional STS endpoint auth/register.go:662
INFO [AUTH]      Successfully registered via proxy server. auth/register.go:290
INFO [TBOT]      Successfully renewed bot certificates, valid: after=2024-05-08T15:40:22Z, before=2024-05-08T21:41:22Z, duration=6h1m0s | kind=tls, renewable=false, disallow-reissue=false, roles=[bot-buildkite
-robot], principals=[-teleport-internal-join], generation=0 tbot/renew.go:634
INFO [TBOT]      Successfully renewed impersonated certificates for directory /tmp/machine-id, valid: after=2024-05-08T15:40:24Z, before=2024-05-08T21:41:24Z, duration=6h1m0s | kind=tls, renewable=false, disal
low-reissue=true, roles=[teleport-admin], principals=[root redpanda {{internal.logins}} -teleport-internal-join], generation=0 tbot/renew.go:697
INFO [TBOT]      Started watching for CA rotations tbot/ca_rotation.go:173
INFO [TBOT]      Persisted certificates successfully. One-shot mode enabled so exiting. tbot/renew.go:780

and

[INFO  - 2024-05-08 15:41:25,792 - kubectl - _local_cmd - lineno:157]: ['tsh', 'ssh', '--proxy=proxy.tp.redpanda.com:443', '--auth=okta', '--identity=/tmp/machine-id/identity', 'redpanda@cotp907crd7qb34obo0g-a
gent', './breakglass-tools.sh']
[DEBUG - 2024-05-08 15:42:09,972 - kubectl - _local_cmd - lineno:178]:
--------- stdout -----------
Added new context arn:aws:eks:us-west-2:650492572702:cluster/redpanda-cotp907crd7qb34obo0g to /home/redpanda/.kube/config
Reading package lists...
Building dependency tree...
Reading state information...
Suggested packages:
  zip
The following NEW packages will be installed:
  unzip
0 upgraded, 1 newly installed, 0 to remove and 6 not upgraded.
Need to get 175 kB of archives.
After this operation, 386 kB of additional disk space will be used.
Get:1 http://us-west-2.ec2.archive.ubuntu.com/ubuntu jammy-updates/main amd64 unzip amd64 6.0-26ubuntu3.2 [175 kB]
Fetched 175 kB in 0s (5124 kB/s)
Selecting previously unselected package unzip.
(Reading database ...
(Reading database ... 5%
(Reading database ... 10%
(Reading database ... 15%
(Reading database ... 20%
(Reading database ... 25%
(Reading database ... 30%
(Reading database ... 35%
(Reading database ... 40%
(Reading database ... 45%
(Reading database ... 50%
(Reading database ... 55%
(Reading database ... 60%
(Reading database ... 65%
(Reading database ... 70%
(Reading database ... 75%
(Reading database ... 80%
(Reading database ... 85%
(Reading database ... 90%
(Reading database ... 95%
(Reading database ... 100%
(Reading database ... 72472 files and directories currently installed.)
Preparing to unpack .../unzip_6.0-26ubuntu3.2_amd64.deb ...
Unpacking unzip (6.0-26ubuntu3.2) ...
Setting up unzip (6.0-26ubuntu3.2) ...
Processing triggers for man-db (2.10.2-1) ...
NEEDRESTART-VER: 3.5
NEEDRESTART-KCUR: 6.5.0-1018-aws
NEEDRESTART-KEXP: 6.5.0-1018-aws
NEEDRESTART-KSTA: 1
=== Redpanda Agent VM tools ready

Tools installed:
- kubectl
- k alias for kubectl
- k9s
- grpcurl
- rpk
- stern

--------- stderr -----------
=== installing kubectl
=== creating k alias to kubectl
=== installing k9s
=== AWS VM detected. Running AWS specific configurations
error: current-context is not set
=== configuring k8s context
=== installing grpcurl
=== installing unzip
=== installing rpk
=== installing stern

@savex savex requested a review from travisdowns May 8, 2024 15:49
@savex
Copy link
Contributor Author

savex commented May 8, 2024

Test run using integration and tier-1 config

test_id:    rptest.redpanda_cloud_tests.rolling_restart_test.RollingRestartTest.test_restart_pod
status:     PASS
run time:   40 minutes 58.533 seconds
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_id:    rptest.redpanda_cloud_tests.rolling_restart_test.RollingRestartTest.test_rolling_restart
status:     PASS
run time:   5 minutes 27.899 seconds
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_id:    rptest.redpanda_cloud_tests.cloud_self_test.RedpandaCloudSelfTest.test_healthy
status:     PASS
run time:   2 minutes 11.526 seconds
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
========================================================================================================================================================================================================================================================================================================================
SESSION REPORT (ALL TESTS)
ducktape version: 0.8.18
session_id:       2024-05-08--001
run time:         48 minutes 43.116 seconds
tests run:        3
passed:           3
flaky:            0
failed:           0
ignored:          0
opassed:          0
ofailed:          0
opassedfips:      0
ofailedfips:      0
========================================================================================================================================================================================================================================================================================================================

@travisdowns
Copy link
Member

In line 159, here, there is debug log for the output

Thanks missed this.

stderr=subprocess.PIPE,
text=True)
s_out, s_err = process.communicate()
self.logger.debug(_prepare_output(s_out, s_err))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the failure case, doesn't this mean we log the output 2x to the debug log?

You could just log once and choose between "debug" and "info" based on whether the process failed or not.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. Updated.

@travisdowns
Copy link
Member

travisdowns commented May 8, 2024

@travisdowns, here is some examples of tsh output with this change.
Crazy enough, tsh do not respect normal/err channels at all and writes whereever it wants :)

I'm not following. This looks like the normal tsh output that we saw in the console prior to my recent change to capture it (and output it on error). What does you mean that it does not respect normal/err channels at all? It is normal to output diagnostic information and logs to stderr if that's what you mean.

The second output looks totally expected to me, it's tsh ssh so tsh itself should not emit anything to either channel but simply pass through whatever the commands on the other side write (which it looks to me like it's doing).

Copy link
Member

@travisdowns travisdowns left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have access to PESDLC0-1319, so I may be missing some key context on the purpose of this change.

@savex
Copy link
Contributor Author

savex commented May 8, 2024

@travisdowns, here is some examples of tsh output with this change.
Crazy enough, tsh do not respect normal/err channels at all and writes whereever it wants :)

I'm not following. This looks like the normal tsh output that we saw in the console prior to my recent change to capture it (and output it on error). What does you mean that it does not respect normal/err channels at all? It is normal to output diagnostic information and logs to stderr if that's what you mean.

The second output looks totally expected to me, it's tsh ssh so tsh itself should not emit anything to either channel but simply pass through whatever the commands on the other side write (which it looks to me like it's doing).

Having all output is critical for debug purposes. There might be something that will help to detect errors.

@savex
Copy link
Contributor Author

savex commented May 8, 2024

I don't have access to PESDLC0-1319, so I may be missing some key context on the purpose of this change.

There is nothing critical there.

@savex savex requested a review from travisdowns May 8, 2024 20:53
Copy link
Member

@travisdowns travisdowns left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is nothing critical there.

Then please include in the cover letter the context and specific goals for this change, so I can evaluate on those merits.

@savex
Copy link
Contributor Author

savex commented May 9, 2024

There is nothing critical there.

Then please include in the cover letter the context and specific goals for this change, so I can evaluate on those merits.

Done.

savex added 2 commits May 9, 2024 13:36
PESDLC-1319 Use exceptions and add timeout

PESDLC-1319 Updated debug logging based on RC
   New kubectl _localcmd method uses text mode.
   So calls to decode() should be removed
@savex savex force-pushed the PESDLC-1319-log-all-kubectl-output branch from d617faa to ab4b1f8 Compare May 9, 2024 18:36
@savex
Copy link
Contributor Author

savex commented May 9, 2024

Resolved conflicts after log copy merging and squashed multiple changes in the same place to one

@savex savex requested a review from a team May 9, 2024 18:38
@savex savex requested a review from travisdowns May 9, 2024 20:17
@savex
Copy link
Contributor Author

savex commented May 9, 2024

Test run after resolving conflicts

test_id:    rptest.redpanda_cloud_tests.rolling_restart_test.RollingRestartTest.test_restart_pod
status:     PASS
run time:   43 minutes 40.756 seconds
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_id:    rptest.redpanda_cloud_tests.rolling_restart_test.RollingRestartTest.test_rolling_restart
status:     PASS
run time:   7 minutes 14.169 seconds
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_id:    rptest.redpanda_cloud_tests.cloud_self_test.RedpandaCloudSelfTest.test_healthy
status:     PASS
run time:   2 minutes 9.011 seconds
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
========================================================================================================================================================================================================================================================================================================================
SESSION REPORT (ALL TESTS)
ducktape version: 0.8.18
session_id:       2024-05-09--001
run time:         53 minutes 8.957 seconds
tests run:        3
passed:           3
flaky:            0
failed:           0
ignored:          0
opassed:          0
ofailed:          0
opassedfips:      0
ofailedfips:      0
========================================================================================================================================================================================================================================================================================================================

Copy link
Member

@travisdowns travisdowns left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the added clarity.

Just for a note, I haven't actually seen any problems with stderr/stdout going to the "wrong" place with nested commands like tsh ssh kubectl foo - that is, it seemed to me that the output from foo would end up on the same channel it was output all the way back to the consuming code in Python. Have you seen different?

@savex
Copy link
Contributor Author

savex commented May 13, 2024

Yes, when running kubectl exec commands. Sometimes output streams get mixed up. Not sure why.

@savex savex merged commit 5629704 into dev May 13, 2024
16 checks passed
@savex savex deleted the PESDLC-1319-log-all-kubectl-output branch May 13, 2024 14:13
@vbotbuildovich
Copy link
Collaborator

/backport v24.1.x

@vbotbuildovich
Copy link
Collaborator

Failed to create a backport PR to v24.1.x branch. I tried:

git remote add upstream https://github.com/redpanda-data/redpanda.git
git fetch --all
git checkout -b backport-pr-18271-v24.1.x-93 remotes/upstream/v24.1.x
git cherry-pick -x aa1ac6ddcb5be7e2fc0a0a06240b6713331c6033 ab4b1f80d17a98ab2bf1ce12ac3579178978a624 214634a56b0f9ce3749fb06d6795f83e8aadda22

Workflow run logs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants