compatible: add system logging #136

Open
wants to merge 10 commits into
base: main
52 changes: 52 additions & 0 deletions .github/workflows/integration_test_charm.yaml
@@ -328,6 +328,58 @@ jobs:
run: |
juju switch test
mkdir ~/logs/
- name: Run SOS reports
if: ${{ failure() && steps.tests.outcome == 'failure' }}
run: |
sudo snap install sosreport --channel=latest/stable --classic
Contributor

we should maybe install from apt

I'd like to use the newer version, but the snap publisher isn't verified

and given that we might be passing GitHub secrets to sosreport, it might be a security risk if the snap maintainer is compromised

# Needed as sosreport does not like 100+ char long paths
mkdir /tmp/sos
sudo sos report \
--only-plugins kubernetes,systemd,logs,juju \
--enable-plugins kubernetes,juju \
-k kubernetes.describe=true -k kubernetes.podlogs=true -k kubernetes.all=true \
--batch \
--clean \
--tmp-dir=/tmp/sos \
-z gzip
- name: Run SOS in LXCs if Needed
if: ${{ inputs.cloud == 'lxd' && (failure() && steps.tests.outcome == 'failure') }}
Comment on lines +345 to +346
Contributor

could you separate this step into another PR?

I'd like to get the other changes merged quickly & I think there's some complexity/subtleties (especially around secret redaction) with sosreport inside the lxc container that will hold up the other changes

run: |
if [ "$(sudo lxc list -f csv | wc -l)" -eq 0 ]; then
echo "No containers available, nothing to collect logs for..."
exit 0
fi

juju exec --parallel=true --all -- sudo snap install sosreport --channel=latest/stable --classic
sudo snap install jq
export NODES
NODES="$(juju status --format=json | jq -r '.machines[]|."ip-addresses"[0]' | paste -s -d, -)"
echo "Found nodes: $NODES"

echo "Total space before running command:"
sudo df -h

sudo sos collect \
-i ~/.local/share/juju/ssh/juju_id_rsa --ssh-user ubuntu --no-local \
--nodes "$NODES" \
--only-plugins systemd,logs,juju \
-k logs.all_logs=true \
--batch \
--clean \
--tmp-dir=/tmp/sos \
-z gzip -j 1
- name: Prepare upload - local reports
if: ${{ failure() && steps.tests.outcome == 'failure' }}
run: |
I="$(whoami)"
sudo chown -R "$I" /tmp/sos/
mv /tmp/sos/*.tar.gz ~/logs/

- name: Print kernel messages
if: ${{ failure() }}
run: |
sudo dmesg

- name: juju status
if: ${{ success() || (failure() && steps.tests.outcome == 'failure') }}
run: juju status --color --relations | tee ~/logs/juju-status.txt
3 changes: 2 additions & 1 deletion README.md
@@ -58,4 +58,5 @@ Note: all workflows in this repository share a version number. If a breaking cha
If you do not want to use Renovate, pin to the latest major version (e.g. `v1`).

## Contributing
See [CONTRIBUTING.md](CONTRIBUTING.md)
See [CONTRIBUTING.md](CONTRIBUTING.md)
See [TROUBLESHOOTING_CHARMS.md](TROUBLESHOOTING_CHARMS.md)
58 changes: 58 additions & 0 deletions TROUBLESHOOTING_CHARMS.md
@@ -0,0 +1,58 @@
Whenever a test fails, data-platform-workflows captures the logs of that run using [sosreport](https://github.com/sosreport/sos).

The logs can be downloaded from the run's "Summary" page.

sosreport runs on the runner itself and captures logs from the host as well as from the model's containers (LXC / k8s).

# Log structure

```
/
|
+---- juju-debug-log.txt: captured at the end of the run
|
+---- juju-status.txt: captured at the end of the run
|
+---- sos-collector-...
|
+---- sosreport-...
```
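
Once downloaded, the tarballs have to be unpacked before the paths described in the next sections can be browsed. A minimal sketch, assuming the tarballs sit in one local directory and are gzip-compressed (the workflow passes `-z gzip` and moves `*.tar.gz` into the logs folder); `extract_sos_tarballs` is a hypothetical helper name, not something shipped by this repository:

```shell
# Hypothetical helper: unpack every sos tarball found in a directory.
# $1: directory that holds the downloaded *.tar.gz files (assumption)
extract_sos_tarballs() (
    logs_dir="$1"
    for tarball in "$logs_dir"/*.tar.gz; do
        # When nothing matches, the glob stays literal: skip it.
        [ -e "$tarball" ] || continue
        tar -xzf "$tarball" -C "$logs_dir"
    done
)
```

For example, `extract_sos_tarballs ./logs` leaves one extracted directory per tarball next to the original files; `./logs` is only an illustrative path.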

## GitHub runner logs

The `sosreport-...` tarball contains the host logs: its syslog, journal and kernel logs.

Relevant logs:
* /var/log/kern.log and /var/log/syslog: OS-related logs, including kernel
* /sos_commands/kubernetes/: logs related to the k8s infra and its pods
* /sos_commands/logs/: journalctl outputs
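
A quick first pass over these locations could look like the sketch below. It assumes the tarball has already been extracted and that a case-insensitive match on `error` or `fail` is a useful starting filter; `show_errors` is a hypothetical helper name:

```shell
# Hypothetical helper: list suspicious lines from the host logs.
# $1: path to an extracted sosreport-... directory
show_errors() (
    report_dir="$1"
    for path in \
        "$report_dir/var/log/syslog" \
        "$report_dir/var/log/kern.log" \
        "$report_dir/sos_commands/logs"; do
        # Not every report contains every location: skip what is absent.
        [ -e "$path" ] || continue
        # -H prints the file name, -n the line number; no match is not an error.
        grep -r -H -n -i -E 'error|fail' "$path" || true
    done
)
```

The pattern is deliberately broad, so the output is a list of places to look at rather than a diagnosis.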

## LXC logs

The workflow also runs `sos collect` against each LXC container in the model, if there are any.

The goal is to collect the system-level logs of the containers, as well as juju's.

These logs are in the `sos-collector-...` tarball, in which each container has its own `sosreport-...`.

Each of those tarballs contains a subset of the logs mentioned in the previous section (since logs such as kern.log or
the k8s ones do not make sense within LXC containers).
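
Because the per-container reports are nested inside the collector tarball, they need a second extraction step. The sketch below assumes the `sos-collector-...` tarball has already been unpacked and that the nested reports end in `.tar.gz` or `.tar.xz`; `extract_node_reports` is a hypothetical helper name:

```shell
# Hypothetical helper: unpack each per-container report in place.
# $1: directory where the sos-collector-... tarball was extracted
extract_node_reports() (
    collector_dir="$1"
    find "$collector_dir" -type f \
        \( -name 'sosreport-*.tar.gz' -o -name 'sosreport-*.tar.xz' \) |
    while IFS= read -r tarball; do
        # Extract next to the tarball so every container keeps its own folder.
        tar -xf "$tarball" -C "$(dirname "$tarball")"
    done
)
```

Matching only the two archive suffixes keeps any sidecar files stored alongside the reports from being passed to `tar`.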

# Missing any extra logs?

If any logs are missing, e.g. logs in specific folders of /var/snap, then the steps are:
1) Extend an existing sosreport plugin or add a new one
2) Add it as an extra plugin (if needed) to `integration_test_charm.yaml`.

It is not enough for the sosreport PR to be merged upstream: the change must also make its way into
sosreport's official snap and the [packages in Ubuntu](https://packages.ubuntu.com/search?suite=all&arch=any&searchon=names&keywords=sosreport).

# Notes

Note that these commands run at the end of the test, and only if it fails; therefore, a container that was
created and destroyed during the test will not show up in the sosreports. However, the juju debug logs contain every
log sent to the controller, and hence also the history of destroyed units.
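
To follow a unit that no longer exists, its lines can be filtered out of `juju-debug-log.txt`. This sketch assumes each debug-log line starts with the entity tag followed by a colon (for example `unit-mysql-0:` for the unit `mysql/0`); `unit_history` and the unit name are illustrative only:

```shell
# Hypothetical helper: print the debug-log lines emitted by one unit.
# $1: path to juju-debug-log.txt
# $2: unit name as shown by juju status, e.g. mysql/0
unit_history() (
    log_file="$1"
    unit="$2"
    # Juju tags replace the slash with a dash: mysql/0 -> unit-mysql-0
    tag="unit-$(printf '%s' "$unit" | tr '/' '-')"
    # The trailing colon stops unit-mysql-0 from also matching unit-mysql-10.
    grep -F "$tag:" "$log_file"
)
```

Lines in which the unit is only mentioned by another entity are not matched; drop the colon from the pattern to include those as well.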

If the syslog file has no recent logs, check /sos_commands/logs for the journalctl outputs. Normally, they
contain the same logs, but journalctl may be more up-to-date.
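
One way to see which source is fresher is to print the last line of each. The sketch below assumes an extracted report and that the journalctl outputs are files whose names start with `journalctl` under `sos_commands/logs`; `compare_log_tails` is a hypothetical helper name:

```shell
# Hypothetical helper: show the newest entry of syslog and of each journalctl dump.
# $1: path to an extracted sosreport-... directory
compare_log_tails() (
    report_dir="$1"
    if [ -f "$report_dir/var/log/syslog" ]; then
        printf '== var/log/syslog ==\n'
        tail -n 1 "$report_dir/var/log/syslog"
    fi
    for dump in "$report_dir"/sos_commands/logs/journalctl*; do
        # Skip the literal pattern when no journalctl output was captured.
        [ -f "$dump" ] || continue
        printf '== %s ==\n' "$(basename "$dump")"
        tail -n 1 "$dump"
    done
)
```

Comparing the timestamps of the printed lines shows whether syslog stopped receiving entries before the journal did.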