CI: Conformance Ginkgo: Provision LVH VMs #32400

Closed · thorn3r opened this issue May 7, 2024 · 6 comments · Fixed by cilium/little-vm-helper#199
Labels: area/CI (Continuous Integration testing issue or flake), ci/flake (This is a known failure that occurs in the tree. Please investigate me!)

Comments

thorn3r (Contributor) commented May 7, 2024

CI failure

I've seen this failure a couple of times in the past day. It looks like the LVH VM took longer than usual to provision and breached the timeout. It's not immediately clear from scanning the logs what is taking so long.

Run n=0
  n=0
  started=0
  until [ "$n" -ge 300 ]; do
    if grep -E ".*OK.*Started.*ssh.*" /tmp/console.log; then
      started=1
      break
    elif grep -E ".*FAILED.*Failed.*to.*start.*ssh*" /tmp/console.log; then
      cat /tmp/console.log
      exit 40
    fi
    n=$((n+1))
    sleep 1
  done
  if [ $started -eq 0 ]; then
    cat /tmp/console.log
    exit 41
  fi
  
  n=0
  success=0
  until [ "$n" -ge 5 ]; do
    if ssh -p 2222 -o "StrictHostKeyChecking=no" root@localhost exit; then
      success=1
      break
    fi
    n=$((n+1))
    sleep 1
  done
  if [ $success -eq 0 ]; then
    cat /tmp/console.log
    exit 42
  fi
--snip--
[  OK  ] Reached target multi-user.target - Multi-User System.

[  OK  ] Reached target graphical.target - Graphical Interface.

         Starting systemd-update-utmp-runle…- Record Runlevel Change in UTMP...

[  OK  ] Finished systemd-update-utmp-runle…e - Record Runlevel Change in UTMP.



Debian GNU/Linux trixie/sid kind-bpf-next ttyS0

kind-bpf-next login: 
Error: Process completed with exit code 41.

Full output in this gist: https://gist.github.com/thorn3r/d9d138759c8bbc639e8da138803af243
Workflow run: https://github.com/cilium/cilium/actions/runs/8978306359/job/24662060373
PR: #32353

thorn3r added the area/CI and ci/flake labels on May 7, 2024
joestringer (Member) commented

Looking at the timestamps provided by GitHub, it's just the lvh run command that took so long. Not sure if it's possible to tee the output or otherwise access /tmp/console.log from the sysdump - if not, then maybe improving the CI to the point where we can access that would be a first step. Beyond that maybe we could do with an lvh run --debug option to provide additional details.
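
A minimal sketch of one way to stream /tmp/console.log live alongside the wait loop (purely hypothetical; how the action actually exposes the log is discussed further down in this thread):

```sh
# Hypothetical: stream the VM console log into the workflow output in the
# background, so its contents are visible even if a later step never runs.
tail -n +1 -F /tmp/console.log &
tail_pid=$!

# ... existing wait-for-ssh loop runs here unchanged ...

kill "$tail_pid" 2>/dev/null || true
```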

thorn3r (Contributor, Author) commented May 7, 2024

No sysdump, unfortunately; it's failing too early. Hit another one here: https://github.com/cilium/cilium/actions/runs/8991412734/workflow

thorn3r (Contributor, Author) commented May 9, 2024

Hit a similar one in Conformance E2E
Workflow: https://github.com/cilium/cilium/actions/runs/9011258778/job/24758598656
PR: #32403

mhofstetter (Member) commented May 15, 2024

> Looking at the timestamps provided by GitHub, it's just the lvh run command that took so long. Not sure if it's possible to tee the output or otherwise access /tmp/console.log from the sysdump - if not, then maybe improving the CI to the point where we can access that would be a first step. Beyond that maybe we could do with an lvh run --debug option to provide additional details.

The LVH GitHub Action already outputs the content of /tmp/console.log in case of a failure or timeout: https://github.com/cilium/little-vm-helper/blob/main/action.yaml#L190-L206 (optimized to analyze a previous LVH issue)

Error code 41 indicates that starting the LVH SSH server exceeded the default timeout of 300 retries with a 1s sleep.

Taking a closer look at the console output of the reported workflow runs, it looks like the issue is a parallel write / interleaved output in the console file exactly on the log line that we grep to verify a successful start of the SSH server (grep -E ".*OK.*Started.*ssh.*" /tmp/console.log doesn't match across the newline).

https://github.com/cilium/cilium/actions/runs/8978306359/job/24662060373

[  OK  ] Started     4.892581] e2scrub_all (320) used greatest stack depth: 12480 bytes left
1;39mssh.service - OpenBSD Secure Shell server.

https://github.com/cilium/cilium/actions/runs/9011258778/job/24758598656

[  OK      4.925631] e2scrub_all (322) used greatest stack depth: 13128 bytes left
0m] Started ssh.service - OpenBSD Secure Shell server.

https://github.com/cilium/cilium/actions/runs/8991412734/attempts/1

[  OK  ] Started     4.887573] e2scrub_all (319) used greatest stack depth: 12416 bytes left
1;39mssh.service - OpenBSD Secure Shell server.

The lvh flag --console-log-file seems to be passed directly on to qemu - I don't know how much we can influence this behavior.

But I can try to make the regex tolerant of line breaks, hoping that the output doesn't also break in the middle of words 😕 A rough sketch of that idea is below.
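
A minimal sketch, assuming GNU tr/grep and that the marker words themselves are not split; flattening newlines first means an interleaved kernel message can no longer break the match (illustrative only, not necessarily the fix that landed in cilium/little-vm-helper#199):

```sh
# Flatten the console log into a single line so interleaved writes cannot
# split the "Started ssh" message across lines before the grep runs.
if tr -d '\n' < /tmp/console.log | grep -qE "OK.*Started.*ssh"; then
  started=1
fi
```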

edit: or we could simply skip the "ssh availability log check" and rely only on retrying the SSH connection itself (and increase the number of connection attempts), roughly as in the sketch below
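
A minimal sketch of that alternative, reusing the port and flags from the current step; the retry count of 300 is illustrative, not the value ultimately chosen, and the ConnectTimeout option is an added assumption to keep each attempt short:

```sh
# Skip the console-log grep entirely and rely on retrying the SSH connection.
n=0
success=0
until [ "$n" -ge 300 ]; do
  if ssh -p 2222 -o "StrictHostKeyChecking=no" -o "ConnectTimeout=1" root@localhost exit; then
    success=1
    break
  fi
  n=$((n+1))
  sleep 1
done
if [ "$success" -eq 0 ]; then
  cat /tmp/console.log
  exit 42
fi
```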

thorn3r (Contributor, Author) commented May 15, 2024

mhofstetter (Member) commented May 16, 2024

Reopening until an LVH release (with the potential fix in it) lands in cilium/cilium.

edit: related dependency update PR: #32566
