CI: Conformance Ginkgo: Provision LVH VMs #32400

Closed · thorn3r opened this issue May 7, 2024 · 6 comments · Fixed by cilium/little-vm-helper#199
Labels: area/CI (Continuous Integration testing issue or flake), ci/flake (This is a known failure that occurs in the tree. Please investigate me!)

Comments

thorn3r (Contributor) commented May 7, 2024

CI failure

I've seen this failure a couple of times in the past day. It looks like the LVH VM took longer than usual to provision and breached the timeout. It's not immediately clear from scanning the logs what is taking so long.

Run n=0
  n=0
  started=0
  until [ "$n" -ge 300 ]; do
    if grep -E ".*OK.*Started.*ssh.*" /tmp/console.log; then
      started=1
      break
    elif grep -E ".*FAILED.*Failed.*to.*start.*ssh*" /tmp/console.log; then
      cat /tmp/console.log
      exit 40
    fi
    n=$((n+1))
    sleep 1
  done
  if [ $started -eq 0 ]; then
    cat /tmp/console.log
    exit 41
  fi
  
  n=0
  success=0
  until [ "$n" -ge 5 ]; do
    if ssh -p 2222 -o "StrictHostKeyChecking=no" root@localhost exit; then
      success=1
      break
    fi
    n=$((n+1))
    sleep 1
  done
  if [ $success -eq 0 ]; then
    cat /tmp/console.log
    exit 42
  fi
--snip--
[  OK  ] Reached target multi-user.target - Multi-User System.

[  OK  ] Reached target graphical.target - Graphical Interface.

         Starting systemd-update-utmp-runle…- Record Runlevel Change in UTMP...

[  OK  ] Finished systemd-update-utmp-runle…e - Record Runlevel Change in UTMP.



Debian GNU/Linux trixie/sid kind-bpf-next ttyS0

kind-bpf-next login: 
Error: Process completed with exit code 41.

Full output in this gist: https://gist.github.com/thorn3r/d9d138759c8bbc639e8da138803af243
Workflow run: https://github.com/cilium/cilium/actions/runs/8978306359/job/24662060373
PR: #32353

thorn3r added the area/CI and ci/flake labels on May 7, 2024
joestringer (Member) commented

Looking at the timestamps provided by GitHub, it's just the lvh run command that took so long. Not sure if it's possible to tee the output or otherwise access /tmp/console.log from the sysdump - if not, then maybe improving the CI to the point where we can access that would be a first step. Beyond that maybe we could do with an lvh run --debug option to provide additional details.
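
A minimal sketch of one way to stream /tmp/console.log live alongside the wait loop (purely hypothetical; how the action actually exposes the log is discussed further down in this thread):

```sh
# Hypothetical: stream the VM console log into the workflow output in the
# background, so its contents are visible even if a later step never runs.
tail -n +1 -F /tmp/console.log &
tail_pid=$!

# ... existing wait-for-ssh loop runs here unchanged ...

kill "$tail_pid" 2>/dev/null || true
```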

thorn3r (Contributor, Author) commented May 7, 2024

No sysdump, unfortunately; it's failing too early. Hit another one here: https://github.com/cilium/cilium/actions/runs/8991412734/workflow

thorn3r (Contributor, Author) commented May 9, 2024

Hit a similar one in Conformance E2E
Workflow: https://github.com/cilium/cilium/actions/runs/9011258778/job/24758598656
PR: #32403

mhofstetter (Member) commented May 15, 2024

> Looking at the timestamps provided by GitHub, it's just the lvh run command that took so long. Not sure if it's possible to tee the output or otherwise access /tmp/console.log from the sysdump - if not, then maybe improving the CI to the point where we can access that would be a first step. Beyond that maybe we could do with an lvh run --debug option to provide additional details.

The LVH GitHub Action already outputs the content of /tmp/console.log in case of a failure or timeout: https://github.com/cilium/little-vm-helper/blob/main/action.yaml#L190-L206 (optimized to analyze a previous LVH issue)

Error code 41 indicates that starting the LVH SSH server exceeded the default timeout of 300 retries with a 1s sleep.

Taking a closer look at the console output of the reported workflow runs, it looks like the issue is a parallel write / interleaved output in the console file exactly on the log line that we grep to verify a successful start of the SSH server (grep -E ".*OK.*Started.*ssh.*" /tmp/console.log doesn't match across the newline).

https://github.com/cilium/cilium/actions/runs/8978306359/job/24662060373

[  OK  ] Started     4.892581] e2scrub_all (320) used greatest stack depth: 12480 bytes left
1;39mssh.service - OpenBSD Secure Shell server.

https://github.com/cilium/cilium/actions/runs/9011258778/job/24758598656

[  OK      4.925631] e2scrub_all (322) used greatest stack depth: 13128 bytes left
0m] Started ssh.service - OpenBSD Secure Shell server.

https://github.com/cilium/cilium/actions/runs/8991412734/attempts/1

[  OK  ] Started     4.887573] e2scrub_all (319) used greatest stack depth: 12416 bytes left
1;39mssh.service - OpenBSD Secure Shell server.

The lvh flag --console-log-file seems to be passed directly on to qemu - I don't know how much we can influence this behavior.

But I can try to make the regex tolerant of line breaks, hoping that the output doesn't also break in the middle of words 😕 A rough sketch of that idea is below.
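
A minimal sketch, assuming GNU tr/grep and that the marker words themselves are not split; flattening newlines first means an interleaved kernel message can no longer break the match (illustrative only, not necessarily the fix that landed in cilium/little-vm-helper#199):

```sh
# Flatten the console log into a single line so interleaved writes cannot
# split the "Started ssh" message across lines before the grep runs.
if tr -d '\n' < /tmp/console.log | grep -qE "OK.*Started.*ssh"; then
  started=1
fi
```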

edit: or we could simply skip the "ssh availability log check" and rely only on retrying the SSH connection itself (and increase the number of connection attempts), roughly as in the sketch below
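
A minimal sketch of that alternative, reusing the port and flags from the current step; the retry count of 300 is illustrative, not the value ultimately chosen, and the ConnectTimeout option is an added assumption to keep each attempt short:

```sh
# Skip the console-log grep entirely and rely on retrying the SSH connection.
n=0
success=0
until [ "$n" -ge 300 ]; do
  if ssh -p 2222 -o "StrictHostKeyChecking=no" -o "ConnectTimeout=1" root@localhost exit; then
    success=1
    break
  fi
  n=$((n+1))
  sleep 1
done
if [ "$success" -eq 0 ]; then
  cat /tmp/console.log
  exit 42
fi
```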

thorn3r (Contributor, Author) commented May 15, 2024

mhofstetter (Member) commented May 16, 2024

Reopening until an LVH release (with the potential fix in it) lands in cilium/cilium.

edit: related dependency update PR: #32566
