Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bug: sidecar_front_io::transceivers::wait_and_check_i2c is not checking all ports #1754

Closed
Aaron-Hartwig opened this issue Apr 18, 2024 · 3 comments · Fixed by #1756
Closed
Assignees

Comments

@Aaron-Hartwig
Copy link
Contributor

On the morning of 2024-04-18 @Nieuwejaar was updating the dogfood rack and @citrus-it noticed no transceivers were showing up:

root@oxz_switch0:~# swadm sp transceiver ls
Port  State  Reset  Interrupt  Power  Power control  Vendor  Part No.  Serial No.
root@oxz_switch0:~#

Andy then noticed the transceivers task repeatedly crashing on the Sidecar in question:

jeeves% humility dump --list
humility: connecting to fe80::aa40:25ff:fe05:400%10
humility: using UDP dump agent
AREA TASK                  TIME       SIZE
   0 transceivers          1766657    6384
   2 transceivers          1766758    6164
   4 transceivers          1766859    6164
   6 transceivers          1772528    6348
   8 transceivers          1772631    6156
  10 transceivers          1772731    6156
  12 transceivers          1772889    6304
  14 transceivers          1772992    6156
  16 transceivers          1773093    6156
  18 transceivers          1773270    6336
  20 transceivers          1773371    6160
  22 transceivers          1773472    6160
  24 transceivers          1773610    6300
  26 transceivers          1773711    6156
  28 transceivers          1773813    6156
  30 transceivers          1773971    6304
  32 transceivers          1774072    6156
  34 transceivers          1774173    6156
  36 transceivers          1774332    6300
  38 transceivers          1774433    6144
  40 transceivers          1774534    6144
  42 transceivers          1774705    6332
  44 transceivers          1774806    6160
  46 transceivers          1774907    6160
  48 transceivers          1775076    6304
  50 transceivers          1775179    6148
  52 transceivers          1775280    6148
  54 transceivers          1775437    6304

The panic: https://github.com/oxidecomputer/hubris/blob/master/drv/sidecar-front-io/src/transceivers.rs#L1237

ID TASK                       GEN PRI STATE
15 transceivers                16   6 FAULT: panicked at drv/sidecar-front-io/src/transceivers.rs:1237:21:
index out of bounds: the len is 8 but the index is 9 (was: ready)

This is because we are erroneously only reading 8 ports of status bytes instead of all 16.

So, why did this not happen before #1588? Previously, each port's status information was in a more condensed format and we could fit 2 ports of status per each byte. When that changed to be 1 port per byte, the code to handle status was never updated. #1588 then changed that to be one byte per port without updating the BusyAndPortStatus struct to read 16 status bytes instead of 8.

@Aaron-Hartwig
Copy link
Contributor Author

I've reproduced this on niles:

aaron@niles ~ $ pfexec humility -t sidecar-b -a build-sidecar-b-image-default.zip tasks
humility: WARNING: archive on command-line overriding archive in environment file
humility: attached to 0483:3754:002600184D4B500E20373831 via ST-Link V3
system time = 31818
ID TASK                       GEN PRI STATE
...
15 transceivers               117   6 FAULT: panicked at 'index out of bounds: the len is 8 but the 
index is 8', drv\sidecar-front-io\src\transceivers.rs:1235:21 (was: ready)
...

With the patch from #1756 the task no longer panics.

@Aaron-Hartwig
Copy link
Contributor Author

This bug became immediately apparent when testing in a context where the Sidecar has a Scrimlet and transceivers in the out of bounds ports (8..15, 24..31). All testing of #1588 was done on a Sidecar setup where this was not the case. So for the hardware-in-loop future, a setup where we have at least one transceiver in each range of 0..7, 8..15, 16..23, 24..32 can help expose if there was a bug introduced by the various logical-physical mappings at play here.

@Aaron-Hartwig
Copy link
Contributor Author

The alternative to the Sidecar/Scrimlet setup would just a Sidecar connected to a host with xcvradm present. The flow to reproduce would be to run an xcvradm command that results in I2C communication to a transceiver in each of the previously mentioned ranges.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant