Firmware should sanitize the mailbox memory in response to double-bit (uncorrectable) ECC error #1509

calebofearth · 2024-05-13T20:05:47Z

Background

Related to chipsalliance/caliptra-rtl#399

The fix for that logic issue was:

SRAM must be initialized/zeroized at time 0 by the SoC - integrator responsibility
HW fix for the bug is postponed as a 2.0 code improvement

However, because of the issue described in chipsalliance/caliptra-rtl#340 (which is rooted in the same HW logic as chipsalliance/caliptra-rtl#399), there remains a possibility that a double-bit ECC error in Mailbox RAM triggers error interrupts in subsequent operations even when no error occurred in those operations. This is a rare edge case that requires the corrupted dword to lie exactly at the end of the dlen provisioned for a future mailbox command.

Solution/Workaround

When an uncorrectable ECC error is detected, Caliptra should fail the active command, and firmware should zeroize the Mailbox SRAM. This will not be necessary after 1.1 hardware.

Caveat

To zeroize the SRAM, Caliptra firmware must acquire the mailbox lock. Because the error may occur at any time, firmware may contend with the SoC when requesting the lock. Firmware should not spin in an unbounded loop trying to acquire the lock, as this may cause a deadlock.
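To illustrate the caveat, here is a minimal sketch of a bounded lock attempt in place of a forever loop. The `MailboxLock` trait, `acquire_bounded`, `LockError`, and `FakeMailbox` are all hypothetical names made up for this example; none of them is the caliptra-sw driver API, and the attempt limit is an arbitrary parameter.

```rust
/// Hypothetical abstraction over the mailbox lock (NOT the real caliptra-sw API).
pub trait MailboxLock {
    /// Attempts to take the lock once; returns true if the caller now holds it.
    fn try_lock(&mut self) -> bool;
}

#[derive(Debug, PartialEq, Eq)]
pub enum LockError {
    /// The lock was still held by another agent after every attempt.
    Contended,
}

/// Tries to acquire the lock at most `max_attempts` times and then gives up,
/// so the caller can fall back to another strategy instead of spinning forever.
pub fn acquire_bounded<M: MailboxLock>(mbox: &mut M, max_attempts: u32) -> Result<(), LockError> {
    for _ in 0..max_attempts {
        if mbox.try_lock() {
            return Ok(());
        }
    }
    Err(LockError::Contended)
}

/// Test double (assumption): another agent holds the lock for `busy_polls`
/// polls and then releases it.
pub struct FakeMailbox {
    pub busy_polls: u32,
}

impl MailboxLock for FakeMailbox {
    fn try_lock(&mut self) -> bool {
        if self.busy_polls == 0 {
            true
        } else {
            self.busy_polls -= 1;
            false
        }
    }
}
```

On `Err(LockError::Contended)` the caller decides what to do next (for example, defer the sanitization or force-unlock); the sketch only guarantees that the wait terminates.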

calebofearth · 2024-05-17T20:52:21Z

@korran
A couple of methodology comments/case studies:

General methods

uC can clear the mailbox via direct-mode accesses if the mailbox state is mbox_execute_uc.
In other cases, uC may need to use mbox_unlock to acquire the lock, to guarantee that the mailbox is cleared.
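The two general methods amount to a choice based on the current mailbox state. A rough sketch of that choice follows; the `MboxState` and `SanitizePlan` enums and the `plan_sanitize` function are hypothetical, and the state list is deliberately reduced to the states named in this thread plus a catch-all.

```rust
/// Reduced, hypothetical view of the mailbox state machine: only the two
/// states discussed in this issue are named, everything else is `Other`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum MboxState {
    ExecuteUc,
    ExecuteSoc,
    Other,
}

/// How firmware could go about clearing the mailbox SRAM.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SanitizePlan {
    /// uC already owns the mailbox contents: clear via direct-mode accesses.
    DirectMode,
    /// uC must first obtain the lock, possibly by using mbox_unlock.
    AcquireLockPossiblyForced,
}

/// Maps the current state to one of the two methods listed above.
pub fn plan_sanitize(state: MboxState) -> SanitizePlan {
    match state {
        MboxState::ExecuteUc => SanitizePlan::DirectMode,
        MboxState::ExecuteSoc | MboxState::Other => SanitizePlan::AcquireLockPossiblyForced,
    }
}
```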

CASE: ECC error while uC reads dataout

SoC-initiated mailbox command
ECC double-bit error while uC is reading dataout
Action: uC should check error status before writing mbox_status. If ECC error, uC first sanitizes mailbox, then sets CMD_FAILURE. In state mbox_execute_uc, uC is able to sanitize the mailbox using direct mode accesses.
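The ordering in this case (check error status, sanitize, then report failure) can be sketched against a small software model. `MailboxModel`, `MboxStatus`, and `finish_command` are hypothetical names for illustration only; the `ecc_uncorrectable` flag stands in for whatever error status the firmware actually reads, and the `Vec<u32>` stands in for the mailbox SRAM.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum MboxStatus {
    CmdBusy,
    CmdComplete,
    CmdFailure,
}

/// Hypothetical software model of the mailbox as seen by uC while the
/// state is mbox_execute_uc. Not the real register interface.
pub struct MailboxModel {
    pub sram: Vec<u32>,
    /// Stand-in for the uncorrectable (double-bit) ECC error status.
    pub ecc_uncorrectable: bool,
    pub status: MboxStatus,
}

impl MailboxModel {
    /// Zeroizes every dword; models direct-mode writes over the whole SRAM.
    pub fn sanitize(&mut self) {
        for dword in self.sram.iter_mut() {
            *dword = 0;
        }
    }

    /// Finishes a SoC-initiated command. The error status is checked BEFORE
    /// the status write, and the sanitize happens before CMD_FAILURE is set.
    pub fn finish_command(&mut self) {
        if self.ecc_uncorrectable {
            self.sanitize();
            self.status = MboxStatus::CmdFailure;
        } else {
            self.status = MboxStatus::CmdComplete;
        }
    }
}
```

Sanitizing before the status write matters in this model because the status write is the point at which uC stops owning the mailbox contents.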

CASE: ECC error while SoC reads dataout (uC-initiated cmd)

uC initiates a mailbox command to the SoC for, e.g., a CSR.
SoC encounters ECC uncorrectable error while reading out data.
SoC should respond with CMD_FAILURE.
Action: uC can sanitize mailbox before clearing lock

CASE: ECC error while SoC reads dataout (uC response to SoC-initiated cmd)

SoC initiates command
uC reads dataout -> no ECC error
uC writes datain (response to cmd)
uC sets mbox_status = DATA_READY
SoC reads dataout -> ECC error occurs
state is mbox_execute_soc, so uC is unable to sanitize mailbox
SoC clears lock (by writing mbox_execute = 0)
Action: uC should acquire lock, then sanitize mailbox. uC may fail to acquire lock if SoC wins lock for a subsequent command. uC should use mbox_unlock to forcibly unlock the mailbox and then gain the lock.
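A sketch of this last action, again against a hypothetical model: uC tries the lock once, falls back to a forced unlock if the SoC already holds it, and only then zeroizes. `Owner`, `MailboxModel`, `try_lock_uc`, `force_unlock`, and `sanitize_after_soc_read_error` are invented for the example; `force_unlock` merely models the effect of mbox_unlock. Trying the normal lock first is one reading of the action above, not a confirmed requirement.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Owner {
    Free,
    Soc,
    Uc,
}

/// Hypothetical model of lock ownership plus SRAM contents.
pub struct MailboxModel {
    pub owner: Owner,
    pub sram: Vec<u32>,
}

impl MailboxModel {
    /// uC takes the lock only if nobody holds it.
    pub fn try_lock_uc(&mut self) -> bool {
        if self.owner == Owner::Free {
            self.owner = Owner::Uc;
            true
        } else {
            false
        }
    }

    /// Models the effect of mbox_unlock: the lock is released regardless of owner.
    pub fn force_unlock(&mut self) {
        self.owner = Owner::Free;
    }
}

/// Returns true once the SRAM has been zeroized and the lock released.
pub fn sanitize_after_soc_read_error(mbox: &mut MailboxModel) -> bool {
    if !mbox.try_lock_uc() {
        // The SoC won the lock for a subsequent command: force it open.
        mbox.force_unlock();
        if !mbox.try_lock_uc() {
            // Cannot happen in this model; on real hardware another agent
            // could in principle race in between (assumption).
            return false;
        }
    }
    for dword in mbox.sram.iter_mut() {
        *dword = 0;
    }
    mbox.owner = Owner::Free; // release the lock after sanitizing
    true
}
```

Forcing the unlock aborts whatever command the SoC had just started, so in this sketch the SoC is expected to retry that command.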

calebofearth mentioned this issue May 15, 2024

[UVM] Update validation firmware to force unlock mailbox when sanitizing chipsalliance/caliptra-rtl#514

Merged
