cpu: Abort attempt to rename a pinned register #1133

vsoriap · 2024-05-14T15:14:48Z

Description of the bug
Running simple SVE code with gather operations, I get the following message:

gem5.opt: build/ARM/cpu/o3/rename_map.cc:85: gem5::o3::SimpleRenameMap::RenameInfo gem5::o3::SimpleRenameMap::rename(const gem5::RegId&): Assertion `arch_reg.getNumPinnedWrites() == 0' failed.
Program aborted at tick 257229480756

Digging in the bug, I've generated a simple Exec trace. What I've observed is that after executing the instruction ld1 {z21}, p0/z, [x4, z32], the kernel launches a page fault. After the page fault, the program recovers and tries to execute the affected instruction again. This second time, gem5 arises the above abort.

Explanation of the Bug

When Gather SVE instructions arrive to Rename, they use pinned registers to avoid renaming the same register for each muop. This code can be found here.

SimpleRenameMap::rename(const RegId& arch_reg)
{
    PhysRegIdPtr renamed_reg;
    // Record the current physical register that is renamed to the
    // requested architected register.
    PhysRegIdPtr prev_reg = map[arch_reg.index()];

    if (arch_reg.is(InvalidRegClass)) {
        assert(prev_reg->is(InvalidRegClass));
        renamed_reg = prev_reg;
    } else if (prev_reg->getNumPinnedWrites() > 0) {
        // Do not rename if the register is pinned
        assert(arch_reg.getNumPinnedWrites() == 0);  // Prevent pinning the
                                                     // same register twice
        DPRINTF(Rename, "Renaming pinned reg, numPinnedWrites %d\n",
                prev_reg->getNumPinnedWrites());
        renamed_reg = prev_reg;
        renamed_reg->decrNumPinnedWrites();
    } else {
        renamed_reg = freeList->getReg();
        map[arch_reg.index()] = renamed_reg;
        renamed_reg->setNumPinnedWrites(arch_reg.getNumPinnedWrites());
        renamed_reg->setNumPinnedWritesToComplete(
            arch_reg.getNumPinnedWrites() + 1);
    }

I've checked in the Rename trace to see that the register was correctly renamed for the first time. It executed for the 16 muops that compose the vector instruction, decrementing the NumPinnedWrites from 16 to 0. However, it raises a page fault when it executes the 11th muop. After the exception, the remaining muops are squashed from the pipeline. Then gem5 executes the following code:

void
DynInst::setSquashed()
{
    status.set(Squashed);

    if (!isPinnedRegsRenamed() || isPinnedRegsSquashDone())
        return;

    // This inst has been renamed already so it may go through rename
    // again (e.g. if the squash is due to memory access order violation).
    // Reset the write counters for all pinned destination register to ensure
    // that they are in a consistent state for a possible re-rename. This also
    // ensures that dest regs will be pinned to the same phys register if
    // re-rename happens.
    for (int idx = 0; idx < numDestRegs(); idx++) {
        PhysRegIdPtr phys_dest_reg = renamedDestIdx(idx);
        if (phys_dest_reg->isPinned()) {
            phys_dest_reg->incrNumPinnedWrites();
            if (isPinnedRegsWritten())
                phys_dest_reg->incrNumPinnedWritesToComplete();
        }
    }
    setPinnedRegsSquashDone();
}

This code increases the number of numPinnedWrites to 5. Expecting that the instruction would be executed from the failing muop.

The second time the SVE gather is executed, the gem5 execution gets in the second case (else if getNumPinnedWrites() > 0) rather than getting inside the final else. Since the first muop has the arch_reg.getNumPinnedWrites() set to 16, an abort arises.

Affects version
Affected in gem5 v22.0.0, but same code exists in gem5 v23.0.0

gem5 Modifications
No modification

To Reproduce
Steps to reproduce the behavior. Please assume starting from a clean repository:

Compile gem5 with ARM ISA
Execute the simulation with fs mode and run an SPMV code vectorized with SVE and a vector length of 2048 bits.

Host Operating System
Ubuntu 22.04

Host ISA
X86

Compiler used
GCC-10

The text was updated successfully, but these errors were encountered:

giactra · 2024-05-20T13:32:15Z

I need to look at this more carefully.
An initial solution (a bit of a hack), would be to handle squashing due to faulting differently from simple re-execution (in the first case we flush the pipeline completely). For example in:

https://github.com/gem5/gem5/blob/stable/src/cpu/o3/dyn_inst.cc#L405

We could add the register pinning handling code. Rather than incrementing the number of pinned writes with

phys_dest_reg->incrNumPinnedWrites()

we could reset it completely. Something like:

for (int idx = 0; idx < numDestRegs(); idx++) {
    PhysRegIdPtr phys_dest_reg = renamedDestIdx(idx);
    if (phys_dest_reg->isPinned()) {
        phys_dest_reg->resetPinnedWrites();
    }
}
setPinnedRegsSquashDone();

Let me know if this works

Note: I might be missing something in the snippet above

vsoriap added the bug label May 14, 2024

giactra self-assigned this May 20, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

cpu: Abort attempt to rename a pinned register #1133

cpu: Abort attempt to rename a pinned register #1133

vsoriap commented May 14, 2024 •

edited

giactra commented May 20, 2024

cpu: Abort attempt to rename a pinned register #1133

cpu: Abort attempt to rename a pinned register #1133

Comments

vsoriap commented May 14, 2024 • edited

giactra commented May 20, 2024

vsoriap commented May 14, 2024 •

edited