CallstackIsntr: Fix callback types #1447

AndrewFasano · 2024-02-19T20:51:00Z

#1445 incorrectly change the callback types without changing how they were registered. Thanks to @LauraLMann and @MarkMankins for pointing this out! I'm slightly terrified that this sort of mistake doesn't cause a compile error (and that all the CI tests still passed)!

I reverted that PR on dev. This PR now includes both those changes + the fix so we can discuss the issue @MarkMankins has found when the intended change is applied.

MarkMankins · 2024-02-19T21:37:55Z

This change combined with #1445 introduces an inconsistency in the output of callstack_instr. Running the same recording multiple times can result in different output from callstack_instr...here's an example of what I'm seeing:

$ diff win2k_ssbranch_string_matches.txt win2k_ssbranch_string_matches_expected.txt                                                       1,5c1,5
< 77fa15db 80465691 7c4e87f5 004013e2 004016be 78023734 7800f47c 7800f47c 7800f086 78012e72 78012cfa (asid=0x0e32e000)  1
< 77fa15db 80465691 7c4e87f5 004013e2 004016be 78023734 7800f47c 7800f47c 7800f086 78012e72 78012d0b (asid=0x0e32e000)  1
< 77fa15db 80465691 7c4e87f5 004013e2 004016be 78023734 7800f47c 7800f47c 7800f086 78012e72 78012d3d 7c4fc311 7c4fc415 7c4fc394 (asid=0x0e32e000)  2
< 77fa15db 80465691 7c4e87f5 004013e2 004016be 78023734 7800f47c 7800f47c 7800f086 78012e72 78012d3d 7c4fc311 7c4fc415 7c4fc3bc 77f891d2 80465691 8049d72e 80402408 (asid=0x00000000)  2
< 77fa15db 80465691 7c4e87f5 004013e2 004016be 78023734 7800f47c 7800f47c 7800f086 78012e72 78012d3d 7c4fc311 7c4fc415 7c4fc3bc 77f891d2 80465691 8049d947 80402408 (asid=0x00000000)  2
---
> 77fa15db 80465691 7c4e87f5 004013e2 004016be 78023734 7800f47c 7800f086 78012e72 78012cfa (asid=0x0e32e000)  1
> 77fa15db 80465691 7c4e87f5 004013e2 004016be 78023734 7800f47c 7800f086 78012e72 78012d0b (asid=0x0e32e000)  1
> 77fa15db 80465691 7c4e87f5 004013e2 004016be 78023734 7800f47c 7800f086 78012e72 78012d3d 7c4fc311 7c4fc415 7c4fc394 (asid=0x0e32e000)  2
> 77fa15db 80465691 7c4e87f5 004013e2 004016be 78023734 7800f47c 7800f086 78012e72 78012d3d 7c4fc311 7c4fc415 7c4fc3bc 77f891d2 80465691 8049d72e 80402408 (asid=0x00000000)  2
> 77fa15db 80465691 7c4e87f5 004013e2 004016be 78023734 7800f47c 7800f086 78012e72 78012d3d 7c4fc311 7c4fc415 7c4fc3bc 77f891d2 80465691 8049d947 80402408 (asid=0x00000000)  2

AndrewFasano · 2024-02-19T21:40:39Z

Oh no, I was expecting a difference with output from callstack_instr before/after these changes, but you're saying it's just producing varying output just running repeatedly with these changes?

MarkMankins · 2024-02-19T21:43:06Z

Oh no, I was expecting a difference with output from callstack_instr before/after these changes, but you're saying it's just producing varying output just running repeatedly with these changes?

Correct - different consistent output would not have surprised me - but I'm seeing output that varies between different runs of panda replaying the same recording file.

AndrewFasano · 2024-02-19T21:47:30Z

Yikes, that's no good. I'm going to revert #1445 on dev, and rebase all the commits into this branch/PR to keep this self contained.

@lacraig2 you're the SBE/EBE callback expert, do you have any idea why this might occur? Do EBE callbacks get triggered if a block's execution is interrupted?

Commit c575951 without #1447 was incomplete. And with #1447 it's broken.

lacraig2 · 2024-02-20T02:31:37Z

@AndrewFasano EBE callbacks do not occur when an exit is requested. SBE/EBE are injected into the TCG stream and only occur when execution reaches that point.

I am skeptical that this is the reason that we're experiencing an issue here. The recordings I've taken have tens of millions of instructions and tend to have tens of exits.

I've done some analysis:

#!/usr/bin/env python3
from sys import argv
from pandare import Panda

# Single arg of arch, defaults to i386
arch = "arm" if len(argv) <= 1 else argv[1]
panda = Panda(generic=arch)
from os.path import exists

recording_name = "rec"+arch
if not exists(f"{recording_name}-rr-snp"):
    print("taking recording")
    @panda.queue_blocking
    def do_stuff():
        panda.revert_sync("root")
        panda.run_monitor_cmd(f"begin_record {recording_name}")
        print(panda.run_serial_cmd("cat /etc/passwd && sleep 2 && cat /etc/passwd && ls -la"))
        panda.run_monitor_cmd("end_record")
        panda.stop_run()
    panda.run()
else:
    print("recording exists. not remaking recording")


bbe = 0
@panda.cb_before_block_exec
def bbed(cpu, tb):
    global bbe
    bbe += 1

abe = 0
num_exit = 0

@panda.cb_after_block_exec
def abed(cpu, tb, ec):
    global num_exit,num_exit_tb_exit
    if ec > 1:
        num_exit += 1
        
    global abe
    abe += 1

sbe = 0
@panda.cb_start_block_exec
def sbed(cpu, tb):
    global sbe
    sbe += 1

ebe = 0
@panda.cb_end_block_exec
def ebed(cpu, tb):
    global ebe
    ebe += 1

panda.enable_tb_chaining()
panda.run_replay(recording_name)

print(f"{bbe=}, {sbe=}")
print(f"Number missing: {bbe-sbe=}")

print(f"{abe=}, {ebe=}")
print(f"Number missing: {abe-ebe=}")

print(f"Starts vs ends: {sbe-ebe=}, {bbe-abe=}")



print(f"{num_exit=} ")

Which produces results like this:

bbe=3128690, sbe=3128678
Number missing: bbe-sbe=12
abe=3126280, ebe=3126280
Number missing: abe-ebe=0
Starts vs ends: sbe-ebe=2398, bbe-abe=2410
num_exit=12

To me this indicates that the block is checking for the exit status ahead of start_block_exec. The fact that this only occurs 12 times in the entire recording makes me skeptical that this is indeed the issue.

This does appear to be something you can check because even if the block is exited early our after_block_exec callback occurs. This means you could simulate this particular condition by enabling after_block_exec as well and only examining the output if the exit condition is greater than 1.

AndrewFasano · 2024-02-20T02:58:20Z

Thanks Luke, that's helpful info. I'm glad the EBE callback isn't called when a block is interrupted, that's a nice improvement over ABE!

I can try building some more tests for this PR later, but I'm not sure how this PR could (in itself) cause the non-deterministic behavior unless something it's built on is more fundamentally broken 😨

The old logic would miss many calls if tb_chaining was enabled (which is is by default). Since this plugin did not disable tb_chaining, many calls would be missed if a user didn't disable chaining or load another plugin that disabled chaining. This commit updates the plugin to use start_block_exec and end_block_exec which work even with tb_chaining enabled.

MarkMankins · 2024-03-06T18:12:29Z

I tested with the latest code and I'm still seeing issues where callstack_instr outputs different results for the same replay.

AndrewFasano · 2024-03-06T19:37:44Z

I haven't forgotten about this, just haven't had time to track down the issue. Things are working correctly on the main dev branch, right?

MarkMankins · 2024-03-06T19:38:48Z

Yes, no issues on dev.

MarkMankins · 2024-03-06T19:40:38Z

Oh - sorry - didn't realize the recent push was just a rebase - I thought you made some changes - ignore me. :)

lacraig2 · 2024-03-06T23:52:20Z

I think I know what is going wrong here.

It looks like our implementation of start_block_exec runs prior to the first instruction, but not prior to the initial TCG block elements.

In this case those initial TCG checks mean that if an exit event has occurred we would generate a before_block_exec, but not a start_block_exec.

These are fairly rare, but over the course of millions of blocks it will occur.

I had started testing out a change to start_block_exec that adds functionality earlier in the block, but I ran into some issues with LLVM generation failing. I'll likely follow up with a PR.

AndrewFasano · 2024-03-07T08:45:28Z

@lacraig2 so if that's the issue it would mean the SBE callback is sometimes (non-deterministically) triggered once when it should be triggered twice? Are we going to have to add an exitCode to the EBE callback to indicate if a block was ended early and then just keep the existing callstack instr logic for filtering out double calls.

lacraig2 · 2024-03-12T00:30:51Z

I think I've found the issue. It seems like SBE should check panda_exit_loop.

I've implemented this as a separate PR #1471, but I have cherry picked it onto this branch so that @MarkMankins can check it with his tests. I have already confirmed that this fixes the issues for the file you sent me.

Once we confirm that this works I'd like to revert this commit and merge #1471 and then this PR.

AndrewFasano enabled auto-merge (rebase) February 19, 2024 20:51

AndrewFasano mentioned this pull request Feb 19, 2024

CallstackInstr: switch to SBE/EBE callbacks #1445

Merged

AndrewFasano disabled auto-merge February 19, 2024 21:40

AndrewFasano added a commit that referenced this pull request Feb 19, 2024

Revert "CallstackInstr: switch to SBE/EBE callbacks"

e29f574

Commit c575951 without #1447 was incomplete. And with #1447 it's broken.

AndrewFasano force-pushed the callstack_instr_fix2 branch from 89b4117 to ad4487d Compare February 19, 2024 21:49

AndrewFasano force-pushed the callstack_instr_fix2 branch from ac4e71f to d152c22 Compare March 5, 2024 22:45

lacraig2 mentioned this pull request Mar 11, 2024

Fix start_block_exec nondeterminism #1471

Open

fix sbe nondeterminism

17dda30

AndrewFasano mentioned this pull request Apr 1, 2024

Callstack_instr: disable TB chaining to improve accuracy #1485

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

CallstackIsntr: Fix callback types #1447

CallstackIsntr: Fix callback types #1447

AndrewFasano commented Feb 19, 2024 •

edited

MarkMankins commented Feb 19, 2024 •

edited

AndrewFasano commented Feb 19, 2024

MarkMankins commented Feb 19, 2024

AndrewFasano commented Feb 19, 2024

lacraig2 commented Feb 20, 2024

AndrewFasano commented Feb 20, 2024

MarkMankins commented Mar 6, 2024

AndrewFasano commented Mar 6, 2024

MarkMankins commented Mar 6, 2024

MarkMankins commented Mar 6, 2024

lacraig2 commented Mar 6, 2024

AndrewFasano commented Mar 7, 2024

lacraig2 commented Mar 12, 2024

CallstackIsntr: Fix callback types #1447

Are you sure you want to change the base?

CallstackIsntr: Fix callback types #1447

Conversation

AndrewFasano commented Feb 19, 2024 • edited

MarkMankins commented Feb 19, 2024 • edited

AndrewFasano commented Feb 19, 2024

MarkMankins commented Feb 19, 2024

AndrewFasano commented Feb 19, 2024

lacraig2 commented Feb 20, 2024

AndrewFasano commented Feb 20, 2024

MarkMankins commented Mar 6, 2024

AndrewFasano commented Mar 6, 2024

MarkMankins commented Mar 6, 2024

MarkMankins commented Mar 6, 2024

lacraig2 commented Mar 6, 2024

AndrewFasano commented Mar 7, 2024

lacraig2 commented Mar 12, 2024

AndrewFasano commented Feb 19, 2024 •

edited

MarkMankins commented Feb 19, 2024 •

edited