Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

eth_call incorrectly reconciling deferred (Filecoin) and immediate (Ethereum) execution #11205

Open
raulk opened this issue Aug 24, 2023 · 13 comments · May be fixed by #11905
Open

eth_call incorrectly reconciling deferred (Filecoin) and immediate (Ethereum) execution #11205

raulk opened this issue Aug 24, 2023 · 13 comments · May be fixed by #11905
Assignees
Labels

Comments

@raulk
Copy link
Member

raulk commented Aug 24, 2023

Context: cerc-io/stack-orchestrator#509 (comment)

Ethereum clients and libraries expect that calling eth_call with a given epoch (e.g. 1234) will apply the call on the output state of the supplied epoch. In Ethereum, the output is computed immediately and the root hash is part of the block header of height 1234. In Filecoin, deferred execution causes the output of epoch 1234 to be calculated on the next non-null round, normally 1235 (but not necessarily).

The logic here is flawed.

  • The correct behaviour is to apply the call on the parent state root of the next tipset (execution tipset).
    • If this supplied epoch is the "head" epoch, we should error out, thus preserving consistency with the rest of the Eth module (ie. the tipset has still not been executed and is not "exposed" to Eth endpoint clients).
  • We should not be doing any trickery with individual messages (e.g. finding prior messages, filtering, etc.) down the line, at all (see callees of this function).
  • The LOTUS_SKIP_APPLY_TS_MESSAGE_CALL_WITH_GAS env var introduced in perf: eth: gas estimate set applyTsMessages false #10546 must go.
    • This causes inconsistent behaviour across RPC endpoints in the ecosystem depending on how this env var is configured.
    • When enabled, it deliberately fails to meet the required functionality / expectations of eth_call.
    • Due to a chain of circumstances, all new RPC endpoints being set up in the Filecoin ecosystem have this flag enabled.
    • Therefore, all new endpoints are failing to meet expectations.
@Stebalien
Copy link
Member

Stebalien commented Aug 24, 2023

Yeah, this logic is definitely wrong.

@Stebalien
Copy link
Member

Ah, ok, I remember why we did this.

The old logic was "as correct as we could make it", at least without significant work. The issue is that:

  1. If we execute on the "final" parent state tree, we need to execute at epoch+1. Otherwise, we'll execute at an epoch where cron has already run (which means all the miner actors will be in an unexpected state which could cause issues estimating window post gas).
  2. To execute at epoch+1, we'd need a tipset at epoch+1. Otherwise, we'll have issues fetching tipset CIDs and randomness (this was causing Eth transactions to fail).

So, we went with the option of applying all messages in the specified epoch, then applying the user's message at the specified epoch.

The alternative would be to make some form of "fake" tipset and/or hack the lookback logic to make everything work even when we're executing at the wrong tipset. Which.. it sounds like we need to do.

One remaining question is: does eth_call invoke the passed message at the specified epoch, or at epoch+1?

@Stebalien
Copy link
Member

Ok, this shouldn't be difficult to fix in the ethereum case.

@raulk
Copy link
Member Author

raulk commented Aug 24, 2023

Thanks for the extra context!

If we execute on the "final" parent state tree, we need to execute at epoch+1. Otherwise, we'll execute at an epoch where cron has already run (which means all the miner actors will be in an unexpected state which could cause issues estimating window post gas).

I see. Yes, this could lead to inconsistency, e.g. if the eth_call ends up accessing some miner or market state that has changed with the execution of cron. But we might have to bite the bullet and put a warning label on this API operation.

The alternative would be to make some form of "fake" tipset and/or hack the lookback logic to make everything work even when we're executing at the wrong tipset. Which.. it sounds like we need to do.

Are we able to achieve something like this without increasing the computational and IO cost of eth_call, though? This call is commonplace, supposed to be light, and the expected latency is low.

One remaining question is: does eth_call invoke the passed message at the specified epoch, or at epoch+1?

In Ethereum, eth_call requests are executed on top of the resulting state root of the block at the specified height, and the visible height from within EVM bytecode is the same height.

@Stebalien
Copy link
Member

So, an issue here is:

If this supplied epoch is the "head" epoch, we should error out, thus preserving consistency with the rest of the Eth module (ie. the tipset has still not been executed and is not "exposed" to Eth endpoint clients).

We currently "support" pending so we'll need to be careful not to break users.

One remaining question is: does eth_call invoke the passed message at the specified epoch, or at epoch+1?

In Ethereum, eth_call requests are executed on top of the resulting state root of the block at the specified height, and the visible height from within EVM bytecode is the same height.

Are you sure because my research indicates otherwise (although I haven't actually tested it).

@Stebalien
Copy link
Member

Here's what I currently have: #10462

But see the question at the end.

@raulk
Copy link
Member Author

raulk commented Aug 24, 2023

Are you sure because my research indicates otherwise (although I haven't actually tested it).

Yep. Here's a reference: https://github.com/ethereum/go-ethereum/blob/1a2135044cd498039be46e1b611627665ff6c4bc/internal/ethapi/api.go#L1109. This loads the state with the root specified in the block header at the requested height, which in Ethereum is the post root state. And the height fed into the EVM is the supplied height.

@Stebalien
Copy link
Member

Great. That's what I get for trusting stack overflow...

So.... we have two options:

  1. Tell people not to set that env var. That does, in fact, do the right thing from the perspective of Ethereum by applying all messages in the tipset, followed by the specified message. But it's slow.
  2. Just.... apply the message after cron but "fake" the block number? But that may mess some stuff up in miner actors with respect to deadlines. Maybe that doesn't matter for ethereum?
  3. Do 1, but cache it.

3 is, disgustingly, the most correct option at least as far as ethereum is concerned, but I really don't want to do that as it'll be an ETH JSON-RPC API specific feature.

@AFDudley
Copy link

We should discuss further in slack, but we are currently working on a filecoin grant to add the indexing required to support this sort of usage.
vulcanize/filecoin-indexing#4

@aarshkshah1992
Copy link
Contributor

So here's what I think the problem is:

  • According to the specifications for the Ethereum RPC API, the eth_call API is designed to execute the user's transaction against the state as it exists following the completion of the user-specified block. This implies that the state used to apply the users transaction must be the state that comes after executing all transactions contained in the block

  • Now, of course, Filecoin has deferred execution, but we still need to respect the eth_call RPC behaviour. So, here's what we do in Filecoin:

    • In the current implementation, Filecoin's eth_call API takes the parent state root of the user specified block
    • It then applies all the transactions in the user specified block on top of the parent state root to get the new state at the end of the block and then applies the user's ethcall transaction on top of it -> This gets us ETH like behaviour
    • Note 1: The Cron Actor in Filecoin is executed at the end of every epoch and the state is updated accordingly. The subsequent ParentStateRoot will reflect the result of applying the cron actor actions. But in the case of eth_call, the transaction will be applied to the end state without applying any Cron actor action. But this is fine.
    • Note2: that the LOTUS_SKIP_APPLY_TS_MESSAGE_CALL_WITH_GAS flag here can play a spoil sport for obvious reasons but this is not hard to fix. We can just ignore this flag for the eth_call API
    • All of the above works fine and as expected (please correct me if I am wrong). The real problem here is that this is slow as we're re-applying each transaction in the block
  • A better way of doing this would have been to take the parent state root of the tip set following the eth_call epoch.
    That state would contain the result of all the transactions in the user specified epoch and so we wouldn't have to re-execute any transactions

    • Note: There would be a semantic change here as this state will ALSO contain the result of running the cron Actor (is this bad ? Are we okay running the eth call on top of the state that comes after the cron actor is run as long as we can guarantee that we're running it at the immediate next block after the user specified block ?)
  • But now the problem here is that if there are single/multiple null tipsets after the user specified epoch, then the state on which the eth_call transaction will be applied will also contain the result of the execution of the cron actor on all those null tipsets because cron actor is executed every epoch and even for null tipsets. This means that the state on which the eth_call transaction is applied is not exactly the state at the end of the user specified epoch. That is, the final state on which the eth_call transactions is applied could contain the (state from executing the transactions in the user specified block + changes from multiple cron actor invocations at multiple null tipsets in multiple epochs) -> This is bad because it breaks ETH behaviour and could make the state hard for the user to reason about.

Here's what the solution can be

  • First of all -> ignore the LOTUS_SKIP_APPLY_TS_MESSAGE_CALL_WITH_GAS flag in the eth_call API
  • Do nothing else -> eth_call behaviour remains correct BUT slow
  • Apply the eth_callon top of the ParentStateRoot of the first non-null tipset following the user specified block -> Ship the API with a warning about cron actor state shenanigans but this should be fine as I am not sure how much the ETH API/ETH transactions care about cron actor state correctness

@raulk
Copy link
Member Author

raulk commented Apr 16, 2024

Hey @aarshkshah1992 -- this is definitely a rough edge, and it's currently coming back to haunt us at IPC 👻 (cc @aakoshh @snissn @cryptoAtwill).

We're not necessarily concerned about the cron actor state per-se. In fact, the cron actor state only contains system-defined cron entries, and therefore can be deemed static from a practical viewpoint.

The problem is with all the remaining built-in actors state objects that are modified as a result of cron ticks either directly or indirectly (e.g. miner actor, storage power actor, etc.)

Earlier in this discussion I suggested to call it a day and slap a warning on eth_call, but since then I have acquired more understanding on how certain Filecoin apps like lending pools are designed, and I now suspect this can be dangerous.

To illustrate the scenario, imagine you are lending pool operator with a backend that monitors the chain for accounting or decision-making via a combination of events and eth_calls (e.g. something very common when using The Graph). This case is pretty extreme, but it exposes the correctness flaw:

  1. You are happily polling some smart contracts by calling eth_call on a contract function on every height. That contract function reports the risk level at a particular height. It does so by monitoring a bunch of facts about the miner, e.g. including its RBP on the power actor.
  2. The chain suffers a sequence of null rounds at heights 1000-1050, but you are not aware of them because you're a poor Eth client and the Eth chain has no such concept.
  3. During this blackout of null rounds, the SP had failed to PoSt and some sectors got terminated, in the cron tick at height 1025. This is enough to flip the risk level as perceived by the lending pool from "none" to "high".
  4. If we fix this by performing the eth_call on the ParentStateRoot of the first non-null round (height 1051), we will mislead the application into thinking that the miner reached the "high" risk level at epoch 1000, when it reality it reached it at epoch 1025.

Instead, I propose the following solution:

  1. Acknowledge that null rounds are a rare event, at least in mainnet. In networks where they're more frequent, they also have lower transaction volumes (e.g. Calibrationnet).
  2. When a null round is detected, fall back to the old "apply ts messages" technique. Since null rounds are rare, the fallback will be applied sparingly and we don't expect it to be problematic. We could potentially cache these roots, but probably not worth it.
  3. For everything else, use the successor tipset ParentStateRoot.

@Stebalien
Copy link
Member

Stebalien commented Apr 17, 2024

So... I think there's a much bigger problem: invisible reorgs. I can call eth_call on the same tipset multiple times but get different results depending on whether or not the tipset is followed by a null round. I.e.:

  1. I have synced: A <- B.
  2. I call eth_call, and see the result of applying (msgs(a)..., cron(a), my_message).
  3. I reorg to: A <- Null <- B.
  4. I call eth_call, and see the result of applying (msgs(a)..., my_message).

This is an issue because:

  1. The tipset (and eth block hash) hasn't changed.
  2. Balances can change because cron may have fined some Miner actors.

No solution involving post-cron state will fix this, only "apply all the messages" is correct. I think we may need a lot of caching (and ideally an on-disk cache).

@Stebalien
Copy link
Member

Ah, nvm. We're all making some incorrect assumptions about cron. The ParentStateRoot does not include cron catch up. When we execute a tipset:

  1. We execute cron catch up for null rounds.
  2. Execute the tipset (current epoch)
  3. Execute one cron round (current epoch).

So:

So... I think there's a much bigger problem: invisible reorgs. I can call eth_call on the same tipset multiple times but get different results depending on whether or not the tipset is followed by a null round. I.e.:

This isn't an issue because ParentStateRoot does not depend on the number of trailing null tipsets.

If we fix this by performing the eth_call on the ParentStateRoot of the first non-null round (height 1051), we will mislead the application into thinking that the miner reached the "high" risk level at epoch 1000, when it reality it reached it at epoch 1025.

We'll blame epoch 1051 which is, IMO, correct.


I think the answer is to always use the next block's ParentStateRoot and just accept that eth_call will apply messages after cron. This won't be entirely accurate because the call might have produced different results if we had executed it as a part of the block but... given how users actually use eth_call, we should be fine.

raulk added a commit to consensus-shipyard/ipc that referenced this issue Apr 19, 2024
Fendermint was pulling parent effects only until committed_finality-1.

This choice assumed that chains like Filecoin leaked deferred execution
mechanics through the query mechanism we use (Eth JSON-RPC API). Therefore
it was deemed that the safe choice was to lag behind by one epoch. However,
we can safely assume that Eth JSON-RPC APIs will only expose blocks that
have been executed and whose side-effects are available (unless "pending"
is requested). Filecoin reconciles its deferred execution model with
Eth client expectations of immediate execution, alas with some known rough
edges [1].

Furthermore, this choice assumed that our chain_head_delay can be 0, i.e.
we can consume the exposed HEAD of the parent in real time. In practice,
that's not a good idea, ever.

All in all, this is problematic because it's a hidden. Users inspecting
the committed parent finality and seeing N now need to understand that
events emitted in N are (counterintuitively) not imported yet, and they
need to wait longer.

This PR:
- Rectifies the queried range so that we import up to, and including,
  the committed parent finality.
- Disallows a zero chain_head_delay.

Note: I'm clarifying with @aakoshh the mechanics of Fendermint's Eth
JSON-RPC API; concretely, if it renconciles deferred execution or not.
There are some off-by-one shifts assumed in a number of places, so
unclear. However, by disallowing a chain_head_delay of 0, we are able
to work with APIs that expose heights before their results are available.

Note 2: when we adopt light clients to watch the parent, some assumptions
may change.

[1] filecoin-project/lotus#11205
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
4 participants