
Remove provider checks at startup #5337

Open · mangas wants to merge 16 commits into master from filipe/remove-start-provider-checks
Conversation


@mangas mangas commented Apr 11, 2024

Fixes #3937

  • New chains are now created with ChainIdentifier::default() on start
  • Adapter genesis + net_version are now checked by the ProviderManager (see the sketch at the end of this description)
  • Adapter checks are temporarily disabled after an intermittent failure
  • An adapter won't be returned by the ProviderManager if the identifier it reports does not match the expected value
  • The BlockIngestor now takes a ChainClient and tries to get an adapter on each poll, so it won't stop if the first poll fails
  • node/src/main.rs refactored so that most of the init code is automatically executed by the new Networks type
  • graphman run now uses the same init code as the node and should be able to run against any chain
  • EthereumNetworks type removed
  • FirehoseNetworks type removed
  • The chain identifier is updated on the first check when the chain is newly created (i.e. still has the default identifier)

Future improvements:

  • Cache genesis once it's set
  • Cache chain ident once it's set
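
For illustration, a minimal, self-contained sketch of the lazy-validation idea described above. All names here (Adapter, Status, the shape of ProviderManager::get_all) are simplified stand-ins, not graph-node's actual API:

```rust
// Simplified stand-ins for illustration only; not the real graph-node types.
#[derive(Clone, Debug, Default, PartialEq, Eq)]
struct ChainIdentifier {
    net_version: String,
    genesis_block_hash: String,
}

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Status {
    Unchecked,
    Valid,
    Invalid,
}

struct Adapter {
    provider: String,
    // In the real code this would come from an RPC/Firehose call at check time.
    reported_ident: ChainIdentifier,
}

struct ProviderManager {
    expected: ChainIdentifier,
    adapters: Vec<(Adapter, Status)>,
}

impl ProviderManager {
    fn new(expected: ChainIdentifier, adapters: Vec<Adapter>) -> Self {
        let adapters = adapters.into_iter().map(|a| (a, Status::Unchecked)).collect();
        ProviderManager { expected, adapters }
    }

    /// Validate lazily on first use instead of at startup, and only hand out
    /// adapters whose reported identifier matches the expected one.
    fn get_all(&mut self) -> Vec<&Adapter> {
        for (adapter, status) in self.adapters.iter_mut() {
            if *status == Status::Unchecked {
                *status = if adapter.reported_ident == self.expected {
                    Status::Valid
                } else {
                    Status::Invalid
                };
            }
        }
        self.adapters
            .iter()
            .filter(|(_, status)| *status == Status::Valid)
            .map(|(adapter, _)| adapter)
            .collect()
    }
}
```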

@mangas mangas marked this pull request as draft April 11, 2024 09:09
@mangas mangas changed the title Remove provider checks at startup Remove provider checks at startup [WIP] Apr 11, 2024
@mangas mangas force-pushed the filipe/remove-start-provider-checks branch 3 times, most recently from 4d95fd8 to 64963c5 on April 15, 2024 16:20
@mangas mangas force-pushed the filipe/remove-start-provider-checks branch 2 times, most recently from b486f8a to 1c43770 on April 22, 2024 13:23
@mangas mangas force-pushed the filipe/remove-start-provider-checks branch 8 times, most recently from 9ad4249 to d6c76f8 on May 13, 2024 10:41
@mangas mangas marked this pull request as ready for review May 13, 2024 15:11
@mangas mangas force-pushed the filipe/remove-start-provider-checks branch from f15c7dd to 002def6 on May 13, 2024 15:20
@mangas mangas changed the title Remove provider checks at startup [WIP] Remove provider checks at startup May 13, 2024
@mangas mangas force-pushed the filipe/remove-start-provider-checks branch 3 times, most recently from cdfca0f to c4886a3 on May 14, 2024 09:32
@mangas mangas requested a review from lutter May 14, 2024 09:35
@mangas mangas force-pushed the filipe/remove-start-provider-checks branch from c4886a3 to c9784a8 on May 14, 2024 15:11
@fordN fordN requested review from incrypto32 and zorancv and removed request for lutter May 14, 2024 15:56
@mangas mangas force-pushed the filipe/remove-start-provider-checks branch from c828c7d to d3a0e68 on May 16, 2024 10:30
@fordN fordN requested review from lutter and removed request for incrypto32 May 16, 2024 15:41
@mangas mangas force-pushed the filipe/remove-start-provider-checks branch from 808fd97 to ec0306a on May 17, 2024 14:38
@lutter lutter (Collaborator) left a comment:

Not a full review (it's Friday evening and I am running out of time) but I wanted to provide some feedback for the PR. It feels like this could be simplified quite a bit by restructuring how it works and changing code so that it can rely on basic assumptions like 'the store is there and we can talk to it'

To elaborate a little more on what I mean: my basic intuition is that whatever structs we use today to represent endpoints now become enums that roughly look like

enum Endpoint {
    NotReady(details),
    Ready(details),
    // .. maybe other states to help with bookkeeping
}

and methods on Endpoint that, for example, make a request might become fallible, i.e. go from fn foo(&self) -> Something to fn foo(&self) -> Result<Something, Error>, where foo then needs to handle the NotReady case by checking whether the endpoint looks ok, and either return an error if it doesn't or transition to the Ready case when it does. That's a little too simplistic since it involves mutation.

But most of these operations happen on a pool of endpoints, and the manager for that pool, like ProviderManager, can mediate the mutation. Instead of calling methods directly on the endpoint, call them on the manager, which can then call Endpoint.get_ready(self) -> Result<Self, Error> and hide the complexity of readiness checking and of going from not ready -> ready from the outside world. In that world, you'd call foo on the provider manager, maybe with some hints on how to select an appropriate endpoint, and it would succeed if it could find or make an endpoint ready, and fail if none of them are ready.
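
A rough, self-contained sketch of that shape. Endpoint, EndpointManager and check_endpoint are hypothetical stand-ins here, not the existing graph-node types:

```rust
use anyhow::{anyhow, Result};

enum Endpoint {
    NotReady { url: String },
    Ready { url: String },
    // .. maybe other states to help with bookkeeping
}

impl Endpoint {
    // Consume self and either transition to Ready or explain why it can't.
    fn get_ready(self) -> Result<Self> {
        match self {
            Endpoint::Ready { .. } => Ok(self),
            Endpoint::NotReady { url } => {
                if check_endpoint(&url) {
                    Ok(Endpoint::Ready { url })
                } else {
                    Err(anyhow!("endpoint {} is not ready", url))
                }
            }
        }
    }
}

struct EndpointManager {
    endpoints: Vec<Endpoint>,
}

impl EndpointManager {
    // Callers only ever talk to the manager; the NotReady -> Ready bookkeeping
    // stays hidden in here. For simplicity this sketch drops endpoints that fail
    // the readiness check; a real manager would keep them around for retries.
    fn foo(&mut self) -> Result<String> {
        self.endpoints = std::mem::take(&mut self.endpoints)
            .into_iter()
            .filter_map(|ep| ep.get_ready().ok())
            .collect();

        match self.endpoints.first() {
            Some(Endpoint::Ready { url }) => Ok(format!("made a request via {}", url)),
            _ => Err(anyhow!("no endpoint is ready")),
        }
    }
}

// Stand-in for the real genesis / net_version probe.
fn check_endpoint(_url: &str) -> bool {
    true
}
```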

manager: ProviderManager::default(),
call_only_adapters: vec![],
retest_percent: DEFAULT_ADAPTER_ERROR_RETEST_PERCENT,
}
}
}
Collaborator:

Why do we need all these default implementations? I think the original one was already a bad idea, and trying to keep up with that just forces us to have more and more basically meaningless default impls. I would always try to make sure that when a struct is constructed, it also has meaningful functionality.

Playing with the code a little, it seems that the associated Client type for Blockchain needs to implement Default because ChainClient::new_firehose needs a client to pass into ChainClient::new. I think this could all be avoided if new_firehose returned Result<Self, ..> and just barfed if the firehose_available check fails - that avoids a misconfiguration causing us to have a dummy rpc endpoint.
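
A sketch of the suggested shape, using simplified stand-ins for ChainClient and FirehoseEndpoints rather than the real definitions:

```rust
use anyhow::{anyhow, Result};

// Simplified stand-ins; not graph-node's real types.
struct FirehoseEndpoints(Vec<String>);

enum ChainClient {
    Firehose(FirehoseEndpoints),
    Rpc(String),
}

impl ChainClient {
    // Returning Result means a misconfigured firehose chain fails loudly at
    // construction time instead of silently falling back to a dummy RPC client,
    // so the associated Client type no longer needs a Default impl.
    fn new_firehose(endpoints: FirehoseEndpoints) -> Result<Self> {
        if endpoints.0.is_empty() {
            return Err(anyhow!("no firehose endpoints configured for this chain"));
        }
        Ok(ChainClient::Firehose(endpoints))
    }
}
```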

@mangas mangas (Contributor, Author) commented May 20, 2024:

I usually add these kinds of defaults when the setup is not super straightforward, so that anyone using the struct doesn't have to dive into how it works in order to use it.

@mangas (Contributor, Author):

It turns out new was almost unused, so I just removed it along with the default impl for EthereumNetworkAdapters.

Collaborator:

Nice. In general, it doesn't make sense to have a default implementation if it can't hold up the same assumptions as an object constructed 'the normal way'. When you've managed to construct an object, there should be assumptions that are always true. Otherwise, when you read code that uses that object you always have to wonder if this is a real one or just a dummy. But it seems in this case that's now moot.

// pub fn remove(&mut self, provider: &str) {
// self.manager.get_all_sync(&self.chain_id).unwrap_or_default()
// .retain(|adapter| adapter.adapter.provider() != provider);
// }
Collaborator:

Left over from testing?

@mangas mangas (Contributor, Author) commented May 20, 2024:

Was left over from another layer around endpoints that was removed 😛

// .ok_or(anyhow!("network not supported: {}", &network_name))
// .and_then(|adapters| adapters.cheapest_with(requirements))
// }
// }
Collaborator:

Is this code still needed?

@@ -510,18 +512,36 @@ impl BlockchainKind {

 /// A collection of blockchains, keyed by `BlockchainKind` and network.
 #[derive(Default, Debug, Clone)]
-pub struct BlockchainMap(HashMap<(BlockchainKind, String), Arc<dyn Any + Send + Sync>>);
+pub struct BlockchainMap(pub HashMap<(BlockchainKind, ChainId), Arc<dyn Any + Send + Sync>>);
Collaborator:

Do we need to make this pub? Could we get by with adding some accessor method to BlockchainMap instead?

@mangas (Contributor, Author):

Added an iter accessor.
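
Such an accessor might look roughly like this (hypothetical sketch with stand-in types so it compiles on its own; the actual signature in the PR may differ):

```rust
use std::any::Any;
use std::collections::HashMap;
use std::sync::Arc;

// Stand-ins for the real graph-node types.
#[derive(PartialEq, Eq, Hash)]
enum BlockchainKind {
    Ethereum,
    // ...
}

type ChainId = String;

#[derive(Default)]
struct BlockchainMap(HashMap<(BlockchainKind, ChainId), Arc<dyn Any + Send + Sync>>);

impl BlockchainMap {
    // Read-only iteration over the entries instead of making the inner map pub.
    fn iter(
        &self,
    ) -> impl Iterator<Item = (&(BlockchainKind, ChainId), &Arc<dyn Any + Send + Sync>)> {
        self.0.iter()
    }
}
```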


use crate::components::store::{BlockStore as BlockStoreTrait, ChainStore as ChainStoreTrait};

const VALIDATION_ATTEMPT_TTL_SECONDS: i64 = 60 * 5;
Collaborator:

Would be better to make that a Duration. That'll also save you the _SECONDS postfix ;)
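
For example, a small sketch of the suggested change (assuming the constant is only used where a Duration fits):

```rust
use std::time::Duration;

// e.g. instead of `const VALIDATION_ATTEMPT_TTL_SECONDS: i64 = 60 * 5;`
const VALIDATION_ATTEMPT_TTL: Duration = Duration::from_secs(60 * 5);
```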

@mangas (Contributor, Author):

[image attachment]

.ok_or_else(|| IdentValidatorError::UnavailableStore(chain_id.clone()))?;
let store_ident = network_chain.chain_identifier();

if store_ident == &ChainIdentifier::default() {
Collaborator:

This should be something like store_ident.is_unset(), and would ideally be implemented by making ChainIdentifier an enum that has an Unset variant; or even better make ChainStore.chain_identifier return a Result<Option<ChainIdentifier>, Error>. Relying on the convention that the default means 'not really there' is very brittle.
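
A sketch of what the suggested shape could look like (hypothetical; the real ChainIdentifier is a plain struct):

```rust
// Hypothetical: make "unset" explicit instead of comparing against
// ChainIdentifier::default().
enum StoredChainIdentifier {
    Unset,
    Set {
        net_version: String,
        genesis_block_hash: String,
    },
}

impl StoredChainIdentifier {
    fn is_unset(&self) -> bool {
        matches!(self, StoredChainIdentifier::Unset)
    }
}
```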

@mangas (Contributor, Author):

The reason I went with this is basically that the store does not have a way of expressing "unset": the value won't be null, since the field is not nullable. This means that returning an Option would just hide this assumption further down in the store. The main issue with that, IMO, is that the store should not know anything about set/unset unless that can be expressed in the data model.

While this is not strictly true, because the store sets the default (since it isn't passed one), I feel like the better place for all these decisions is the IdentValidator, which would be responsible for all of this. I agree that relying on default values is a bit brittle, but in this case it's a complexity trade-off to avoid a migration in a somewhat critical place.

Collaborator:

I see two options for addressing that:

  1. Change the data model to allow null in those columns. That then requires that all code handles the possibility that this is null.
  2. Insist that for new chains we can actually talk to an endpoint to set these fields; if not, the chain is ignored like it is today. I think that would be a reasonable requirement: adding a new chain requires a bit more oversight. It's also not that big a deal if the new chain gets disabled, as there cannot be any subgraphs for it.

My preference would be (2)

@mangas (Contributor, Author):

I looked into it, and I think it needs to change a bit again because we will need to remove the assumption that one will always be returned, and then expose a way of adding the store. So I think I will do that in a follow-up PR. I feel like this implementation is ok enough as a first step, and it could be merged as is to make the follow-up easier to review and get merged?

rpc_provider_manager: ProviderManager::new(
Logger::root(Discard, o!()),
vec![].into_iter(),
Arc::new(MockIdentValidator),
Collaborator:

Can't this be the real validator? If there are no endpoints, the validator won't have to do anything. I don't like having all these traits floating around where whether they do anything or not depends on the setup, and you're never quite sure if the setup did things correctly.

@mangas (Contributor, Author):

I've only implemented this for BlockStore and then the different test/mock implementations. I could add a Noop implementation, but otherwise we don't have the necessary information to create it without arguments. This struct is not supposed to be mutated manually, so the fact that it is generated as a noop is just intended for situations where it is expected to be used as noop/empty. Not sure yet another empty implementation is really useful. What do you think?
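
For reference, such a no-op validator would be small; a sketch assuming a hypothetical trait shape (the real IdentValidator trait and ChainIdentifier differ):

```rust
use anyhow::Result;

// Stand-ins for illustration only.
struct ChainIdentifier {
    net_version: String,
    genesis_block_hash: String,
}

trait IdentValidator {
    fn check(&self, chain_id: &str, ident: &ChainIdentifier) -> Result<()>;
}

// Accepts everything; only meant for setups where validation is intentionally skipped.
struct NoopIdentValidator;

impl IdentValidator for NoopIdentValidator {
    fn check(&self, _chain_id: &str, _ident: &ChainIdentifier) -> Result<()> {
        Ok(())
    }
}
```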

impl Config {
pub async fn networks(
Collaborator:

This module was really only meant for handling the configuration file and any checks that can be performed on the config file without any other setup. It feels like some of the headaches this needs to address stem from the fact that, because Config has very little setup around it, the code needs to work both when nothing has been set up and when we actually have things like a store available.

It might be better to restructure this so that all this network setup code gets run from, say, a NetworkSetup struct that gets a config and whatever else is needed (like the BlockStore) and can then do its thing knowing that the store exists already.
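
Roughly the kind of structure being suggested, sketched with placeholder types (Config, BlockStore and Networks here are stand-ins, not the real graph-node types):

```rust
use anyhow::Result;
use std::sync::Arc;

// Stand-ins for illustration only.
struct Config;
struct BlockStore;
struct Networks;

// Setup code that gets the config *and* an already-constructed store, so it can
// assume the store exists instead of handling both set-up and not-set-up worlds.
struct NetworkSetup {
    config: Config,
    block_store: Arc<BlockStore>,
}

impl NetworkSetup {
    fn new(config: Config, block_store: Arc<BlockStore>) -> Self {
        NetworkSetup { config, block_store }
    }

    async fn networks(&self) -> Result<Networks> {
        // Build provider managers, chain clients, block ingestors, etc. here.
        Ok(Networks)
    }
}
```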

@mangas (Contributor, Author):

This is basically what Networks is. I could move it to its own module and add a from_config instead of having config.networks(); this was just for convenience. I don't feel strongly either way.

Collaborator:

Yes, I think putting it into its own module would be better. It also seems to me that requiring having a BlockStore here would simplify the logic.

EthereumAdapterSelector, EthereumBlockRefetcher, EthereumRuntimeAdapterBuilder,
EthereumStreamBuilder,
};
use ethereum::{BlockIngestor, EthereumNetworks};
Collaborator:

Nice! I like that this tightens up main overall - it had way too much control flow mixed with intricate but somewhat boring detail.

@mangas mangas force-pushed the filipe/remove-start-provider-checks branch 3 times, most recently from 9671102 to 4beac8f on May 20, 2024 10:15
mangas commented May 20, 2024

> Not a full review (it's Friday evening and I am running out of time) but I wanted to provide some feedback for the PR. It feels like this could be simplified quite a bit by restructuring how it works and changing code so that it can rely on basic assumptions like 'the store is there and we can talk to it'
>
> To elaborate a little more on what I mean: my basic intuition is that whatever structs we use today to represent endpoints now become enums that roughly look like
>
> enum Endpoint {
>     NotReady(details),
>     Ready(details),
>     // .. maybe other states to help with bookkeeping
> }
>
> and methods on Endpoint that, for example, make a request might become fallible, i.e. go from fn foo(&self) -> Something to fn foo(&self) -> Result<Something, Error>, where foo then needs to handle the NotReady case by checking whether the endpoint looks ok, and either return an error if it doesn't or transition to the Ready case when it does. That's a little too simplistic since it involves mutation.
>
> But most of these operations happen on a pool of endpoints, and the manager for that pool, like ProviderManager, can mediate the mutation. Instead of calling methods directly on the endpoint, call them on the manager, which can then call Endpoint.get_ready(self) -> Result<Self, Error> and hide the complexity of readiness checking and of going from not ready -> ready from the outside world. In that world, you'd call foo on the provider manager, maybe with some hints on how to select an appropriate endpoint, and it would succeed if it could find or make an endpoint ready, and fail if none of them are ready.

Thanks for the review!

There are a couple of nuances in how we use the adapters. For example, we have different types for firehose and for rpc; one of these layers takes care of things like failing over to a different adapter if one has a high error rate, or handling a usage cap per adapter. These layers also handle details like turning a single adapter into a pool.

These two types (firehose and rpc) have very different interfaces, so merging them is not something I would want to tackle in a PR like this. However, I think that longer term there could be some benefit in being able to go back and forth between rpc and firehose where that is possible.

The notion of readiness is already encapsulated in the ProviderManager: when calling get_all, the caller already gets ready-to-use adapters, so the rest of the code has no notion of readiness. The reason I decided to encapsulate this in a deeper layer instead of a decorator is that for firehose we turn one definition into a connection pool to avoid hitting some limits (I can find the issue later, but why we do it isn't relevant to this conversation). A decorator approach would therefore cause the same check to happen 20 (the default) times for each adapter instead of once per adapter. This is also similar to how we handle connection failover for firehose and rpc, for the same-ish reason.

If I understand correctly, the suggestion would be to expose the ProviderManager, essentially inverting the layers relative to where they are now. I.e., as a user of an adapter:

fn do_something_with_firehose(pm: ProviderManager) -> FirehoseEndpoint

pm.get_firehose()

instead of

fn do_something_with_firehose(fe: FirehoseEndpoints) -> FirehoseEndpoint

fe.endpoint()

I like that this unifies the types for adapters behind a single manager type. On the other hand, this may have other implications, like how it is implemented: right now it uses async, and there are some cases where we can't use that (which is why I added some escape hatches). The fact that it's a bigger interface is not great, but it may open up some behaviour we can't mock with the current implementation. My gut feeling is that this is quite a large change but probably worth exploring further. Hopefully this is what you meant 😄

lutter commented May 20, 2024

> I like that this unifies the types for adapters behind a single manager type. On the other hand, this may have other implications, like how it is implemented: right now it uses async, and there are some cases where we can't use that (which is why I added some escape hatches). The fact that it's a bigger interface is not great, but it may open up some behaviour we can't mock with the current implementation. My gut feeling is that this is quite a large change but probably worth exploring further. Hopefully this is what you meant 😄

I was a little unclear in my initial message: I didn't mean that we necessarily only have one Endpoint and one ProviderManager struct - it might be that it's better to split that between Rpc and Firehose and have separate structs for each. It would be great if we could unify this all the way, but there might be things we want from one that make no sense for the other. I was more driving at the point that it should be possible to encapsulate the logic around provider testing, failover etc. in something like the ProviderManager, so that users of these do not need to know how that happens.

mangas commented May 21, 2024

> > I like that this unifies the types for adapters behind a single manager type. On the other hand, this may have other implications, like how it is implemented: right now it uses async, and there are some cases where we can't use that (which is why I added some escape hatches). The fact that it's a bigger interface is not great, but it may open up some behaviour we can't mock with the current implementation. My gut feeling is that this is quite a large change but probably worth exploring further. Hopefully this is what you meant 😄
>
> I was a little unclear in my initial message: I didn't mean that we necessarily only have one Endpoint and one ProviderManager struct - it might be that it's better to split that between Rpc and Firehose and have separate structs for each. It would be great if we could unify this all the way, but there might be things we want from one that make no sense for the other. I was more driving at the point that it should be possible to encapsulate the logic around provider testing, failover etc. in something like the ProviderManager, so that users of these do not need to know how that happens.

I agree we can have a similar approach for those and just remove them from the specific implementations. I don't think I will try to tackle that in this PR, but I'm happy to chase that separately 😛

@mangas mangas force-pushed the filipe/remove-start-provider-checks branch from 4beac8f to 9174aa7 on May 22, 2024 11:42
@mangas mangas force-pushed the filipe/remove-start-provider-checks branch from cac3a06 to b029424 on May 24, 2024 11:44
@mangas mangas force-pushed the filipe/remove-start-provider-checks branch from b029424 to 4017d9c on May 24, 2024 13:01

Successfully merging this pull request may close these issues.

Make graph-node tolerate chains not being available during startup