
Remove provider checks at startup #5337

Open · mangas wants to merge 16 commits into master from filipe/remove-start-provider-checks
Conversation


@mangas mangas commented Apr 11, 2024

Fixes #3937

  • New chains are now created with ChainIdentifier::default() on start
  • Adapter genesis + net_version are now checked by the ProviderManager (see the sketch at the end of this description)
  • Adapter checks are temporarily disabled after an intermittent failure
  • An adapter won't be returned by the ProviderManager if the identifier it reports does not match the expected value
  • The BlockIngestor now takes a ChainClient and tries to get an adapter on each poll, so it won't stop if the first poll fails
  • node/src/main.rs refactored so that most of the init code is automatically executed by the new Networks type
  • graphman run now uses the same init code as the node and should be able to run against any chain
  • EthereumNetworks type removed
  • FirehoseNetworks type removed
  • The chain identifier is updated on the first check when the chain is newly created (i.e. still has the default identifier)

Future improvements:

  • Cache genesis once it's set
  • Cache chain ident once it's set
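
For illustration, a minimal, self-contained sketch of the lazy-validation idea described above. All names here (Adapter, Status, the shape of ProviderManager::get_all) are simplified stand-ins, not graph-node's actual API:

```rust
// Simplified stand-ins for illustration only; not the real graph-node types.
#[derive(Clone, Debug, Default, PartialEq, Eq)]
struct ChainIdentifier {
    net_version: String,
    genesis_block_hash: String,
}

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Status {
    Unchecked,
    Valid,
    Invalid,
}

struct Adapter {
    provider: String,
    // In the real code this would come from an RPC/Firehose call at check time.
    reported_ident: ChainIdentifier,
}

struct ProviderManager {
    expected: ChainIdentifier,
    adapters: Vec<(Adapter, Status)>,
}

impl ProviderManager {
    fn new(expected: ChainIdentifier, adapters: Vec<Adapter>) -> Self {
        let adapters = adapters.into_iter().map(|a| (a, Status::Unchecked)).collect();
        ProviderManager { expected, adapters }
    }

    /// Validate lazily on first use instead of at startup, and only hand out
    /// adapters whose reported identifier matches the expected one.
    fn get_all(&mut self) -> Vec<&Adapter> {
        for (adapter, status) in self.adapters.iter_mut() {
            if *status == Status::Unchecked {
                *status = if adapter.reported_ident == self.expected {
                    Status::Valid
                } else {
                    Status::Invalid
                };
            }
        }
        self.adapters
            .iter()
            .filter(|(_, status)| *status == Status::Valid)
            .map(|(adapter, _)| adapter)
            .collect()
    }
}
```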

@mangas mangas marked this pull request as draft April 11, 2024 09:09
@mangas mangas changed the title Remove provider checks at startup Remove provider checks at startup [WIP] Apr 11, 2024
@mangas mangas force-pushed the filipe/remove-start-provider-checks branch 3 times, most recently from 4d95fd8 to 64963c5 on April 15, 2024 16:20
@mangas mangas force-pushed the filipe/remove-start-provider-checks branch 2 times, most recently from b486f8a to 1c43770 on April 22, 2024 13:23
@mangas mangas force-pushed the filipe/remove-start-provider-checks branch 8 times, most recently from 9ad4249 to d6c76f8 on May 13, 2024 10:41
@mangas mangas marked this pull request as ready for review May 13, 2024 15:11
@mangas mangas force-pushed the filipe/remove-start-provider-checks branch from f15c7dd to 002def6 on May 13, 2024 15:20
@mangas mangas changed the title Remove provider checks at startup [WIP] Remove provider checks at startup May 13, 2024
@mangas mangas force-pushed the filipe/remove-start-provider-checks branch 3 times, most recently from cdfca0f to c4886a3 on May 14, 2024 09:32
@mangas mangas requested a review from lutter May 14, 2024 09:35
@mangas mangas force-pushed the filipe/remove-start-provider-checks branch from c4886a3 to c9784a8 on May 14, 2024 15:11
@fordN fordN requested review from incrypto32 and zorancv and removed request for lutter May 14, 2024 15:56
@mangas mangas force-pushed the filipe/remove-start-provider-checks branch from c828c7d to d3a0e68 on May 16, 2024 10:30
@fordN fordN requested review from lutter and removed request for incrypto32 May 16, 2024 15:41
@mangas mangas force-pushed the filipe/remove-start-provider-checks branch from 808fd97 to ec0306a on May 17, 2024 14:38
@lutter lutter (Collaborator) left a comment:

Not a full review (it's Friday evening and I am running out of time) but I wanted to provide some feedback for the PR. It feels like this could be simplified quite a bit by restructuring how it works and changing code so that it can rely on basic assumptions like 'the store is there and we can talk to it'

To elaborate a little more on what I mean: my basic intuition is that whatever structs we use today to represent endpoints now become enums that roughly look like

enum Endpoint {
    NotReady(details),
    Ready(details),
    // .. maybe other states to help with bookkeeping
}

and methods on Endpoint that, for example, make a request might become fallible, i.e. go from fn foo(&self) -> Something to fn foo(&self) -> Result<Something, Error>, where foo then needs to handle the NotReady case by checking whether the endpoint looks ok, and either return an error if it doesn't or transition to the Ready case when it does. That's a little too simplistic since it involves mutation.

But most of these operations happen on a pool of endpoints, and the manager for that pool, like ProviderManager, can mediate the mutation. Instead of calling methods directly on the endpoint, call them on the manager, which can then call Endpoint.get_ready(self) -> Result<Self, Error> and hide the complexity of readiness checking and of going from not ready -> ready from the outside world. In that world, you'd call foo on the provider manager, maybe with some hints on how to select an appropriate endpoint, and it would succeed if it could find or make an endpoint ready, and fail if none of them are ready.
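
A rough, self-contained sketch of that shape. Endpoint, EndpointManager and check_endpoint are hypothetical stand-ins here, not the existing graph-node types:

```rust
use anyhow::{anyhow, Result};

enum Endpoint {
    NotReady { url: String },
    Ready { url: String },
    // .. maybe other states to help with bookkeeping
}

impl Endpoint {
    // Consume self and either transition to Ready or explain why it can't.
    fn get_ready(self) -> Result<Self> {
        match self {
            Endpoint::Ready { .. } => Ok(self),
            Endpoint::NotReady { url } => {
                if check_endpoint(&url) {
                    Ok(Endpoint::Ready { url })
                } else {
                    Err(anyhow!("endpoint {} is not ready", url))
                }
            }
        }
    }
}

struct EndpointManager {
    endpoints: Vec<Endpoint>,
}

impl EndpointManager {
    // Callers only ever talk to the manager; the NotReady -> Ready bookkeeping
    // stays hidden in here. For simplicity this sketch drops endpoints that fail
    // the readiness check; a real manager would keep them around for retries.
    fn foo(&mut self) -> Result<String> {
        self.endpoints = std::mem::take(&mut self.endpoints)
            .into_iter()
            .filter_map(|ep| ep.get_ready().ok())
            .collect();

        match self.endpoints.first() {
            Some(Endpoint::Ready { url }) => Ok(format!("made a request via {}", url)),
            _ => Err(anyhow!("no endpoint is ready")),
        }
    }
}

// Stand-in for the real genesis / net_version probe.
fn check_endpoint(_url: &str) -> bool {
    true
}
```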

manager: ProviderManager::default(),
call_only_adapters: vec![],
retest_percent: DEFAULT_ADAPTER_ERROR_RETEST_PERCENT,
}
}
}
Collaborator:

Why do we need all these default implementations? I think the original one was already a bad idea, and trying to keep up with that just forces us to have more and more basically meaningless default impls. I would always try to make sure that when a struct is constructed, it also has meaningful functionality.

Playing with the code a little, it seems that the associated Client type for Blockchain needs to implement Default because ChainClient::new_firehose needs a client to pass into ChainClient::new. I think this could all be avoided if new_firehose returned Result<Self, ..> and just barfed if the firehose_available check fails - that avoids a misconfiguration causing us to have a dummy rpc endpoint.
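
A sketch of the suggested shape, using simplified stand-ins for ChainClient and FirehoseEndpoints rather than the real definitions:

```rust
use anyhow::{anyhow, Result};

// Simplified stand-ins; not graph-node's real types.
struct FirehoseEndpoints(Vec<String>);

enum ChainClient {
    Firehose(FirehoseEndpoints),
    Rpc(String),
}

impl ChainClient {
    // Returning Result means a misconfigured firehose chain fails loudly at
    // construction time instead of silently falling back to a dummy RPC client,
    // so the associated Client type no longer needs a Default impl.
    fn new_firehose(endpoints: FirehoseEndpoints) -> Result<Self> {
        if endpoints.0.is_empty() {
            return Err(anyhow!("no firehose endpoints configured for this chain"));
        }
        Ok(ChainClient::Firehose(endpoints))
    }
}
```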

@mangas mangas (Contributor, Author) commented May 20, 2024:

I usually add these kinds of defaults when the setup is not super straightforward, so that anyone using the struct doesn't have to dive into how it works in order to use it.

@mangas (Contributor, Author):

It turns out new was almost unused, so I just removed it along with the default impl for EthereumNetworkAdapters.

Collaborator:

Nice. In general, it doesn't make sense to have a default implementation if it can't hold up the same assumptions as an object constructed 'the normal way'. When you've managed to construct an object, there should be assumptions that are always true. Otherwise, when you read code that uses that object you always have to wonder if this is a real one or just a dummy. But it seems in this case that's now moot.

// pub fn remove(&mut self, provider: &str) {
// self.manager.get_all_sync(&self.chain_id).unwrap_or_default()
// .retain(|adapter| adapter.adapter.provider() != provider);
// }
Collaborator:

Left over from testing?

@mangas mangas (Contributor, Author) commented May 20, 2024:

Was left over from another layer around endpoints that was removed 😛

// .ok_or(anyhow!("network not supported: {}", &network_name))
// .and_then(|adapters| adapters.cheapest_with(requirements))
// }
// }
Collaborator:

Is this code still needed?

@@ -510,18 +512,36 @@ impl BlockchainKind {

 /// A collection of blockchains, keyed by `BlockchainKind` and network.
 #[derive(Default, Debug, Clone)]
-pub struct BlockchainMap(HashMap<(BlockchainKind, String), Arc<dyn Any + Send + Sync>>);
+pub struct BlockchainMap(pub HashMap<(BlockchainKind, ChainId), Arc<dyn Any + Send + Sync>>);
Collaborator:

Do we need to make this pub? Could we get by with adding some accessor method to BlockchainMap instead?

@mangas (Contributor, Author):

Added an iter accessor.
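
Such an accessor might look roughly like this (hypothetical sketch with stand-in types so it compiles on its own; the actual signature in the PR may differ):

```rust
use std::any::Any;
use std::collections::HashMap;
use std::sync::Arc;

// Stand-ins for the real graph-node types.
#[derive(PartialEq, Eq, Hash)]
enum BlockchainKind {
    Ethereum,
    // ...
}

type ChainId = String;

#[derive(Default)]
struct BlockchainMap(HashMap<(BlockchainKind, ChainId), Arc<dyn Any + Send + Sync>>);

impl BlockchainMap {
    // Read-only iteration over the entries instead of making the inner map pub.
    fn iter(
        &self,
    ) -> impl Iterator<Item = (&(BlockchainKind, ChainId), &Arc<dyn Any + Send + Sync>)> {
        self.0.iter()
    }
}
```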


use crate::components::store::{BlockStore as BlockStoreTrait, ChainStore as ChainStoreTrait};

const VALIDATION_ATTEMPT_TTL_SECONDS: i64 = 60 * 5;
Collaborator:

Would be better to make that a Duration. That'll also save you the _SECONDS postfix ;)
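
For example, a small sketch of the suggested change (assuming the constant is only used where a Duration fits):

```rust
use std::time::Duration;

// e.g. instead of `const VALIDATION_ATTEMPT_TTL_SECONDS: i64 = 60 * 5;`
const VALIDATION_ATTEMPT_TTL: Duration = Duration::from_secs(60 * 5);
```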

@mangas (Contributor, Author):

[image attachment]

.ok_or_else(|| IdentValidatorError::UnavailableStore(chain_id.clone()))?;
let store_ident = network_chain.chain_identifier();

if store_ident == &ChainIdentifier::default() {
Collaborator:

This should be something like store_ident.is_unset(), and would ideally be implemented by making ChainIdentifier an enum that has an Unset variant; or even better make ChainStore.chain_identifier return a Result<Option<ChainIdentifier>, Error>. Relying on the convention that the default means 'not really there' is very brittle.
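
A sketch of what the suggested shape could look like (hypothetical; the real ChainIdentifier is a plain struct):

```rust
// Hypothetical: make "unset" explicit instead of comparing against
// ChainIdentifier::default().
enum StoredChainIdentifier {
    Unset,
    Set {
        net_version: String,
        genesis_block_hash: String,
    },
}

impl StoredChainIdentifier {
    fn is_unset(&self) -> bool {
        matches!(self, StoredChainIdentifier::Unset)
    }
}
```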

@mangas (Contributor, Author):

The reason I went with this is basically that the store does not have a way of expressing "unset": the value won't be null, since the field is not nullable. This means that returning an Option would just hide this assumption further down in the store. The main issue with that, IMO, is that the store should not know anything about set/unset unless that can be expressed in the data model.

While this is not strictly true, because the store sets the default (since it isn't passed one), I feel like the better place for all these decisions is the IdentValidator, which would be responsible for all of this. I agree that relying on default values is a bit brittle, but in this case it's a complexity trade-off to avoid a migration in a somewhat critical place.

Collaborator:

I see two options for addressing that:

  1. Change the data model to allow null in those columns. That then requires that all code handles the possibility that this is null.
  2. Insist that for new chains we can actually talk to an endpoint to set these fields; if not, the chain is ignored like it is today. I think that would be a reasonable requirement: adding a new chain requires a bit more oversight. It's also not that big a deal if the new chain gets disabled, as there cannot be any subgraphs for it.

My preference would be (2)

@mangas (Contributor, Author):

I looked into it, and I think it needs to change a bit again because we will need to remove the assumption that one will always be returned, and then expose a way of adding the store. So I think I will do that in a follow-up PR. I feel like this implementation is ok enough as a first step, and it could be merged as is to make the follow-up easier to review and get merged?

rpc_provider_manager: ProviderManager::new(
Logger::root(Discard, o!()),
vec![].into_iter(),
Arc::new(MockIdentValidator),
Collaborator:

Can't this be the real validator? If there are no endpoints, the validator won't have to do anything. I don't like having all these traits floating around where whether they do anything or not depends on the setup, and you're never quite sure if the setup did things correctly.

@mangas (Contributor, Author):

I've only implemented this for BlockStore and then the different test/mock implementations. I could add a Noop implementation, but otherwise we don't have the necessary information to create it without arguments. This struct is not supposed to be mutated manually, so the fact that it is generated as a noop is just intended for situations where it is expected to be used as noop/empty. Not sure yet another empty implementation is really useful. What do you think?
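
For reference, such a no-op validator would be small; a sketch assuming a hypothetical trait shape (the real IdentValidator trait and ChainIdentifier differ):

```rust
use anyhow::Result;

// Stand-ins for illustration only.
struct ChainIdentifier {
    net_version: String,
    genesis_block_hash: String,
}

trait IdentValidator {
    fn check(&self, chain_id: &str, ident: &ChainIdentifier) -> Result<()>;
}

// Accepts everything; only meant for setups where validation is intentionally skipped.
struct NoopIdentValidator;

impl IdentValidator for NoopIdentValidator {
    fn check(&self, _chain_id: &str, _ident: &ChainIdentifier) -> Result<()> {
        Ok(())
    }
}
```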

impl Config {
pub async fn networks(
Collaborator:

This module was really only meant for handling the configuration file and any checks that can be performed on the config file without any other setup. It feels like some of the headaches this needs to address stem from the fact that, because Config has very little setup around it, the code needs to work both when nothing has been set up and when we actually have things like a store available.

It might be better to restructure this so that all this network setup code gets run from, say, a NetworkSetup struct that gets a config and whatever else is needed (like the BlockStore) and can then do its thing knowing that the store exists already.
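
Roughly the kind of structure being suggested, sketched with placeholder types (Config, BlockStore and Networks here are stand-ins, not the real graph-node types):

```rust
use anyhow::Result;
use std::sync::Arc;

// Stand-ins for illustration only.
struct Config;
struct BlockStore;
struct Networks;

// Setup code that gets the config *and* an already-constructed store, so it can
// assume the store exists instead of handling both set-up and not-set-up worlds.
struct NetworkSetup {
    config: Config,
    block_store: Arc<BlockStore>,
}

impl NetworkSetup {
    fn new(config: Config, block_store: Arc<BlockStore>) -> Self {
        NetworkSetup { config, block_store }
    }

    async fn networks(&self) -> Result<Networks> {
        // Build provider managers, chain clients, block ingestors, etc. here.
        Ok(Networks)
    }
}
```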

@mangas (Contributor, Author):

This is basically what Networks is. I could move it to its own module and add a from_config instead of having config.networks(); this was just for convenience. I don't feel strongly either way.

Collaborator:

Yes, I think putting it into its own module would be better. It also seems to me that requiring having a BlockStore here would simplify the logic.

EthereumAdapterSelector, EthereumBlockRefetcher, EthereumRuntimeAdapterBuilder,
EthereumStreamBuilder,
};
use ethereum::{BlockIngestor, EthereumNetworks};
Collaborator:

Nice! I like that this tightens up main overall - it had way too much control flow mixed with intricate but somewhat boring detail.

@mangas mangas force-pushed the filipe/remove-start-provider-checks branch 3 times, most recently from 9671102 to 4beac8f on May 20, 2024 10:15
mangas commented May 20, 2024

> Not a full review (it's Friday evening and I am running out of time) but I wanted to provide some feedback for the PR. It feels like this could be simplified quite a bit by restructuring how it works and changing code so that it can rely on basic assumptions like 'the store is there and we can talk to it'
>
> To elaborate a little more on what I mean: my basic intuition is that whatever structs we use today to represent endpoints now become enums that roughly look like
>
> enum Endpoint {
>     NotReady(details),
>     Ready(details),
>     // .. maybe other states to help with bookkeeping
> }
>
> and methods on Endpoint that, for example, make a request might become fallible, i.e. go from fn foo(&self) -> Something to fn foo(&self) -> Result<Something, Error>, where foo then needs to handle the NotReady case by checking whether the endpoint looks ok, and either return an error if it doesn't or transition to the Ready case when it does. That's a little too simplistic since it involves mutation.
>
> But most of these operations happen on a pool of endpoints, and the manager for that pool, like ProviderManager, can mediate the mutation. Instead of calling methods directly on the endpoint, call them on the manager, which can then call Endpoint.get_ready(self) -> Result<Self, Error> and hide the complexity of readiness checking and of going from not ready -> ready from the outside world. In that world, you'd call foo on the provider manager, maybe with some hints on how to select an appropriate endpoint, and it would succeed if it could find or make an endpoint ready, and fail if none of them are ready.

Thanks for the review!

There are a couple of nuances in how we use the adapters. For example, we have different types for firehose and for rpc; one of these layers takes care of things like failing over to a different adapter if one has a high error rate, or handling a usage cap per adapter. These layers also handle details like turning a single adapter into a pool.

These two types (firehose and rpc) have very different interfaces, so merging them is not something I would want to tackle in a PR like this. However, I think that longer term there could be some benefit in being able to go back and forth between rpc and firehose where that is possible.

The notion of readiness is already encapsulated in the ProviderManager: when calling get_all, the caller already gets ready-to-use adapters, so the rest of the code has no notion of readiness. The reason I decided to encapsulate this in a deeper layer instead of a decorator is that for firehose we turn one definition into a connection pool to avoid hitting some limits (I can find the issue later, but why we do it isn't relevant to this conversation). A decorator approach would therefore cause the same check to happen 20 (the default) times for each adapter instead of once per adapter. This is also similar to how we handle connection failover for firehose and rpc, for the same-ish reason.

If I understand correctly, the suggestion would be to expose the ProviderManager, essentially inverting the layers relative to where they are now. I.e., as a user of an adapter:

fn do_something_with_firehose(pm: ProviderManager) -> FirehoseEndpoint

pm.get_firehose()

instead of

fn do_something_with_firehose(fe: FirehoseEndpoints) -> FirehoseEndpoint

fe.endpoint()

I like that this unifies the types for adapters behind a single manager type. On the other hand, this may have other implications, like how it is implemented: right now it uses async, and there are some cases where we can't use that (which is why I added some escape hatches). The fact that it's a bigger interface is not great, but it may open up some behaviour we can't mock with the current implementation. My gut feeling is that this is quite a large change but probably worth exploring further. Hopefully this is what you meant 😄

lutter commented May 20, 2024

> I like that this unifies the types for adapters behind a single manager type. On the other hand, this may have other implications, like how it is implemented: right now it uses async, and there are some cases where we can't use that (which is why I added some escape hatches). The fact that it's a bigger interface is not great, but it may open up some behaviour we can't mock with the current implementation. My gut feeling is that this is quite a large change but probably worth exploring further. Hopefully this is what you meant 😄

I was a little unclear in my initial message: I didn't mean that we necessarily only have one Endpoint and one ProviderManager struct - it might be that it's better to split that between Rpc and Firehose and have separate structs for each. It would be great if we could unify this all the way, but there might be things we want from one that make no sense for the other. I was more driving at the point that it should be possible to encapsulate the logic around provider testing, failover etc. in something like the ProviderManager, so that users of these do not need to know how that happens.

mangas commented May 21, 2024

> > I like that this unifies the types for adapters behind a single manager type. On the other hand, this may have other implications, like how it is implemented: right now it uses async, and there are some cases where we can't use that (which is why I added some escape hatches). The fact that it's a bigger interface is not great, but it may open up some behaviour we can't mock with the current implementation. My gut feeling is that this is quite a large change but probably worth exploring further. Hopefully this is what you meant 😄
>
> I was a little unclear in my initial message: I didn't mean that we necessarily only have one Endpoint and one ProviderManager struct - it might be that it's better to split that between Rpc and Firehose and have separate structs for each. It would be great if we could unify this all the way, but there might be things we want from one that make no sense for the other. I was more driving at the point that it should be possible to encapsulate the logic around provider testing, failover etc. in something like the ProviderManager, so that users of these do not need to know how that happens.

I agree we can have a similar approach for those and just remove them from the specific implementations. I don't think I will try to tackle that in this PR, but I'm happy to chase that separately 😛

@mangas mangas force-pushed the filipe/remove-start-provider-checks branch from 4beac8f to 9174aa7 on May 22, 2024 11:42
@mangas mangas force-pushed the filipe/remove-start-provider-checks branch from cac3a06 to b029424 on May 24, 2024 11:44
@mangas mangas force-pushed the filipe/remove-start-provider-checks branch from b029424 to 4017d9c on May 24, 2024 13:01

Successfully merging this pull request may close these issues.

Make graph-node tolerate chains not being available during startup