
Consolidate Record Persistence Logic #867

Open · dennis-tra opened this issue Aug 21, 2023 · 4 comments

@dennis-tra (Contributor) commented Aug 21, 2023

Right now, the IpfsDHT takes two datastores: one that handles PUT_VALUE and GET_VALUE requests, and one just for provider records, wrapped in a ProviderManager.

In an effort to "Remove special cases for Peer and Provider records", I thought that there must be a way to unify things and find better abstractions.

One thing I need to mention first: it is not entirely clear to me why the ProviderManager is built around a single loop that selects on 1) additions of provider records, 2) requests for provider records, 3) garbage collection query results, and 4) the garbage collection trigger. Tracing the git history shows that this design dates back to 2014. Given a thread-safe datastore, I think it would be more performant and easier to reason about if AddProvider and GetProviders just wrote to and read from the underlying datastore directly. Garbage collection would happen on a schedule in parallel.
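For illustration, direct access could look roughly like the sketch below. This is not the current ProviderManager API; the key layout (/providers/&lt;multihash&gt;/&lt;peer id&gt;) and the idea of storing the write time as the value for a later GC pass are assumptions for the example.

```go
import (
	"context"
	"time"

	ds "github.com/ipfs/go-datastore"
	"github.com/ipfs/go-datastore/query"
	"github.com/libp2p/go-libp2p/core/peer"
	"github.com/multiformats/go-multihash"
)

// providerStore reads and writes provider records directly, relying on the
// underlying datastore being safe for concurrent use.
type providerStore struct {
	dstore ds.Datastore
}

func (s *providerStore) AddProvider(ctx context.Context, key multihash.Multihash, p peer.ID) error {
	k := ds.NewKey("/providers").ChildString(key.B58String()).ChildString(p.String())
	// Store the write time so a scheduled GC pass can expire old records.
	return s.dstore.Put(ctx, k, []byte(time.Now().UTC().Format(time.RFC3339)))
}

func (s *providerStore) GetProviders(ctx context.Context, key multihash.Multihash) ([]peer.ID, error) {
	prefix := ds.NewKey("/providers").ChildString(key.B58String())
	res, err := s.dstore.Query(ctx, query.Query{Prefix: prefix.String(), KeysOnly: true})
	if err != nil {
		return nil, err
	}
	defer res.Close()

	var out []peer.ID
	for r := range res.Next() {
		if r.Error != nil {
			return nil, r.Error
		}
		// The peer ID is the last component of the key.
		p, err := peer.Decode(ds.RawKey(r.Key).BaseNamespace())
		if err != nil {
			continue // skip malformed entries
		}
		out = append(out, p)
	}
	return out, nil
}
```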

However, a garbage collection run would race new provider record writes. If I read the current code correctly, this isn't a regression. I think this could be discussed in a separate issue.


Proposal

We could let the IpfsDHT expose a method to register record datastores:

```go
func (dht *IpfsDHT) RegisterDatastore(dstore mount.Mount) { }
```

In the case of the IPFS DHT we would register datastores for the pk, ipns, and providers prefixes. The underlying datastores for all of them could be identical but don't have to be.

In NewIpfsDHT, we would construct a single mount.Datastore that all writes and reads go through.
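As a sketch (the function and variable names are illustrative, not an existing API), the construction could use go-datastore's mount package:

```go
import (
	ds "github.com/ipfs/go-datastore"
	"github.com/ipfs/go-datastore/mount"
)

// newRecordDatastore assembles the single datastore that all record reads
// and writes go through. The three underlying datastores may be the same
// instance or entirely different implementations.
func newRecordDatastore(pkStore, ipnsStore, providerStore ds.Datastore) ds.Datastore {
	return mount.New([]mount.Mount{
		{Prefix: ds.NewKey("/pk"), Datastore: pkStore},
		{Prefix: ds.NewKey("/ipns"), Datastore: ipnsStore},
		{Prefix: ds.NewKey("/providers"), Datastore: providerStore},
	})
}
```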

GET_VALUE/PUT_VALUE requests would read from and write to the common mount datastore, which would dispatch to the correct underlying datastore based on the key's namespace prefix. The currently supported namespaces are pk and ipns. With the "mounted" approach, both RPCs would also automatically start to support provider records if the key in the request had the /providers/ prefix and the binary record field had the correct format. In my head, this is one step toward #584.

ADD_PROVIDER/GET_PROVIDERS requests would prefix the key with /providers/ and read/write from/to the common mount datastore as well.
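To illustrate the dispatch-by-prefix idea (handler names are placeholders and the record encoding is omitted):

```go
import (
	"context"

	ds "github.com/ipfs/go-datastore"
)

// Both value and provider writes would go through the same mount datastore;
// the key's prefix decides which underlying datastore handles them.
func handlePutValue(ctx context.Context, dstore ds.Datastore, key string, rec []byte) error {
	// e.g. key = "/ipns/<id>" or "/pk/<id>" straight from the request
	return dstore.Put(ctx, ds.NewKey(key), rec)
}

func handleAddProvider(ctx context.Context, dstore ds.Datastore, key string, rec []byte) error {
	// ADD_PROVIDER prefixes the request key with /providers/ before storing.
	return dstore.Put(ctx, ds.NewKey("/providers").ChildString(key), rec)
}
```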

This would allow us to remove the enableProviders and enableValues flags. If there's no datastore for the providers prefix we wouldn't handle ADD_PROVIDER/GET_PROVIDERS requests. If there's no datastore at all, we wouldn't handle *_VALUE/*_PROVIDERS requests.


However, we're handling PK/IPNS and provider records differently.

When adding PK/IPNS records, we fetch any existing record from disk, validate that existing record (e.g. it might be expired), select the incoming or existing one, and potentially store the incoming one (if we deem it "better").

When adding provider records, we write them to disk and add peer addresses to our Peerstore - no validation of existing data.

My idea would be to provide ValueStore and ProviderStore implementations of the ds.Datastore interface, which could then be used as the datastores for the prefix mounts.

The ProviderStore would be similar to the ProviderManager (but would call the underlying datastore directly). E.g., it could also use the LRU cache, implement the garbage collection logic, and keep a reference to the peerstore to store addresses.

The ValueStore would receive, among other things, a record.Validator and encapsulate the logic from above (fetching existing record, selecting the "better" one). This means we wouldn't need the namespaced validator on IpfsDHT.
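A rough sketch of that Put path (again, names are illustrative; the real record wire format and key handling are omitted):

```go
import (
	"context"
	"errors"

	ds "github.com/ipfs/go-datastore"
	record "github.com/libp2p/go-libp2p-record"
)

// valueStore wraps a datastore and a record.Validator and only overwrites
// an existing record when the incoming one is selected as "better".
type valueStore struct {
	dstore    ds.Datastore
	validator record.Validator
}

func (v *valueStore) Put(ctx context.Context, key ds.Key, value []byte) error {
	// Reject records that don't validate at all.
	if err := v.validator.Validate(key.String(), value); err != nil {
		return err
	}

	existing, err := v.dstore.Get(ctx, key)
	if err != nil && !errors.Is(err, ds.ErrNotFound) {
		return err
	}

	// If there is a valid existing record, let the validator select the
	// better of the two; only overwrite when the incoming record wins.
	if err == nil && v.validator.Validate(key.String(), existing) == nil {
		idx, err := v.validator.Select(key.String(), [][]byte{value, existing})
		if err != nil {
			return err
		}
		if idx != 0 {
			return nil // keep the existing record
		}
	}
	return v.dstore.Put(ctx, key, value)
}
```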


The above structure would allow users of go-libp2p-kad-dht to inject capabilities to support new record types by registering their own datastore for a certain prefix. The GET_VALUE/PUT_VALUE RPCs would then support it.


Concerns

I'm not sure how I feel about moving the logic of how to handle a certain record type into the persistence layer.

cc @aschmahmann

@guillaumemichel (Contributor) commented:

Great writeup @dennis-tra!

Another question is: do we need to keep storing pk or can we simply drop it?

@aschmahmann (Contributor) commented:

Overall this seems reasonable to me. You could also try to decouple the storage from the "how do I merge records" logic instead of asking the caller to create this. That is likely going to be an implementation detail for many consumers of this library, so either way you'd end up creating helper functions here to do the combining work; it's mostly a question of whether you expose that via the constructor or not, which is largely a matter of taste 🤷.

This would allow us to remove the enableProviders and enableValues flags. If there's no datastore for the providers prefix we wouldn't handle ADD_PROVIDER/GET_PROVIDERS requests.

Using the mount datastore won't let you do this, although you should probably ask callers to pass datastores for each record type rather than having them predefine a mount with all of them. Also, unless there are going to be a bunch of breakages plus utilities for creating an /ipfs/kad/1.0.0 DHT vs. others, it would be great if you could leave in the existing checks that stop people from accidentally creating non-compliant /ipfs/kad/1.0.0 DHTs (that includes disabling various record types).

do we need to keep storing pk or can we simply drop it?

You'd need to double-check; my guess is that it's unused by /ipfs/kad/1.0.0 since IPNS implementations now tend to ship the public keys in the records, but I don't know for sure since kubo isn't the only user of that network (although it's the most used one). IIUC we already have some measurement infra to figure out the number of provider records a node stores; extending that to public keys seems like it would be relatively straightforward.

@dennis-tra (Contributor, Author) commented Aug 23, 2023

Thanks @aschmahmann! I experimented with an abstraction that I called "Backend" for now. This is a bit different from what I laid out in the first post. The code is all in PR #864.

It's a small interface that can be implemented to support additional record types for PUT_VALUE/GET_VALUE RPCs. There are three "reference" Backend implementations for the ipns, pk, and providers namespaces/record types.
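Roughly, the idea is a per-namespace interface along these lines (sketch only; the actual definition is in #864 and may differ):

```go
import "context"

// Backend (sketch): one small interface per record type that the
// PUT_VALUE/GET_VALUE handlers dispatch to based on the key's prefix.
type Backend interface {
	// Store persists a record under key and returns the record that ended
	// up being stored (which may be an existing, "better" one).
	Store(ctx context.Context, key string, value any) (any, error)
	// Fetch returns the record currently stored under key, if any.
	Fetch(ctx context.Context, key string) (any, error)
}
```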

I was also able to enforce support for the above three record types if the DHT uses the /ipfs/kad/1.0.0 protocol.

@guillaumemichel (Contributor) commented:

The Backend abstraction is awesome @dennis-tra!
