
[WIP] Doc/add use cases #215

Open
ZenGround0 wants to merge 8 commits into master from doc/add-use-cases
Conversation

@ZenGround0 (Collaborator) commented Oct 29, 2017

This is not close to done yet, as most use case work is missing. @hsanjuan, if you get a chance to look this over I'd like to know:

  1. If I'm saying anything false, or anything that you disagree with, particularly in the first two sections.
  2. Whether the first two sections belong in this document (probably not), or in this repo at all (maybe not). I can see that these sections might seem somewhat off topic for use case descriptions. They are the result of my making sense of scattered historical information, which has made the motivation of the project much clearer and has gotten me up to speed on prior thought regarding ipfs-cluster user interfaces. In my head both UI and general motivation are strongly tied to use cases, and writing this down helped me and might help others understand the project.
  3. Any input on the informal use case structure and how it could be improved.

@ghost assigned ZenGround0 on Oct 29, 2017
@ghost added the status/in-progress label on Oct 29, 2017
@coveralls

Coverage Status

Coverage decreased (-0.09%) to 74.805% when pulling a6b55c0 on doc/add-use-cases into e51f771 on master.

@coveralls

Coverage Status

Coverage decreased (-0.02%) to 74.87% when pulling 7f5625a on doc/add-use-cases into e51f771 on master.

@hsanjuan (Collaborator) left a comment:

Hey, I think this is a good start. A use-case doc would probably be much longer and more detailed, but we also need to decide which particular use cases are worth that effort (that'd be because we aim to make them happen). Until then it's great to gather all the pieces of information in one place like this.

Early discussion, again in https://github.com/ipfs/notes/issues/58, outlines a particular approach that remains relevant to discussion today: ipfs-cluster as a virtual ipfs node (vNode for short). The idea is that ipfs-cluster nodes could implement the ipfs api, only exposing state agreed upon by all nodes through consensus. A simple example: all nodes in the cluster agree on a single peer id, and after reaching agreement all nodes respond with this id to requests on their ipfs vNode id endpoint. A more useful example: an ipfs-cluster node gets an `add <file>` request through the ipfs vNode add endpoint; the cluster nodes coordinate adding this file, perhaps doing something clever like replicating file data across multiple nodes, and reach agreement via consensus that the data was added and pinned successfully. After all this occurs the api endpoint returns the normal message indicating success.
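To make the vNode idea concrete, here is a minimal, purely hypothetical sketch of a peer answering the ipfs `id` endpoint from agreed-upon cluster state. The `ClusterState` type, its fields, the placeholder id and addresses, and the listen address are all made up for illustration; this is not existing ipfs-cluster code.

```go
package main

import (
	"encoding/json"
	"log"
	"net/http"
)

// ClusterState is a hypothetical view of state agreed upon through consensus.
// In the vNode idea, every peer would answer ipfs API requests from this
// shared state rather than from its local ipfs daemon.
type ClusterState struct {
	VNodeID   string   // the single peer id the whole cluster agreed to present
	Addresses []string // multiaddresses of all peers backing the vNode
}

func main() {
	state := ClusterState{
		VNodeID:   "QmClusterVNodeID...", // placeholder, not a real hash
		Addresses: []string{"/ip4/10.0.0.1/tcp/4001", "/ip4/10.0.0.2/tcp/4001"},
	}

	// Mimic the go-ipfs /api/v0/id endpoint, but reply with the cluster-wide
	// identity instead of a single daemon's identity.
	http.HandleFunc("/api/v0/id", func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "application/json")
		json.NewEncoder(w).Encode(map[string]interface{}{
			"ID":        state.VNodeID,
			"Addresses": state.Addresses,
		})
	})

	log.Fatal(http.ListenAndServe("127.0.0.1:9095", nil)) // hypothetical listen address
}
```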

Designing ipfs-cluster to work this way has some benefits, including a familiar api for user interaction, the ability to use a cluster anywhere an ipfs node is used, and the ability to make ipfs-clusters depend on other ipfs-clusters as the ipfs nodes that they coordinate. This last property has the potential to make scaling ipfs-cluster easier; if large groups of participants can be abstracted away, consensus peer group size can remain bounded as cluster participants grow arbitrarily. It is not always the case that an ipfs api is the best user interface for adding files to ipfs-cluster. If ipfs-cluster were to support behavior like per-pin replication configuration, for example different pins specifying different replication factors as it does today, then the ipfs api would have no endpoint to encode this information and some kind of cluster-specific interface would be needed (see https://botbot.me/freenode/ipfs/2017-02-09/ for a somewhat related conversation that includes discussion of more use cases).

Collaborator:

We do support 'per-pin' replication configuration information. You can assign a different replication factor to every pin.

@ZenGround0 (Collaborator Author):

@flyingzumwalt today during the all-hands you mentioned having come across potential use cases while coordinating with data-together. If you have any of these in writing (very brief descriptions are fine), please feel free to add them below in the comments so that I can include them in our aggregation and keep track. Thank you!

@GitCop commented Dec 7, 2017

There were the following issues with your Pull Request

  • Commit: 52c1691

  • Invalid signoff. Commit message must end with
    License: MIT
    Signed-off-by:

  • Your subject line is longer than 80 characters

  • Commit: a6b55c0

  • Invalid signoff. Commit message must end with
    License: MIT
    Signed-off-by:

  • Commit: 7f5625a

  • Invalid signoff. Commit message must end with
    License: MIT
    Signed-off-by:

Guidelines are available at https://github.com/ipfs/ipfs-cluster/blob/master/contribute.md


This message was auto-generated by https://gitcop.com


@ZenGround0 changed the title from "[WIP] Doc/add use cases Highly WIP" to "[WIP] Doc/add use cases" on Dec 18, 2017
@coveralls

Coverage Status

Coverage decreased (-2.08%) to 72.817% when pulling 305b2fc on doc/add-use-cases into e51f771 on master.


@coveralls

Coverage Status

Coverage decreased (-2.09%) to 72.799% when pulling 8b9f65f on doc/add-use-cases into e51f771 on master.


@ZenGround0 (Collaborator Author):

@hsanjuan A lot of this is still pretty rough. It was nice to get some rough ideas down though. If you get a chance and have any interest in checking this out, feedback is super welcome. Also feel free to add in the gateway stuff if you like.

@coveralls

Coverage Status

Coverage decreased (-2.08%) to 72.817% when pulling d2abac4 on doc/add-use-cases into e51f771 on master.

@hsanjuan (Collaborator) left a comment:

I have read through and added some thoughts. I think the main use cases are here.

The mirror (with large-tree support), cdn, and pinning rings are probably the most important ones; those are the ones we'd need to really flesh out in terms of requirements that can eventually be translated into OKRs (maybe not for this quarter, but for the future).


Description: An ipfs user with multiple machines wants to run their ipfs node with better replication or availability guarantees. The user creates an ipfs cluster across machines. Adding content to ipfs automatically triggers a pin in the cluster according to some predetermined replication strategy. The user advertises multiaddresses from all the ipfs daemons as multiaddresses of the ipfs node mirror.

Thoughts: This use case, like others, requires some mechanism for automatic pinning upon adding to ipfs. Depending on the adding mechanism this might be straightforward functionality to add to cluster; for example, cluster's ipfs proxy add endpoint will probably eventually do this by default. However in some cases, such as if each machine writes to ipfs over the fuse interface, this would be more difficult. A mirrored node could potentially advertise itself as an ipfs node and advertise the cluster ipfs proxy endpoint addresses as its multiaddresses.

Collaborator:

"will probably eventually do this by default" -> "does this".

There was some talk of go-ipfs providing a websockets endpoint which could send events. This would provide a very nice way to be informed about pins/unpins in ipfs and do automatic mirroring.
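As a usage sketch of the proxy behaviour described above: adding through the cluster's ipfs proxy looks just like adding to a plain ipfs daemon, only the address changes. This assumes the proxy's default listen address (127.0.0.1:9095); the file name and content are made up.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"log"
	"mime/multipart"
	"net/http"
	"strings"
)

// addViaClusterProxy posts a file to the cluster ipfs proxy /api/v0/add
// endpoint, exactly as one would against a go-ipfs daemon. The proxy
// intercepts the request and pins the result in the cluster.
func addViaClusterProxy(proxyURL, name string, content io.Reader) (string, error) {
	var body bytes.Buffer
	mw := multipart.NewWriter(&body)
	part, err := mw.CreateFormFile("file", name)
	if err != nil {
		return "", err
	}
	if _, err := io.Copy(part, content); err != nil {
		return "", err
	}
	if err := mw.Close(); err != nil {
		return "", err
	}

	req, err := http.NewRequest("POST", proxyURL+"/api/v0/add", &body)
	if err != nil {
		return "", err
	}
	req.Header.Set("Content-Type", mw.FormDataContentType())

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	out, err := io.ReadAll(resp.Body)
	return string(out), err // same JSON the ipfs add endpoint returns
}

func main() {
	// 127.0.0.1:9095 is the default ipfs proxy address in the cluster config.
	out, err := addViaClusterProxy("http://127.0.0.1:9095", "hello.txt", strings.NewReader("hello cluster"))
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(out)
}
```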

Contributor:

Following up on an old thread -- do we do this now? Is this how Cluster works? -->

An IPFS user with multiple computers wants to make sure their content is always available on the IPFS network. That means they need to make sure their computer is always on and connected to the network, or that they've made copies on other computers so that if one goes down, there are still copies of their content on other hard drives on the network.

Before IPFS Cluster, this user would have needed to do [some really annoying thing]. Now, this user can use Cluster to create a connected set of locations to store their content. When the user adds content to IPFS, Cluster automatically creates copies across the user’s computers according to their settings. Perhaps this user is very worried about losing access to this content, and so sets Cluster to make 100 copies; perhaps the user is less worried about consistent access, and so sets Cluster to make only one other copy.

As computers come online and offline, Cluster also takes care of maintaining the user’s replication strategy: the user might always want 100 copies of their content available, but the exact computers storing this information might change. There will, however, always be 100 copies available. People are able to find content because Cluster is [able to do something nifty around mirroring and multiaddresses that I need to understand still].

Is this right? (I'm hoping to slowly get a handle on how to describe what Cluster does so we can make a more plain-language version of our intro/high-level docs.)


These are some WIP ipfs-cluster use case sketches. They are not formal use cases; more accurately, they are groups of related use cases that could be further decomposed into the more narrowly scoped operations found in formal use cases.

## ipfs node mirror

Collaborator:

We could add a field that specifies a list of things needed to support this use case, and which of those are already implemented. It does not need to be exhaustive, but it should give an overview.

Collaborator Author:

Great idea; this is something I can work on.

3. As a last example, imagine a cluster serving as the storage backend for potentially large queries over blockchains. For example, say you want to look for all transactions that spend outputs of transactions in block X. You could potentially use ipld selectors to query and pin all of the relevant hashes in an ipfs node, but what if there is too much data to reasonably fit on any one ipfs node's machine? A group of trusted nodes could be brought together as an ipfs-cluster to handle such queries and avoid running out of space for storing the results.

Thoughts:
1. To support storing data for miners we would probably need to examine ipfs-cluster's latency profile more seriously. The above description does not specify how to prevent the set of pinned cids from ballooning quickly. This use case would require some kind of sharding of the transaction set in a way that does not track every transaction's pin in the cluster shared state. Because new transactions would be added all the time, this is not a simple application of basic sharding, which does one import of a huge file into shards. ipfs-cluster could address this with a strategy for incrementally updating shard membership. The state blow-up could also be mitigated if the cluster recursively pinned larger subdags (to a certain depth), i.e. the hash of every X blocks. The security of go-ipfs would need to be vetted more thoroughly so that users could trust that hash lookups securely resolve to the correct data. In general this use case idea needs more domain-specific knowledge. Is blockchain storage currently a pain point for mining operations that ipfs and cluster could address? How do popular mining clients (bitcoin-core, geth) handle blockchain storage? Would ipfs integration using ipfs-cluster for storing large merkle dags be possible/useful/welcome for these clients? What are their requirements (latency, security models, integration with existing tools)?

Collaborator:

I don't think ipfs or cluster are ever going to beat local blockchain storage and in-mem caches, but I do wonder what's going to happen when a chain is "too big to store".

It's interesting that ipfs/cluster can nevertheless be used to distribute blockchain data to the rest of the network (new blocks, chain downloads, etc.). ipfs+ipld already supports ingesting bitcoin/eth blocks into ipfs. It would be awesome if the chain sync operation would just fetch stuff from ipfs into the chain database for a start.


2. For the second example, you could imagine that the block explorer is a dApp whose users run ipfs and cache blockchain data as they look it up, while the ipfs-cluster acts as a permanent store for slower lookups (a similar pattern to the use case below). If this becomes a serious use case we should investigate the pain points that currently exist for block explorer websites to get a better picture of how cluster would fit in.

Collaborator:

Yes!



3. This is a very undeveloped idea (ipld selectors don't exist yet, and I haven't seen anyone ask for this) based on my impression that quick, expressive merkle dag searches would be valuable. I should investigate work that exists along these lines (e.g., how do current block explorers do queries over merkle trees?), and how ipld selectors compare as a next step.

Collaborator:

yeah, to do something like <ethhead>/block/*/transactions/*/from/<addr>/value

- support for dynamic cluster membership (exists today, but potentially with some bugs under lots of churn)
- some kind of trust modeling support, potentially including associating permissions to operations and assigning permissions to peers. Could make use of the proposed [capabilities service](https://github.com/ipfs/notes/issues/274).
- support for byzantine consensus protocols sounds relevant
- support for updating many uncoordinated nodes

Collaborator:

The main problem here is the trust model. I can think of several ways of approaching this use case, but I always find problems with how to prevent random users from altering the whole cluster. I think with Raft this is limited to whoever controls the cluster leader node, but Raft doesn't scale for this.

  • Can think of adding authorization to RPC calls
  • With the above, can think of only allowing trusted nodes to become leaders (Raft supports this)
  • Can think of implementing this as an application wrapping cluster/ipfs, in which the user does not run the cluster peer but only the ipfs-daemon, and the administrator runs an associated cluster peer but under his control. This would scale with composite clusters.
  • Can think of replacing the consensus layer with pubsub for this use-case, and only obeying updates signed with certain keys (perhaps for a pinning ring, the fact that the state is fully consistent among all peers rather quickly is not super important) (the more I think about this the nicer it sounds). A rough sketch of this idea follows at the end of this comment.

In any case, it is always the trust model that worries me.
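A rough, purely illustrative sketch of the pubsub idea using go-libp2p-pubsub: peers subscribe to a pin topic and only apply updates whose author is in a fixed set of trusted peer ids, relying on libp2p pubsub message signing for authenticity. The topic name, the trusted peer ids, and the update format are made up; this is not a design proposal.

```go
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/libp2p/go-libp2p"
	pubsub "github.com/libp2p/go-libp2p-pubsub"
	"github.com/libp2p/go-libp2p/core/peer"
)

func main() {
	ctx := context.Background()

	h, err := libp2p.New()
	if err != nil {
		log.Fatal(err)
	}

	ps, err := pubsub.NewGossipSub(ctx, h)
	if err != nil {
		log.Fatal(err)
	}

	// Hypothetical topic carrying pin/unpin updates for the ring.
	topic, err := ps.Join("pinning-ring/pins")
	if err != nil {
		log.Fatal(err)
	}
	sub, err := topic.Subscribe()
	if err != nil {
		log.Fatal(err)
	}

	// Only updates authored by these peers are obeyed (made-up ids).
	trusted := map[peer.ID]bool{
		"12D3KooWExampleTrustedPeerA": true,
		"12D3KooWExampleTrustedPeerB": true,
	}

	for {
		msg, err := sub.Next(ctx)
		if err != nil {
			log.Fatal(err)
		}
		if !trusted[msg.GetFrom()] {
			continue // ignore updates from untrusted members of the ring
		}
		// In a real system msg.Data would be a structured pin/unpin update;
		// here we just print it.
		fmt.Printf("applying pin update from %s: %s\n", msg.GetFrom(), msg.Data)
	}
}
```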

Collaborator:

If I'm a bit bored one day I'm going to write an RFC about replacing Raft with Pubsub

Collaborator Author:

I will be super interested to read that RFC


Thoughts:
This is a particularly interesting use case as it requires significantly more new and unexplored functionality from cluster than any of the others here. As @hsanjuan mentioned in his original write-up, "The key here is to understand what the trust model is in a pinning ring, how members gain and lose trust, and who can take what actions". On a similar note, a byzantine consensus protocol may greatly help keep the ring working smoothly even when some peers misbehave. This use case also presents challenges regarding how to get nodes to update when they are managed by different individuals. The current approach to updating requires all nodes to be shut down at the same time, which may be unrealistic here.

Collaborator:

Oh lol, I now see I had already written down stuff about the trust model.

Description: An admin wishes to set up an ipfs node storing a mirror of ubuntu deb packages, to support an apt transport that downloads deb packages from ipfs. The admin wishes to download the packages and directory structure from one of the existing mirrors over http and store them in ipfs. The admin does not have a server with the 2TB of storage necessary to host the entire mirror and so cannot fit all packages on a single ipfs node. However, the admin does have access to a set of smaller servers (say 4 servers of 500GB) that together fulfill the total storage capacity of the mirror. The admin installs ipfs-cluster on each server and then commands the cluster to download the mirror data. During download the cluster allocates different pieces of the mirror directory to different machines, spreading load evenly. If there is extra space on the servers then replication of packages is a bonus, but this is not a primary concern for this use case. The cluster hosting the mirror can be assumed stable, with a fixed number of servers all run by a single admin or administrative body. After packages are added to the cluster, users will fetch them by path name from the root hash (QmAAA.../mirrorDir1/mirrorDir2/package.deb).
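To make the last step concrete, a user's machine could resolve a package by path through its local ipfs gateway (127.0.0.1:8080 is the go-ipfs default). A minimal sketch; the root hash and path below are placeholders mirroring the example path above, not a real mirror:

```go
package main

import (
	"io"
	"log"
	"net/http"
	"os"
)

func main() {
	// Placeholder root hash and package path; a real apt-over-ipfs transport
	// would be configured with the mirror's actual root CID.
	url := "http://127.0.0.1:8080/ipfs/QmAAAExampleMirrorRoot/mirrorDir1/mirrorDir2/package.deb"

	resp, err := http.Get(url)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		log.Fatalf("gateway returned %s", resp.Status)
	}

	// Save the package locally, as an apt transport would before installing it.
	out, err := os.Create("package.deb")
	if err != nil {
		log.Fatal(err)
	}
	defer out.Close()

	if _, err := io.Copy(out, resp.Body); err != nil {
		log.Fatal(err)
	}
}
```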

Implied ipfs-cluster requirements
- ipfs-cluster can handle importing, sharding, and distributing across the cluster a file too big for one node (see PR #268).

Collaborator:

To be exact, it's a tree too big for one node. Files are rather small in the apt repository. The problem of distributing a huge file among several peers and the problem of distributing a huge archive of small files among several peers can be approached slightly differently, even though we will probably fix them the same way (because they're large trees in the end).

Collaborator Author:

Got it, thanks for the clarification.

- ipfs-cluster provides configurable load balancing across ipfs-cluster nodes (the WAN cluster and LAN subclusters have different balancers) that can handle frequent changes to peersets
- ipfs-clusters are easy to join and leave without having to think too much about set-up or generating errors
- ipfs-cluster allows for retrieval of data across sub-clusters that are not necessarily connected, except as two subtrees of the same larger cluster
- ipfs-cluster allows individual peers to specify resource constraints

Collaborator:

This is closely related to the files api and unixfs/fuse, right? It's like mounting the ipfs fuse filesystem in /home and using cluster to make sure that its contents are propagated to all other machines/users.

@ZenGround0 (Collaborator Author) commented Jan 4, 2018:

I agree; when I pass through and list the features that cluster requires, I'll add things along these lines.


Designing ipfs-cluster to work this way has some benefits, including a familiar api for user interaction, the ability to use a cluster anywhere an ipfs node is used, and the ability to make ipfs-clusters depend on other ipfs-clusters as the ipfs nodes that they coordinate. This last property has the potential to make scaling ipfs-cluster easier; if large groups of participants can be abstracted away, consensus peer group size can remain bounded as cluster participants grow arbitrarily. It is not always the case that an ipfs api is the best user interface for adding files to ipfs-cluster. Ipfs-cluster's support for per-pin replication configuration, for example the current feature that pins can specify different replication factors, has no direct analogue among the characteristics of an ipfs node exposed over the ipfs api. As the ipfs api has no endpoint to encode this information, some kind of cluster-specific interface is often useful, for example the current cluster `pin` command that allows setting replication factors.
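As an illustration of such a cluster-specific interface, a pin request against the cluster REST API can carry replication factors that the plain ipfs api cannot express. This sketch assumes the default REST API address (127.0.0.1:9094) and the `replication-min`/`replication-max` query parameter names; the CID is a placeholder and the parameter names should be checked against the targeted ipfs-cluster version.

```go
package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
	"net/url"
)

func main() {
	cid := "QmExamplePlaceholderCid" // placeholder, not a real hash

	// Default cluster REST API address; the query parameter names are assumed
	// from the current ipfs-cluster REST API and may differ between versions.
	q := url.Values{}
	q.Set("replication-min", "2")
	q.Set("replication-max", "3")
	pinURL := fmt.Sprintf("http://127.0.0.1:9094/pins/%s?%s", cid, q.Encode())

	req, err := http.NewRequest("POST", pinURL, nil)
	if err != nil {
		log.Fatal(err)
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	body, _ := io.ReadAll(resp.Body)
	fmt.Printf("%s: %s\n", resp.Status, body)
}
```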

Though an ipfs vNode api IS partially implemented in ipfs-cluster, to the point that ipfs-clusters can be composed, only the subset of the vNode api that ipfs-cluster needs to call in order to function has received attention. There has been some discussion, in "Other open questions" of https://github.com/hsanjuan/ipfsclusterspec/blob/master/README.md, about the difficulties involved in implementing a full ipfs vNode interface, and about framing the vNode interface as a separate concern from the ipfs-cluster project's primary goal of coordinating ipfs nodes. Today, emphasis on implementing the vNode interface exists only to the extent that it enables composition of ipfs-clusters; further work may be revisited later.

Collaborator:

Perhaps reference the composite cluster use cases PR. I kind of wanted to remove https://github.com/hsanjuan/ipfsclusterspec/blob/master/README.md because it's old and lacks context (I was asked to write it when I had little experience with ipfs/libp2p).

@hsanjuan (Collaborator):

@meiqimichelle I would like to close this. Can you check if we need to absorb any information somewhere else?

@meiqimichelle (Contributor):

I will check!
