Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Blog Post: DHT #244

Open
2 tasks
pstan26 opened this issue Mar 16, 2017 · 3 comments
Open
2 tasks

Blog Post: DHT #244

pstan26 opened this issue Mar 16, 2017 · 3 comments
Assignees

Comments

@pstan26
Copy link
Contributor

pstan26 commented Mar 16, 2017

Sounds like @jcnelson is 60% done -- go Jude!

Once we have it completed, let's plan to:

  • get public reviews from P2P experts (Frans etc)
  • Request comment from Mike Freedman, potentially Petar, and others

Goal here is to create one post to clarify fundamental limitations of DHTs. We need to get reviews and request some lines be made public. There are many decentralized storage systems looking for a solution -- the fundamental DHT issues apply to any project using them.

Maybe we can add bullets on the post here or link to the blog post, @jcnelson ? Is it reasonable to aim for April 10th release?

@jcnelson
Copy link
Member

jcnelson commented Mar 16, 2017

Here's what I have so far:


One of the more interesting innovations in Blockstack is in
how it distributes the set of zone files to all peers.
Like several other decentralized Web infrastructure projects, the original system used a DHT
(distributed hash table). This was acceptable for building a prototype implementation while the network
was small and mostly-reliable. However, this would have been unacceptable in the long-term due to the
stability and security problems inherent to DHTs. Since November 2016, we have been distributing zone
files using a decentralized storage system built from first principles called the Atlas Network.

DHTs: Unsafe at any Speed

(This was previously discussed here,
and in the middle of this blog post).

DHTs came into being in the 1990's and 2000's, and were among the first
scalable decentralized data stores. The most widely-used DHTs power file-sharing
software like BitTorrent and eMule. These are based on the Kademlia
DHT
, and enjoy widespread usage today.

Easy to Misuse

Since the mid 2000's, the distributed systems research community has learned several
hard but important lessons about the usability of DHTs. In particular, DHTs are
only useful if all of the following are true for your application:

  • You can't store a 100% replica. Your application stores so much data that
    it is unreasonable to try to keep a full copy on any single node.

  • You will tolerate arbitrarily slow I/O. Due to the way DHTs handle
    requests, reads and writes have very variable latency. Some incrmenetal
    work on DHTs (like
    DSHTs) try to
    fix this for frequently-requested data, but the long-tail performance is still
    terrible. This is inherent to DHT routing protocols; a pathological
    request can needlessly bounce through several high-latency network links
    before being handled.

  • You will periodically re-announce data. Public-access DHTs allow
    anyone to write to them. To deal with so much data, they simply delete stale
    data to relieve memory pressure. At the same time, nodes can come and go as
    they please, which can cause key/value pairs to get dropped. To deal with this,
    applications need to re-replicate data every so often to keep it available.

  • You will never change a key/value pair. DHTs can split into one or more disjoint
    networks due to partitions, and re-join later on. While split, a client can
    write conflicting key/value pairs to nodes on either side of the partition.
    This leads to externally-visible inconsistency--some clients will see one value
    for the key, and other clients will see a different value. To avoid this, the
    application must ensure that there is at most one value per key (i.e. data is
    immutable once written).

These are difficult demands to meet. Each application process effectively needs
to be online 24/7 to ensure its data is available, and even then the developer needs to
adress data consistency and I/O performance via some other way.

The reason BitTorrent and eMule can get
away with using DHTs is because (1) they use them only for "soft state," and (2)
they do not rely on them for correctness. Specifically, they use a DHT to
store tracking and routing information for a particular file.
This information is small by comparison to the
file, and can be re-created quickly by any peer. There are no consistency
concerns since since the .torrent file contains all the chunk hashes, and
there are no negative consequences of
receiving invalid tracking data since the hashes are known to all peers
beforehand. Peers also do not share data via the DHT, but connect to one
another directly.

Hard to Secure

The difficulties do not end there, though. DHTs are still vulnerable to three
types of attacks that can harm even the most resilient DHT-powered application.

  • Endless Data DDoS. Without some rate-limiting or access-control mechanism, a DHT has no way
    to limit the amount of data inserted. An adversary can flood the DHT with lots
    of garbage data and knock nodes offline. (This never applied to Blockstack,
    since our DHT only accepted key/value pairs whose hashes were written to the
    Bitcoin blockchain).

  • Hash-Bucket Takeover. By joining with the right nodes in a DHT, a Sybil
    node can take over responsibility for specific hash buckets in the network. In
    doing so, the Sybil will both be able to see who's asking for the data (a
    privacy concern), and will be able to censor the data by ignoring or slowing
    down requests (an availability concern). A Sybil node can also do extensive
    damage to a running DHT by repeatedly joining it, taking over heavily-requested
    buckets, and then dropping them.

  • Route-table Takeover. Suppose that the DHT let the developer "pin"
    hash buckets to honest nodes to thwart hash-bucket
    takeovers. This does not fix the availability problem, since a Sybil can insert itself
    as the immediate neighbors of the honest node and prevent the rest of the
    system from discovering it. This works because DHT nodes do not have a global
    view of which nodes have which buckets. Instead, each node keeps a short list of
    which nodes host buckets that are "close to" its buckets in the key space.
    If the Sybil node can become an honest node's neighbors, then it can prevent
    other nodes from discovering the honest node's data.

@muneeb-ali
Copy link
Member

Thanks for posting Jude! I have a bunch of (private) notes on this and can take a crack at expanding the draft.

@jackzampolin
Copy link
Contributor

Doesn't look like this made it into a blog post. @jcnelson @pstan26 Would this still be useful?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants