
Integrate Automatic Indexing into DB #1

Closed
2 tasks done
CMCDragonkai opened this issue Sep 26, 2021 · 18 comments
Labels
design Requires design (architecture, protocol, specification and task list requires further work) development Standard development enhancement New feature or request epic Big issue with multiple subissues r&d:polykey:core activity 1 Secret Vault Sharing and Secret History Management r&d:polykey:core activity 3 Peer to Peer Federated Hierarchy research Requires research

Comments

@CMCDragonkai
Member

CMCDragonkai commented Sep 26, 2021

Specification

Indexing is the act of creating a separate data structure that makes it faster to look up data on a primary data structure.

PK has several domains that require secondary indexing:

  • acl - perm id, vault id, node id
  • gestalts - discovery, adjacency list indexing
  • nodes - possibly needed in the future for all sorts of optimal routing needs (Kademlia indexing?)
  • notifications - to be able to filter notifications
  • sigchain - to be able to filter sigchain
  • vaults - for vault tagging and vault names

Right now secondary indexing is implemented manually in ACL by persisting bidirectional maps in leveldb.
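
To make the current state concrete, the manual approach amounts to something like this (a hedged sketch with hypothetical domain names, using the db.put(domain, key, value) form discussed later in this thread, not the actual ACL code):

// Manual bidirectional map: two records per relationship,
// one sublevel per direction (domain names are hypothetical)
await db.put(['acl', 'nodePerms'], nodeId, permId); // nodeId -> permId
await db.put(['acl', 'permNodes'], permId, nodeIds); // permId -> node ids
// Lookups in either direction are a single get, but both records
// must be updated together to stay consistent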

Indexing is a tricky topic, and it's better to create a single indexing mechanism in the DB so that all domains in PK can benefit from it.

Generic indexing can rely on LevelDB's eventemitter interface. The good thing is that other people have already created good libraries for this:

  • level-auto-index
  • level-idx

The second library, level-idx, automates some of the setup for level-auto-index. In fact it binds secondary index interfaces to the leveldb object.

For @matrixai/db it makes more sense to use level-auto-index so we can more tightly control how the indexes are stored. Most likely via sublevel interfaces.

Tasks

  1. [x] Investigate possible dependencies to automate secondary indexing and their relationships
  2. [x] Prototype the usage of level-auto-index
  3. [ ] Integrate level-auto-index into this library's domain/level creation; each level, including the root level, should be able to attach one or more secondary indexes
@CMCDragonkai
Member Author

Was thinking that this secondary index is by default unique.

However non-unique keys can be done by making use of the sequence range.
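
A minimal sketch of the non-unique case, assuming a levelup-style sublevel (the helper names here are hypothetical): each index entry key suffixes the indexed value with the primary key, so duplicates occupy a contiguous key range that can be scanned.

// Non-unique index entry: <value> 0x00 <primaryKey> -> primaryKey
const sep = Buffer.from([0x00]);

async function indexPut(indexDb: any, value: Buffer, primaryKey: Buffer): Promise<void> {
  await indexDb.put(Buffer.concat([value, sep, primaryKey]), primaryKey);
}

async function indexGetAll(indexDb: any, value: Buffer): Promise<Array<Buffer>> {
  const primaryKeys: Array<Buffer> = [];
  // All entries for `value` fall in the range [value 0x00, value 0x01)
  for await (const { value: pk } of indexDb.createReadStream({
    gte: Buffer.concat([value, sep]),
    lt: Buffer.concat([value, Buffer.from([0x01])]),
  })) {
    primaryKeys.push(pk);
  }
  return primaryKeys;
}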

@CMCDragonkai
Member Author

level-idx automates level-auto-index by expecting a single sublevel to be used as the index database. That sublevel is then further split into sublevels for each index that is created.

So for a structure like this:

ID -> {
  name: string;
  title: string;
}

The ID is the primary index, thus we don't need to index that. The name and title would be 2 separate secondary indexes. These would each occupy a sublevel under an index sublevel.

If we want to apply this to each sublevel we already have with this.level(domain: string, dbLevel: DBLevel): Promise<DBLevel>, we may have an index sublevel for each sublevel we create, or a global index sublevel for all possible indexes.

The name and title become names of the index sublevels.

If we imagine there are:

aLevel

ID -> {
  name: string;
  content: string;
}

bLevel

ID -> {
  name: string;
  signature: string;
}

You can see that the name index would be conflicting. If we wanted to index the name for aLevel and bLevel, we would need to separate them out.

We could do this by making use of our domain array.
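
For example, reusing the existing db.level calls (sublevel names are illustrative):

// Each domain gets its own index sublevel, so the two name indexes
// live under distinct prefixes and cannot collide
const aLevel = await db.level('a');
const aIndexLevel = await db.level('index', aLevel);
const aNameIndex = await db.level('name', aIndexLevel); // a / index / name

const bLevel = await db.level('b');
const bIndexLevel = await db.level('index', bLevel);
const bNameIndex = await db.level('name', bIndexLevel); // b / index / name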

@CMCDragonkai
Member Author

Every sublevel created from DB.level can have an associated indexes sublevel. It can be automatically named by taking the domain string and appending domain.index. The . separator has to be different from the prefix separator in order not to confuse it with another sublevel. Perhaps domain-index to avoid any confusion, since I often use . to represent separators too.

This indexes sublevel is a sublevel containing all the index sublevels for respective primary db.

Not every sublevel requires an index though, so this can be lazily initialised.

On the other hand, creating a sublevel doesn't actually do anything to the database unless the sublevel is actually used. It can simplify things right now if it is just constructed.

The DB currently does not maintain any of the sublevel references. These are just given to the user to make use of. See that DB.level returns a DBLevel. It is the user and all the PK domains that currently maintain this. If we keep this same pattern, then the PK domains would have to keep around the indexes. It would then be the only way for users to be able to query with respect to an index.

Having the associated indexes sublevel is one thing, it is another to keep around the index itself. The index sublevel is in fact an implementation detail. It would be ideal for the indexes sublevel to not need to be maintained by the user. The user wants to use the index sublevel instead.

So here's a possible API:

const mgrDomain: DBDomain = ['Manager'];
const dirsDomain: DBDomain = [mgrDomain[0], 'dirs'];

const mgrDb = await db.level(mgrDomain[0]);
const dirsDb = await db.level(dirsDomain[1], mgrDb);

// this is still a DB level, but now used as an index
const mgrDbIndex = await db.index(mgrDb, 'indexname', ['name', 'title']);

// remember we can use the dbs themselves, but we don't benefit from automatic serialisation/deserialisation and encryption/decryption

// so if we want to "get" but according to an index, we now have to refer to the index domain
const indexDomain = ['Manager', 'indexname'];

// note that when you are creating a compound index, you now have to use a separator to indicate this
await db.get(indexDomain, 'somename!sometitle');

Remember you don't actually need to create a sublevel to use get, put... etc. The domain parameter basically ends up creating sublevels on the fly.

This means when you first create an index, and then start using get and put, it should work fine, as the docs indicate that it automatically tracks these operations. However what happens if you already have data in the db, and you create the index afterwards? We can enforce the idea that the index should be created ahead of time, but it would be ideal if AutoIndex had a way of reconstructing the index from scratch if there was already data in the primary sublevel, as in the sketch below.
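
Such a reindexing pass could look roughly like this (a hedged sketch: rebuildIndex and extract are hypothetical, and in our DB the streamed values would first need to be decrypted and deserialised):

// Rebuild a unique index by streaming the existing primary sublevel
async function rebuildIndex(
  primaryDb: any,
  indexDb: any,
  extract: (value: any) => Buffer, // pulls the indexed property from a record
): Promise<void> {
  for await (const { key, value } of primaryDb.createReadStream()) {
    // Unique index: indexed value -> primary key
    await indexDb.put(extract(value), key);
  }
}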

Another issue is the fact that the DB is now always using buffer encoding. So that means the index dbs should also always use buffer encoding for values. But the encoding of the keys themselves depends on the index.

A further problem is whether the indexes go through encryption/decryption. If they do not use the DB.get/put mechanisms, then the index values are not encrypted.

@CMCDragonkai
Member Author

CMCDragonkai commented Sep 27, 2021

Ok so there's a problem. Because the values are encrypted before being stored in the DB, the hooks, which rely on the event triggers, only have access to the encrypted ciphertext. Thus using level-auto-index is not possible, because our encryption/decryption sits on top of leveldb, and when the autoindex is listening on events, it is getting encrypted values.

Ok so if we want to do this, we would need to synthesise and extract the implementation into our own DB here rather than just using the upstream libraries.

@CMCDragonkai
Member Author

CMCDragonkai commented Sep 27, 2021

One issue with encrypting indexes, is that the index sublevel key is based on the primary sublevel value.

So if the primary sublevel's name value is "sam", would this mean that the index sublevel key is "sam"? If so, this would be leaking the value itself.

If we were to encrypt the index keys, this ends up with some problems...

  • Encryption would have to be deterministic. You are given sam and you must always get back the same ciphertext.
  • Sort order of the index keys no longer makes sense.

Here's an alternative. An order-preserving hash. If the index keys can go through an order-preserving hash, then it should still work nicely preserving order, and you can deterministically hash any value you want to look up and then find the "encrypted" ID and decrypt it. It may even be possible to just hash the ID instead of encrypting it. The values of the index sublevels are just keys of the primary sublevels, and keys of primary sublevels are not encrypted anyway.

It makes sense to use order preserving hash on index keys because:

  1. You don't want to leak the plaintext value of the primary level
  2. You want to preserve the order of the value, so that range keys could be used, and iterating over the index key should be the same as if iterating over the value
  3. Can be applied to compound indexes, by concatenating 2 or more order preserving hashes

Does it make sense to encrypt/decrypt the index values? - No (keep them plaintext)

  1. The index values are just the keys of the primary level, the keys of the primary level are not encrypted
  2. Therefore, you should not need to encrypt the index values, or hash the index values either

@CMCDragonkai
Member Author

What kind of indexes do we want to support?

  1. Unique value indexes. I'm not sure how the index works when multiple entries have the same value; does it figure out that the same index entry can point to multiple keys?
  2. Range indexes - need an iterator that counts it
  3. Compound indexes - requires separator...

This gets pretty complicated. The most likely usage right now is just unique value keys, and possibly non-unique ones. If we end up with too many complicated indexes, we might as well change to using SQLite instead.

Going to prototype this with level-auto-index to see how it deals with non-unique value indexes.

@CMCDragonkai added the design label and removed the development label Sep 27, 2021
@CMCDragonkai
Member Author

Order preserving hashing is not useful because you can recover the original value. https://crypto.stackexchange.com/questions/8160/secure-order-preserving-hash-function

I'm not sure if this applies to minimal perfect hashes. But it seems there is some leak.

There are new algorithms for order-preserving encryption, but there are no libraries in JS, and it would be too costly to implement for this feature.

However if we were to not use indexing, we would probably have to create bidirectional maps and end up not encrypting the key anyway. Therefore having an unencrypted index isn't any worse.

The main thing then is that anything we index is going to be leaked and thus indexing should be limited to values we are willing to leak like ID values. Things that are not meant to be leaked should not be indexed then.

So for now we can go with unencrypted & unhashed index keys. But index values are going to be encrypted as normal. A big fat warning is required to ensure Devs know that any indexed values are going to be plaintext.
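
In other words, roughly this (domain and variable names are hypothetical):

// Index keys stay plaintext; index values still go through the DB's
// normal put, so they are encrypted at rest like any other record
const indexDomain = ['Manager', 'byName'];
// WARNING: nameValue is persisted as a plaintext key; only index
// values that we are willing to leak
await db.put(indexDomain, nameValue, primaryKey);
const primaryKeyFound = await db.get(indexDomain, nameValue);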

The best solution would be where the leveldb block system itself had encryption. This would mean all keys and values are encrypted, and encryption is happening at the lower level. However this is not feasible to change to atm.

@CMCDragonkai
Member Author

CMCDragonkai commented Sep 29, 2021

It appears there's a relationship between proper transaction support, block level encryption and secure indexing.

In EFS we have an issue for integrating native snapshots into the transaction system. This requires going into the C++ binding of leveldb and making it work. Then our transactions can use leveldb snapshots natively. And in such a case our transaction could scale too.

Block level encryption means encryption at the blocks of the DB rather than at record level. RocksDB has this capability (facebook/rocksdb#2424), but it would require changing to use rocksdb, which is possible because it fits the levelup API; however any encryption mechanism would be done at the C++ level, and would not involve webcrypto APIs or primitives.

Indexing as in this issue is currently not entirely secure, or is very limited in order to not leak values. It would work better if block encryption were available. Then we would not need to work out complicated OPE.

All 3 problems can be resolved by using sqlite3 and the sqlcipher extension assuming that it does block level encryption. However cross platform requirements need analysis. This also impacts the common crypto mechanisms used by PK.

Whether it is leveldb, rocksdb or sqlite, any robust situation will require calling into C++ and manipulating the C++ or C code. Although I wonder if the sqlitejs ports can use wasm and whether sqlcipher can be put through wasm.

@CMCDragonkai
Member Author

For now we will stick with simple indexing, and the current transaction support is sufficient.

When the combination of the 3 problems becomes worthwhile, then we should investigate solutions for this DB to have encryption at the block level, and then have proper transactions and proper indexing support. The first path would likely be investigating the usage of SQLCipher. It's not an extension, it's a fork. It would need to be usable on PK CLI and PK GUI, and for Android & iOS it looks like it would need to be compiled for those platforms. This would be a large epic, as it would involve multiple subissues to resolve. The encryption work would also need to involve consideration of WASM MatrixAI/Polykey#155 and QUIC MatrixAI/Polykey#234.

Basically for our next phase of work, there's going to be a lot of C/C++ involved.

@joshuakarp

@CMCDragonkai
Member Author

Since MatrixAI/js-id#1 is done, this means we now have the ability to use IdDeterministic for doing any kind of hashing for hash-based indexing.
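
As a stand-in sketch of hash-based index keys (this uses Node's crypto directly; treat it as an assumption about the shape of the approach rather than js-id's actual API):

import { createHash } from 'crypto';

// Deterministic: the same plaintext always yields the same digest, so
// point lookups still work; lexicographic order is not preserved though
function hashIndexKey(value: Buffer): Buffer {
  return createHash('sha256').update(value).digest();
}

await db.put(indexDomain, hashIndexKey(nameValue), primaryKey);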

@CMCDragonkai
Member Author

CMCDragonkai commented Oct 19, 2021

2 ways we can implement this.

1. Embedding into LevelDB Extension Points (hooks, events, abstract-leveldown wrappers): By trying to bring encryption into the leveldb via hooking into its event system, and thus getting a good idea of how we can apply indexing just before the encryption.

2. Building on top of DB here: By doing it above the existing encryption in our DB class.

The former requires diving into the leveldb internals; the latter might result in some code duplication and a recreation of an event system. The former doubles down on the leveldb investment; the latter may be lighter weight and enable us to move to sqlite3 or other systems in the future.

Additionally, in order to represent an index that can point to multiple keys, a multi-value index that is, we can make use of sublevels again. Basically every time we index a value, that value doesn't directly point to the primary key, but instead to another sublevel of primary keys. The values of these can be arbitrary or contain metadata relating to each record. But one can now use the order of the primary key, so you can get the "latest" value for a given index. See the sketch below.
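
Roughly like this (domains and the metadata value are hypothetical):

// Multi-value index: each indexed value owns its own sublevel of
// primary keys, so one value can point at many records
const nameIndexDomain = ['Manager', 'index', 'name'];
// Index a record whose name is 'sam'; the stored value is free to
// carry per-record metadata (or be empty)
await db.put([...nameIndexDomain, 'sam'], primaryKey, metadata);
// The keys of the 'sam' sublevel are now all matching primary keys,
// in primary-key order, so the "latest" record sits at one end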

@CMCDragonkai
Member Author

CMCDragonkai commented Feb 16, 2022

Compound indexing is useful for our Discovery queue to represent priorities. MatrixAI/Polykey#329 (comment)

@CMCDragonkai
Member Author

CMCDragonkai commented Feb 16, 2022

Quoting the 2 ways we can implement this from the comment above:

1. Embedding into LevelDB extension points (hooks, events, abstract-leveldown wrappers).

2. Building on top of DB here, above the existing encryption in our DB class.

Way 1. cannot be done unless #4 and #5 are solved first. As these are long and complicated problems, we'll park these until later; it will ultimately involve looking into the C++ codebase of leveldb and making changes there, or in the level nodejs wrapper code.

Way 2. will be what we have to do to proceed for now. To be as foolproof as possible, we should wrap our sublevels as a DB type as well.

@CMCDragonkai
Member Author

In MatrixAI/Polykey#244 we are storing each node entry in buckets. The buckets are ordered due to the lexi-int encoding of the bucket index; however the node ids themselves need to be indexed by the lastUpdated timestamp. This is to allow us to do fetches and streams in the order of last updated, and to avoid having to sort the bucket every time we acquire it.
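
A hedged sketch of such a sub-index key (helper and domain names are hypothetical; lexi is the lexicographic-integer encoding mentioned above):

import lexi from 'lexicographic-integer';

// Compound key: lexi-packed timestamp first, node id suffix for
// uniqueness; iterating the sublevel yields node ids oldest-to-newest
function lastUpdatedKey(lastUpdated: number, nodeId: Buffer): Buffer {
  return Buffer.concat([
    Buffer.from(lexi.pack(lastUpdated, 'hex'), 'hex'),
    nodeId,
  ]);
}

await db.put(
  ['buckets', bucketIndex, 'lastUpdated'],
  lastUpdatedKey(lastUpdated, nodeId),
  nodeId,
);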

@CMCDragonkai
Member Author

With the merge of #10, this can now use a special root level called index to store all relevant index level data to avoid conflicting with normal data usage.

@CMCDragonkai
Member Author

Automatic indexing is made easier with snapshot isolation, because we can expect consistent reads, meaning while traversing an index, we can look up what the index points to easily.

@CMCDragonkai
Member Author

CMCDragonkai commented Jun 12, 2022

Since we're on rocksdb now, we can make use of some prior art for secondary indexing:

@CMCDragonkai
Member Author

CMCDragonkai commented Nov 6, 2022

There are too many ways we use indexing for it to be automated unless we first have a structured schema. So we won't need this now, since we have already done all the manual indexing work.

@CMCDragonkai closed this as not planned Nov 6, 2022