
Encrypt and compress read data #8

Open
morungos opened this issue May 29, 2014 · 2 comments

@morungos (Member)

Currently, both the pipeline and the webapp store and access reads at the record level. That is convenient for fine-grained access, but not ideal. We really should move to a bucketed/compressed/encrypted model, with (say) packets of 5k reads compressed and encrypted together.

If we keep the packets relatively small, there won't be a huge penalty for accessing a single read. There may even be a performance improvement, since we would reduce disk usage, I/O, and index sizes.

This issue affects both the pipeline and the webapp: the pipeline writes the data and the webapp reads it, so the Python and Java sides need to agree on the data storage and compression scheme. See: capsid/capsid-webapp#63
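
A minimal sketch of what a packet-level scheme could look like on the Python side, assuming zlib (deflate) for compression and Fernet from the `cryptography` package for symmetric encryption; the packet size, record serialization, and key handling here are placeholders, and the real format would need to be agreed with the Java webapp:

```python
# Sketch only: packet-level compression + encryption of reads.
# Assumes reads are serialized as newline-delimited strings; the record
# format, packet size, and key management are all still to be decided.
import zlib
from cryptography.fernet import Fernet  # symmetric encryption, for illustration

PACKET_SIZE = 5000  # reads per packet, per the proposal above

def pack_reads(reads, key):
    """Compress and encrypt a list of read records into opaque packets."""
    f = Fernet(key)
    for i in range(0, len(reads), PACKET_SIZE):
        packet = "\n".join(reads[i:i + PACKET_SIZE]).encode("utf-8")
        yield f.encrypt(zlib.compress(packet))

def unpack_packet(blob, key):
    """Decrypt and decompress one packet back into read records."""
    f = Fernet(key)
    return zlib.decompress(f.decrypt(blob)).decode("utf-8").split("\n")

# Hypothetical usage:
# key = Fernet.generate_key()
# packets = list(pack_reads(all_reads, key))
# first_packet_reads = unpack_packet(packets[0], key)
```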

@morungos (Member Author)

Probably we need to use something as standard as deflate. We also care about speed almost as much as compression ratio, which is why higher-ratio techniques such as LZMA aren't really an option. This posting covers a similar issue, getting deflate to interoperate between C# and Python: http://stackoverflow.com/questions/1089662/python-inflate-and-deflate-implementations
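
For reference, a quick sketch of the Python half of a deflate round trip. Python's `zlib.compress` and Java's `java.util.zip.Deflater`/`Inflater` both default to the zlib-wrapped deflate stream, so in principle they interoperate directly, modulo the header/wbits issues discussed in the linked posting; the compression level is the main speed-versus-ratio knob:

```python
# Python side of a deflate round trip; java.util.zip.Inflater with its
# default (zlib-wrapped) format should be able to read this stream.
import zlib

def deflate(data: bytes, level: int = 6) -> bytes:
    # level 1 favours speed, 9 favours ratio; 6 is the zlib default
    return zlib.compress(data, level)

def inflate(blob: bytes) -> bytes:
    return zlib.decompress(blob)

assert inflate(deflate(b"ACGT" * 1000, level=1)) == b"ACGT" * 1000
```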

@morungos (Member Author)

morungos commented Jun 2, 2014

This is going to take a little work, because right now we just dump everything into the database. We can't really do that any more, so the process ought to change to work with files more, and then load those files into blocks in the database which can be encrypted and compressed. The basic idea is the same: there will be files associated with (a) a genome and (b) a set of owners (project, align, etc.), and (c) indexed by start position in blocks of a decent size, say 30-50K, which can be quickly decrypted/decompressed. All the reading will happen in the webapp, but the pipeline needs to write out the storage. GridFS will handle most of it, once we know what we are writing.
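
A rough sketch of how the pipeline might write such blocks into GridFS via pymongo; the database name and metadata field names (genome, owners, start) are made up here for illustration, and the block bytes are assumed to already be compressed/encrypted:

```python
# Sketch: writing opaque (compressed/encrypted) blocks to GridFS via pymongo.
# The database name and metadata fields are illustrative only.
import gridfs
from pymongo import MongoClient

client = MongoClient()
db = client["capsid"]      # database name assumed for illustration
fs = gridfs.GridFS(db)

def store_block(block_bytes, genome, owners, start):
    """Store one block, tagged with its genome, owner set, and start position."""
    return fs.put(
        block_bytes,
        metadata={"genome": genome, "owners": owners, "start": start},
    )

def load_block(file_id):
    """Read one block back for the webapp side to decrypt/decompress."""
    return fs.get(file_id).read()
```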

The issue is that the pipeline can currently afford to just dump reads out as it goes; we can't do that any more. We need to build up blocks we can handle. Note that we do not need to make the block start offsets consistent and sequentially spaced; we can make them a uniform read count, or a uniform block size, if we like. That does mean, however, that we need the reads sorted by start position by the time we get them. The owner question is less of a problem: we combine multiple owners into a single DB file field. Annoyingly, we do this using an update process, so it's not trivial to manage it all sequentially.
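
A small sketch of the uniform-read-count grouping, assuming the reads arrive already sorted by start position and (purely for illustration) that each read is a dict with a `start` key; each block's first start position is what the index would record:

```python
# Sketch: group reads (already sorted by start) into uniform-count blocks,
# recording each block's first start position for the index.
READS_PER_BLOCK = 5000  # uniform read count rather than a uniform genomic span

def build_blocks(sorted_reads):
    """Yield (block_start, reads) tuples from reads sorted by start position."""
    block = []
    for read in sorted_reads:
        block.append(read)
        if len(block) == READS_PER_BLOCK:
            yield block[0]["start"], block
            block = []
    if block:  # flush the final, possibly short, block
        yield block[0]["start"], block
```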
