Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

phf memory-mapped files? #103

Open
joshtriplett opened this issue Jan 27, 2017 · 8 comments
Open

phf memory-mapped files? #103

joshtriplett opened this issue Jan 27, 2017 · 8 comments

Comments

@joshtriplett
Copy link

I'd love to have phf generate at compile-time a file that it can then memory-map at runtime and use as an efficient perfect hash. That would support programs that want to efficiently store and check a large set of words, but want to allow updating that set at runtime without recompiling the program.

The usage in the build.rs script would look similar, except that instead of writing out a .rs file, it would write out a data file. phf (or some crate depending on it) could then provide a function to open that file, which would memory-map the file and use it as a perfect hash.

Does this seem reasonable?

@sfackler
Copy link
Collaborator

Yep, that definitely sounds reasonable!

@idubrov
Copy link

idubrov commented Feb 4, 2018

Do you think this would be a reasonable way to reduce compilation times for large maps (and help IDE by not showing it gigantic source file)?

@joshtriplett
Copy link
Author

@idubrov It wouldn't reduce the time to compile the map the first time, but in theory you could cache that map and reuse it as long as the data remains unchanged. And it'd avoid the time to run the result through rustc.

@newpavlov
Copy link

I would love to see it done as well! Currently I generate so files which I then load at runtime depending on the set which I want to search. It works, but going through FFI is... meh.

@abonander
Copy link
Collaborator

abonander commented Jul 5, 2019

This proposal is interesting and novel but there's a lot of questions to answer. Namely, if we should just take a pointer to a memory map and interpret it as a &phf::Map or whatever, because this poses a few problems to conquer:

  • How do we ensure we're interpreting the memory map correctly?
    • Do we stabilize the representation of phf::Map et al?
    • Do we just encode a version number with the memory map and return an error if it's incompatible?
    • What about composite types defined by the user or other crates?
    • How do we make sure the type parameters we interpret the map with are the same as the ones the map was created with?
    • Or, do we just make this an unsafe operation and lay this whole burden on the user?
  • How do we support dynamic-sized types like strings and byte slices? Some sort of wrapper type that uses a relative offset instead of a memory address?
  • How do we prevent concurrent modification of the mmapped file on Unix? File locking is only advisory. We could mark the file as read-only on creation, but we can't prevent someone from changing it back unless we run under a different user.

@novacrazy
Copy link

I'm by no means an expert on this, but my first instinct would be to design a protocol similar to Cap'n Proto, so packed and flattened in-place. Perhaps require a trait for key/value types that allows it to be packed and stored, somewhat like serde, which can be implemented on std types (including strings, vectors, etc.), and derived for user types. Maybe take inspiration from how serde handles zero-copy parsing so it could handle DSTs, too. However, packing the data too tightly may bring up issues with alignment...

The "schema" in phf's case may be simple and 1D enough to skip most of the complexity of things.

For versioning and type validation, it may be worth looking into how Cap'n Proto handles that, as well. Perhaps using the aforementioned trait to define some consistent type ids.

As for preventing concurrent modifications... I'm not sure there is anything that can be done with that on the level of phf. That would be up to the deployment systems, I think.

@abonander
Copy link
Collaborator

abonander commented Jul 5, 2019

Theoretically, we could just define CapnProto versions of the phf collections; CapnProto supports generics and in-memory access and capnproto-rust supports a handful of stdlib types. It'd be an interesting experiment.

@saona-raimundo
Copy link

Maybe I am totally wrong, but is the following an option?
A binary that works as a server answering queries to the map.
Other programs can start the server and interact with it through stdin/out.
Is there a lot of overhead with this approach for the use cases discussed?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

8 participants