phf memory-mapped files? #103

joshtriplett · 2017-01-27T06:00:16Z

I'd love to have phf generate at compile-time a file that it can then memory-map at runtime and use as an efficient perfect hash. That would support programs that want to efficiently store and check a large set of words, but want to allow updating that set at runtime without recompiling the program.

The usage in the build.rs script would look similar, except that instead of writing out a .rs file, it would write out a data file. phf (or some crate depending on it) could then provide a function to open that file, which would memory-map the file and use it as a perfect hash.

Does this seem reasonable?

sfackler · 2017-01-27T06:01:06Z

Yep, that definitely sounds reasonable!

idubrov · 2018-02-04T16:50:03Z

Do you think this would be a reasonable way to reduce compilation times for large maps (and help IDE by not showing it gigantic source file)?

joshtriplett · 2018-02-04T17:10:14Z

@idubrov It wouldn't reduce the time to compile the map the first time, but in theory you could cache that map and reuse it as long as the data remains unchanged. And it'd avoid the time to run the result through rustc.

newpavlov · 2019-03-27T13:21:34Z

I would love to see it done as well! Currently I generate so files which I then load at runtime depending on the set which I want to search. It works, but going through FFI is... meh.

abonander · 2019-07-05T02:52:45Z

This proposal is interesting and novel but there's a lot of questions to answer. Namely, if we should just take a pointer to a memory map and interpret it as a &phf::Map or whatever, because this poses a few problems to conquer:

How do we ensure we're interpreting the memory map correctly?
- Do we stabilize the representation of phf::Map et al?
- Do we just encode a version number with the memory map and return an error if it's incompatible?
- What about composite types defined by the user or other crates?
- How do we make sure the type parameters we interpret the map with are the same as the ones the map was created with?
- Or, do we just make this an unsafe operation and lay this whole burden on the user?
How do we support dynamic-sized types like strings and byte slices? Some sort of wrapper type that uses a relative offset instead of a memory address?
How do we prevent concurrent modification of the mmapped file on Unix? File locking is only advisory. We could mark the file as read-only on creation, but we can't prevent someone from changing it back unless we run under a different user.

novacrazy · 2019-07-05T04:50:07Z

I'm by no means an expert on this, but my first instinct would be to design a protocol similar to Cap'n Proto, so packed and flattened in-place. Perhaps require a trait for key/value types that allows it to be packed and stored, somewhat like serde, which can be implemented on std types (including strings, vectors, etc.), and derived for user types. Maybe take inspiration from how serde handles zero-copy parsing so it could handle DSTs, too. However, packing the data too tightly may bring up issues with alignment...

The "schema" in phf's case may be simple and 1D enough to skip most of the complexity of things.

For versioning and type validation, it may be worth looking into how Cap'n Proto handles that, as well. Perhaps using the aforementioned trait to define some consistent type ids.

As for preventing concurrent modifications... I'm not sure there is anything that can be done with that on the level of phf. That would be up to the deployment systems, I think.

abonander · 2019-07-05T20:35:01Z

Theoretically, we could just define CapnProto versions of the phf collections; CapnProto supports generics and in-memory access and capnproto-rust supports a handful of stdlib types. It'd be an interesting experiment.

saona-raimundo · 2023-10-26T12:12:50Z

Maybe I am totally wrong, but is the following an option?
A binary that works as a server answering queries to the map.
Other programs can start the server and interact with it through stdin/out.
Is there a lot of overhead with this approach for the use cases discussed?

JohnTitor added the C-enhancement label Jun 17, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

phf memory-mapped files? #103

phf memory-mapped files? #103

joshtriplett commented Jan 27, 2017

sfackler commented Jan 27, 2017

idubrov commented Feb 4, 2018 •

edited

joshtriplett commented Feb 4, 2018

newpavlov commented Mar 27, 2019

abonander commented Jul 5, 2019 •

edited

novacrazy commented Jul 5, 2019

abonander commented Jul 5, 2019 •

edited

saona-raimundo commented Oct 26, 2023

phf memory-mapped files? #103

phf memory-mapped files? #103

Comments

joshtriplett commented Jan 27, 2017

sfackler commented Jan 27, 2017

idubrov commented Feb 4, 2018 • edited

joshtriplett commented Feb 4, 2018

newpavlov commented Mar 27, 2019

abonander commented Jul 5, 2019 • edited

novacrazy commented Jul 5, 2019

abonander commented Jul 5, 2019 • edited

saona-raimundo commented Oct 26, 2023

idubrov commented Feb 4, 2018 •

edited

abonander commented Jul 5, 2019 •

edited

abonander commented Jul 5, 2019 •

edited