Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Custom pointer type support ? #87

Open
kleunen opened this issue Mar 27, 2021 · 15 comments
Open

Custom pointer type support ? #87

kleunen opened this issue Mar 27, 2021 · 15 comments

Comments

@kleunen
Copy link

kleunen commented Mar 27, 2021

It seems custom pointer types are not supported with the parallel hashmap:
https://github.com/greg7mdp/parallel-hashmap/blob/master/parallel_hashmap/phmap.h#L843-L846

Would this be difficult to add ? Because I would like to apply the parallel hashmap to a project to process openstreetmap data. This involves loading several billions of geometrical data. We do this by loading the the data into a boost unordered_map backed by a boost interprocess mmap file. This way, the data is hashed but backed by a file on filesystem. This is required, because converting the planet actually involved processing 60G of compressed data.

The loading process is now completely single threaded, but ideally, we would like to load the node/ways/relations store with openstreetmap from the planet osm file in parallel. Parallel hashmap provides an interesting approach for this, but this approach would require using the boost interprocess offset pointer.

https://github.com/systemed/tilemaker

Is there a particular reason why custom pointer types are not allowed ?

@greg7mdp
Copy link
Owner

This is a restriction I inherited from the original Abseil source code. Can you tell me what kind of pointer type you would like to use?

@kleunen
Copy link
Author

kleunen commented Mar 27, 2021

Boost Interprocess Offset pointer:
https://www.boost.org/doc/libs/1_75_0/doc/html/interprocess/offset_ptr.html

Please observe the following:
One of the big problems of offset_ptr is the representation of the null pointer. The null pointer can't be safely represented like an offset, since the absolute address 0 is always outside of the mapped region. Due to the fact that the segment can be mapped in a different base address in each process the distance between the address 0 and offset_ptr is different for every process.

Some implementations choose the offset 0 (that is, an offset_ptr pointing to itself) as the null pointer pointer representation but this is not valid for many use cases since many times structures like linked lists or nodes from STL containers point to themselves (the end node in an empty container, for example) and 0 offset value is needed. An alternative is to store, in addition to the offset, a boolean to indicate if the pointer is null. However, this increments the size of the pointer and hurts performance.

In consequence, offset_ptr defines offset 1 as the null pointer, meaning that this class can't point to the byte after its own this pointer:

@greg7mdp
Copy link
Owner

Thanks! Maybe it is doable to support this. I'm wondering how I could test it. Would you have a small example with a boost unordered_map/set using a custom pointer type allocator?

@kleunen
Copy link
Author

kleunen commented Mar 27, 2021

Yes, i can make small example

@kleunen
Copy link
Author

kleunen commented Mar 27, 2021

This is the easy case:
https://wandbox.org/permlink/uFzkywyr2StzUq4i

But we actually need scoped allocation with piecewise construction as well, i can add example of this later.

@greg7mdp
Copy link
Owner

Great, thanks @kleunen, let me look at that first!

@kleunen
Copy link
Author

kleunen commented Mar 27, 2021

This is with a scoped allocation example:
https://wandbox.org/permlink/08K4Pd3wVUcscwJG

Vector in map using a scoped allocator

@greg7mdp
Copy link
Owner

Thanks. Please give me a couple days to look into it.

@kleunen
Copy link
Author

kleunen commented Mar 27, 2021

I cleaned up the example a little bit more:
https://wandbox.org/permlink/21FjLvAenx9j76P9

@kleunen
Copy link
Author

kleunen commented May 7, 2021

Hello @greg7mdp, do you think this is possible ? Or are there any show-stopping issues ?

@greg7mdp
Copy link
Owner

greg7mdp commented May 7, 2021

Hum, I'm not sure. I did try your example with a change in phmap but I got a very strange error. I need to look more into it.

image

@kleunen
Copy link
Author

kleunen commented May 7, 2021

What was the error you got ?

@kleunen
Copy link
Author

kleunen commented May 29, 2021

I actually worked around the issue now, by implementing a custom allocator, which keeps the mmap region open, even after resizing. This way, the pointers don't need to be offset pointers, because the old region also stayed mapped.

I do still think it would be useful if parallel-hashmap supports custom pointers. I think you should consistently use Alloctor::pointer and not convert to raw pointers.

@greg7mdp
Copy link
Owner

I think you should consistently use Alloctor::pointer and not convert to raw pointers.

How can I do that since I allocate the hashmap array itself using the allocator and need to access the entries into this array?

@kleunen
Copy link
Author

kleunen commented May 29, 2021

You don't actually have to manage the array yourself, you can use std::vector to allocate the array of pairs for you. You can make the actual storage container a template argument:

using entry_t = std::pair<int, int>;
using entry_allocator_t = std::allocator< std::pair<int, int> >;

// Storing entries
template<class T, class A>
using bucket_storage_t = std::vector<T, A>;

int main()
{   
    bucket_storage_t<entry_t, entry_allocator_t> entries(10);
    for(int i = 0; i < 10; ++i)
        entries[i] = std::make_pair(i, i); 
}

In the cases of the parallel hash map, you also need a storage type and allocator for the vector of vectors:

#include <utility>
#include <vector>

// Allocate entries 
using entry_t = std::pair<int, int>;
using entry_allocator_t = std::allocator< std::pair<int, int> >;
using bucked_list_allocator_t = std::allocator< std::vector< std::pair<int, int> > >;

// Storing entries
template<class T, class A>
using bucket_storage_t = std::vector<T, A>;

// Storing list of buckets
template<class T, class A>
using bucket_list_storage_t = std::vector< bucket_storage_t<T, entry_allocator_t>, A>;

int main()
{   
    bucket_list_storage_t<entry_t, bucked_list_allocator_t> bucket_list(10);
    for(int i = 0; i < 10; ++i) 
        bucket_list.resize(100);
} 

But possibly you can also use a deque for this?
No wait. The outer list is a static array in your case.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants