Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Constant-time cardinality? #410

Open
kosak opened this issue Nov 15, 2022 · 2 comments
Open

Constant-time cardinality? #410

kosak opened this issue Nov 15, 2022 · 2 comments

Comments

@kosak
Copy link
Contributor

kosak commented Nov 15, 2022

Is there any appetite for being able to provide cardinality in constant time, on both the 32-bit and 64-bit sides? As a customer of the library I feel like I've wanted this.

Right now it's calculated on demand. In Roaring64Map, the method cardinality() does a full scan of the map every time, and even isFull() will do a lot of work if applied to a map with a large dense prefix of full 32-bit bitmaps. isStrictSubset() also does two cardinality() calls that are not, from an algorithmic standpoint, strictly necessary, so that's a performance drag. On the 32-bit side, it's not constant-time either, requiring an iteration over the containers and an inner iteration over the run containers if there are any.

Providing constant-time cardinality seems possible but not necessarily "easy" so I wanted to check with you first.

I see three approaches here and I wanted to ask your opinion:

  1. Change both the 32-bit side and 64-bit side to provide constant-time cardinality.
  2. Change only the 64-bit side. The 64 bit side would do its own accounting to track the total cardinality. I think this is tricky but perhaps possible.
  3. Don't track cardinality. Then I could at the very least rewrite isStrictSubset so it does a more clever subset calculation that doesn't require any calls to cardinality()
@madscientist
Copy link
Contributor

madscientist commented Nov 18, 2022

I would find this useful. I'm not sure how tricky it would be for all operators. Possibly it would be useful to consider a lazy-evaluation type of thing where when we modify the bitmap in a way in which the new cardinality is not simple to compute (ORing two bitmaps for example) we set the cardinality to an invalid value, then when someone invokes cardinality() and the value is invalid we'd compute it then.

Of course we have to be able to determine an "invalid" value. Maybe using 0 would be OK; if the value is 0 then we'd check to see if the cardinality is really 0 which should be fast; if not then compute the real cardinality.

One potential death-knell for this idea is that it makes cardinality() no longer a read-only operation... it could modify the bitmap metadata. It wouldn't modify the content of the bitmap but it could still be problematic.

@CharlesChen888
Copy link
Contributor

CharlesChen888 commented Nov 24, 2022

we set the cardinality to an invalid value, then when someone invokes cardinality() and the value is invalid we'd compute it then.

Sometimes we need cardinality to determine whether it is necessary to change container (when ORing two arrays, if the cardinality of the result reaches a threshold, the result should be a bitset). So we need more than an invalid value.

Is there a fast method to approximatly estimate cardinality? We can use an estimated value when we need to lower the performance drag, then mark the cardinality as "not precise", and calculate a precise one when cardinality() is called.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants