
LinkedHash(Map|Set) have weird complexity #2727

Open
j-baker opened this issue Nov 29, 2022 · 2 comments

j-baker (Contributor) commented Nov 29, 2022

Mostly leaving this as a note. LinkedHashMap and LinkedHashSet, as implemented in the Java standard library, are pretty cool structures. Each map entry holds pointers to the previous and next entries of a doubly linked list, and because the structure is mutable, maintaining that list costs O(1) per operation. Effectively, you get a structure that can be treated as both a list and a set without the downsides in most practical cases.
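
For reference, a minimal illustration of the JDK behaviour described above (plain java.util.LinkedHashMap, nothing Vavr-specific; the demo class name is made up):

import java.util.LinkedHashMap;
import java.util.Map;

class JdkLinkedHashMapDemo {
    public static void main(String[] args) {
        Map<String, Integer> map = new LinkedHashMap<>();
        map.put("a", 1);
        map.put("b", 2);
        map.put("c", 3);

        // Overwriting an existing key is O(1) and keeps the entry's position in the list.
        map.put("b", 20);

        // Removal just unlinks the entry from the doubly linked list: O(1).
        map.remove("a");

        // Iteration follows insertion order: prints b=20 then c=3.
        map.forEach((k, v) -> System.out.println(k + "=" + v));
    }
}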

In Vavr, these structures are implemented as a pair of a HashMap and a Queue. This pairing is rather uncool, because the complexity of various map operations becomes case-dependent.

For example (all complexities below refer to the queue, not to the paired HashMap):

When doing a put, if the element does not exist, the put is O(1). However, doing a put on an element that exists already is O(n) because the element's node is replaced in the queue (and I think the entire queue is copied at this time).

Likewise, doing a remove is O(n) because the queue must be rewritten.
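
To make the shape of the problem concrete, here is a rough, hedged sketch of a map paired with an insertion-order queue as described above; the class and field names are illustrative and this is not Vavr's actual source:

import io.vavr.Tuple;
import io.vavr.Tuple2;
import io.vavr.collection.HashMap;
import io.vavr.collection.Queue;

// Illustrative only: a lookup map paired with a queue that remembers insertion order.
final class PairedMapSketch<K, V> {
    final HashMap<K, V> map;          // fast key -> value lookups
    final Queue<Tuple2<K, V>> order;  // insertion order of the entries

    PairedMapSketch(HashMap<K, V> map, Queue<Tuple2<K, V>> order) {
        this.map = map;
        this.order = order;
    }

    PairedMapSketch<K, V> put(K key, V value) {
        if (map.containsKey(key)) {
            // Overwrite: the entry sits somewhere inside the queue, so the whole
            // queue has to be rebuilt to replace it -> O(n) on the queue.
            Queue<Tuple2<K, V>> rebuilt = order.map(t -> t._1.equals(key) ? Tuple.of(key, value) : t);
            return new PairedMapSketch<>(map.put(key, value), rebuilt);
        }
        // New key: append at the end of the queue -> O(1).
        return new PairedMapSketch<>(map.put(key, value), order.append(Tuple.of(key, value)));
    }

    PairedMapSketch<K, V> remove(K key) {
        // Removal also has to filter the whole queue -> O(n) on the queue.
        return new PairedMapSketch<>(map.remove(key), order.filter(t -> !t._1.equals(key)));
    }
}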

The queue is stored as a head and a tail, with the tail possibly being reversed on dequeue. However, this is only amortized-efficient when the reading operation actually dequeues (e.g. n enqueues followed by n dequeues do one queue reversal, but n enqueues followed by n separate iterations over the same unchanged queue do n queue reversals).

// This function is O(n^2), though the user will expect it to be O(n log n + n), since calling head() reverses the entire queue.
int doSomethingExpensive(int n) {
    LinkedHashSet<Integer> myInts = LinkedHashSet.range(0, n);
    int sum = 0;
    for (int i = 0; i < n; i++) {
        // head() is presently implemented as .iterator().next(), which also prepares the second element for reading
        // and thus reads the entire queue. This could be improved, but the issue remains that reading prefixes of an
        // unchanging map can be expected to be linear in the entire size of the map.
        sum += doSomethingCheapWith(myInts.head());
    }
    return sum;
}

In practice this means the data structure is genuinely much less effective than the normal HashMap: anyone converting their HashMap into a LinkedHashMap should expect a large performance degradation, which will come as a surprise to users of the ordinary Java maps.

An implementation with perhaps slightly poorer top-end performance, but a much more forgiving worst case, might look something like this (a rough code sketch follows below):

- state of LinkedHashMap is (HashMap<K, Tuple2<V, Long>> map, TreeMap<Long, Tuple2<K, V>> indexes, Long nextIndex).
- inserting a new element returns (map.put(key, tuple(value, nextIndex)), indexes.put(nextIndex, tuple(key, value)), nextIndex + 1)
- overwriting returns (map.put(key, tuple(value, map.get(key)._2)), indexes.put(map.get(key)._2, tuple(key, value)), nextIndex)
- removing returns (map.remove(key), indexes.remove(map.get(key)._2), nextIndex)
- iteration is indexes.values().

This structure would be O(log n) on all operations and would generally avoid the sharp edges of the current structure. It would use quite a bit more memory, although this could be reduced by e.g. avoiding boxing of the longs, or at least sharing them. Of course, it's possible that top-end performance would suffer, especially since inserting in sorted order is the worst case for a red-black tree.
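
As a hedged sketch of what the proposal above could look like in code (class and method names are made up for illustration; this is not an actual Vavr API):

import io.vavr.Tuple;
import io.vavr.Tuple2;
import io.vavr.collection.HashMap;
import io.vavr.collection.Seq;
import io.vavr.collection.TreeMap;

// Illustrative only: insertion order is tracked via a monotonically increasing index.
final class IndexedLinkedMap<K, V> {
    private final HashMap<K, Tuple2<V, Long>> map;      // key -> (value, insertion index)
    private final TreeMap<Long, Tuple2<K, V>> indexes;  // insertion index -> (key, value)
    private final long nextIndex;

    static <K, V> IndexedLinkedMap<K, V> empty() {
        return new IndexedLinkedMap<K, V>(HashMap.empty(), TreeMap.empty(), 0L);
    }

    private IndexedLinkedMap(HashMap<K, Tuple2<V, Long>> map,
                             TreeMap<Long, Tuple2<K, V>> indexes,
                             long nextIndex) {
        this.map = map;
        this.indexes = indexes;
        this.nextIndex = nextIndex;
    }

    IndexedLinkedMap<K, V> put(K key, V value) {
        return map.get(key)
                // Overwrite: keep the existing insertion index -> O(log n).
                .map(existing -> new IndexedLinkedMap<K, V>(
                        map.put(key, Tuple.of(value, existing._2)),
                        indexes.put(existing._2, Tuple.of(key, value)),
                        nextIndex))
                // New key: append at the next index -> O(log n).
                .getOrElse(() -> new IndexedLinkedMap<K, V>(
                        map.put(key, Tuple.of(value, nextIndex)),
                        indexes.put(nextIndex, Tuple.of(key, value)),
                        nextIndex + 1));
    }

    IndexedLinkedMap<K, V> remove(K key) {
        return map.get(key)
                .map(existing -> new IndexedLinkedMap<K, V>(
                        map.remove(key),
                        indexes.remove(existing._2),
                        nextIndex))
                .getOrElse(this);
    }

    // Iteration in insertion order: TreeMap.values() yields values sorted by their index.
    Seq<Tuple2<K, V>> entries() {
        return indexes.values();
    }
}

With this shape, an overwrite or removal touches only O(log n) nodes in each structure instead of rewriting a whole queue, which is exactly the worst case the current pairing suffers from.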

j-baker added a commit to j-baker/vavr that referenced this issue Nov 29, 2022
While writing up vavr-io#2727 I noticed that `LinkedHashSet.head()` is implemented as
`iterator().head()`. This is inefficient because `queue.iterator().next()`
makes a call to `.head()`, but also to `.tail()` so as to prepare for the
next `head()` call. Given the structure of the underlying `Queue`, the
`head()` call is worst case `O(1)` but the `tail()` call is worst case
`O(n)`. This worst case will be hit if there have never been overwrites
or removals from the set, which is probably a fairly common case.
danieldietrich pushed a commit that referenced this issue Jan 3, 2023
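
For context, a minimal sketch of the two-list queue shape alluded to in the commit message above, assuming a simplified front/rear representation rather than Vavr's actual Queue internals:

import io.vavr.collection.List;

// Simplified sketch: front holds elements in read order, rear holds newly enqueued
// elements in reverse. Keeping the invariant "front is non-empty unless the queue is
// empty" makes head() O(1), but tail() may have to reverse the whole rear list, which
// is O(n) worst case and only amortized O(1) if the caller keeps dequeuing.
final class TwoListQueue<T> {
    final List<T> front;
    final List<T> rear;

    private TwoListQueue(List<T> front, List<T> rear) {
        // Normalize: if the front ran out, reverse the rear into the front (the O(n) step).
        if (front.isEmpty()) {
            this.front = rear.reverse();
            this.rear = List.empty();
        } else {
            this.front = front;
            this.rear = rear;
        }
    }

    static <T> TwoListQueue<T> empty() {
        return new TwoListQueue<>(List.empty(), List.empty());
    }

    TwoListQueue<T> enqueue(T element) {
        return new TwoListQueue<>(front, rear.prepend(element)); // O(1)
    }

    T head() {
        return front.head(); // O(1) thanks to the invariant (throws if the queue is empty)
    }

    TwoListQueue<T> tail() {
        // O(1) unless this empties the front, in which case the constructor reverses the rear: O(n).
        return new TwoListQueue<>(front.tail(), rear);
    }
}
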
@wrandelshofer

I made a draft implementation of a CHAMP-trie-based Vavr collection that performs all operations in constant time.
The class is currently named SequencedChampMap. Performance is not stellar, but at least there are no performance cliffs anymore. The code is in an experimental state.
https://github.com/wrandelshofer/vavr/blob/6569c67e3d5c3b4dae681fa762c714c8d074d7ab/README.md

@wrandelshofer

My pull request #2745 is supposed to address this problem for LinkedHashSet and LinkedHashMap.
