New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Faulty implementation of method diff
in subclass BitSetN
of scala.collection.immutable.BitSet
#10238
Conversation
diff
in subclass 'BitSetN' of 'scala.collection.immutable.BitSet'
diff
in subclass 'BitSetN' of 'scala.collection.immutable.BitSet'diff
in subclass BitSetN
of scala.collection.immutable.BitSet
Thanks! It would be good to include a unit test for this as well I think. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this works, but please copy the logic from filterImpl
instead (where if currentWord
is nonzero, just packages that), which shortcuts all sorts of unnecessary work, or if filterImpl
is actually broken (but I don't think it is!), please fix that one too!
Wondering if the value ought to be called EDIT: because "the N-word" is a euphemism of an epithet in English. The identifier was |
@@ -244,7 +244,7 @@ object BitSet extends SpecificIterableFactory[Int, BitSet] { | |||
anyChanges ||= currentWord != oldWord | |||
i -= 1 | |||
} | |||
if (i < 0) { | |||
if (i < 0 && currentWord == 0L) { | |||
// all indices >= 0 have had result 0, so the bitset is empty | |||
this.empty |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if (!anyChanges) this
else if (currentWord == 0) empty
else new BitSet1(currentWord)
and similarly at filterImpl
.
Maybe also upgrade the comment. This was a bit tricky and off-by-one-ish.
At the end of
|
There is no variable |
|
Thanks, this test should be included. |
Any further comments? Otherwise I will commit the proposed changes and update this pull request. |
By the way: why is there no subclass |
som's logic is the best: efficient and covers all cases. Both There's no |
Somehow my answer to the above review comment from @som-snytt is not shown. Please find it below. That happens in the else branch of Furthermore, the following code fragment in this else branch is superfluous since the if condition can never be
|
While you believe that @som-snytt's proposed change is better, I believe it's mine, namely for the following reasons. With @som-snytt's code, there is dead code in the else branch. More specifically, the variable With my proposed change, the comment And if, in @som-snytt's code, the variable |
I only spent a few minutes with the code, so I didn't look at code paths in the other branch. In a couple of minutes, I understood the enclosing if just well enough to make my suggestion. Later today, I'll a few more minutes to review the update. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the contribution.
Could you squash to one commit and push -f
, as is usual for this repo?
Done. All changes are in one single commit. |
FYI: I have just signed the Scala CLA. |
note to self: the factory method does minor optimization (testing for zero element) but doesn't test whether one-element of zero should be empty. In any case, here we already compute the necessary elements. Alternative edit:
removes some braces. |
Given that the only reason to use BitSet is performance, and at least for most collections the number of elements is small, adding an extra variable assignment, an extra test and jump over a while loop, and then another extra test all seems a bit iffy to me. It's great that you found and removed the dead code in the other branch--but the So I don't understand the resistance to handling the one-or-zero word case the first time it's detected. It's not a big deal, but if we're already looking at it and thinking it through carefully enough to catch bugs like this, why not leave the logic more efficient than before? (On the other hand, given that this bug escaped for so long, one could argue that BitSet isn't really used enough to care about, so anything that fixes the bug is fine.) |
I agree with the concluding parenthetical, in the spirit of minimum viable change. Also only the empty result case is affected. Personally, I was satisfied to have briefly understood the |
Well, I also agree with the parenthetical, and, @francesco-kriegel, I definitely appreciate the effort to fix the bug! (Catching the dead code is also good!) It's just that as a matter of principle, I don't think that the "this is better than the alternative fix" argument actually is a particularly strong argument, at least as presented, and I don't think it's good policy to agree with non-compelling arguments even if one could be convinced to go along with the outcome. "This is a fix, and it clearly works, which is all that's needed" is more compelling. |
I've lodged a ticket with github to add a button to the review drop-down, "Approve as a fix." Worth adding, soberly, that one does count bytecodes in source one knows has been closely examined and will be subject to scrutiny again. But I think there is a greater burden of proof to demonstrate one has done no harm, to mix professional metaphors. I'm not sure all the statements in the comment with all the som-snytts are exactly right; but I also don't want to hold two variables in my head, or, in a pinch, on a napkin. I can follow up later if it is deemed advisable. I see we need not increment on assignment to When I asked for an updated code comment, I meant to clarify the bug case, that we haven't tested the final word.
Not to beat a horse that has already been converted to glue. We'll revisit this code when rewriting under Scala 3 (because |
The size (in memory) of a BitSet depends on its maximal element, not on its cardinality (like for other set implementations). The first Long stores the 64 bits that encode whether the numbers from 0 to 63 are contained, the second Long represents containment information for numbers 64 to 127, and so on. A more space-efficient BitSet implementation would also account for the minimal element, e.g., if elements 0 to 255 are not contained and the minimal element is in the slice 256 to 319, then one can dispense with the first four Longs (which are all 0L). In the end, I am using BitSets as my implementations are usually faster than with other set implementations. The additional efforts for indexing (one needs maps fro and to the original elements if these are not numbers) are negligible. One could even package this into a class that is based on BitSets and does the conversion between elements and indices as needed. But doing this manually is often faster since one can just work with indices as long as possible. Maybe there is a more elegant approach to this. Regarding efficiency of the code. I do not know which optimizations the compiler or a VM apply to the code or the byte code, respectively. It might be that, in my originally proposed fix, optimization is easier since there are deterministic ways to returning an empty BitSet, a BitSet1, a BitSet2, or a BitSetN, respectively. And dead code or unnecessarily duplicated can be problematic as it increases code complexity, which seems to be bad for programmers as well as for compilers. Perhaps the compiler or an optimized VM (like GraalVM EE) is intelligent enough to know that it can jump over the while loop in certain cases or can dispense with variables (that are in code but would not be needed in optimized code). I don't know! But, since we started this discussion regarding efficiency, why not try to find a better implementation? I propose the below replacement for the big if statement. It branches earlier if it is foreseeable that the result will be a BitSet1, a BitSet2, or a BitSetN, respectively. And chains of ifs are replaced by a match (I would expect that this is faster, but don't know for sure).
Given that this is code in the standard library, the code should be as fast as possible since it might and will be used in several software projects. I have never checked standard classes delivered with Java or Scala, but am more picky with other, external libraries. So, the below code might provide an implementation that is faster than before.
What do you think? |
That's close to my last edit, I hope Ichoran chimes in. We might have to check method size after unrolling. |
@francesco-kriegel - That looks more efficient, and I think more in the spirit of the original code (i.e. attempting to be reasonably optimized)! (Note--the space problems with Case statements do not require braces surrounding them; it would read a little cleaner if it were just case -1 =>
some
code
case 0 =>
someMore
code
... (The braces around single-statement if/else are more debatable; I like som-snytt's more compact style, but the existing code is pretty brace-heavy, so it's a better style match as you wrote it.) (Note also that Scala 2 and Scala 3 have different ideas about what a same-line case statement means, i.e. whether |
There is yet another problem with the current code. Consider the following example.
The computed diff is empty, but it should consist of the number 64. This results from going over the while loop and thereby setting the variable This should also be resolved in my above proposal. |
Thanks for the hint. I used braces to preclude any issues due to wrong indentation. But the no-brace style is clearly much more readable. By the way, the Scala plugin in IntelliJ needs more CPU time the more no-brace style is used. (That might still be a reason to avoid it....) |
Indeed, with the above big I am starting to wonder whether there are further problem cases. |
If one needs to represent negative numbers as well, then one can simply have two I do think that |
In order to find this out, I am currently using the below method instead of
|
Oh, nice catch on the two-element case! Regarding on how to implement negatives: a single array with a negative offset is probably better...but in any case, the API assumes both input and output of a single 0-indexed bit array, so neither approach suffices. Either a significant API update would be needed, or a whole new class. |
Travis does a Scala 3 build.
@lrytz why we need the new behavior in scala 2 sooner than later. |
That is an especially good example of the ambiguity, as I wasn't sure recently if |
@som-snytt nice! #10220 is ready, in case you're up for a review. |
…Set.BitSetN' and aligned similar method 'filterImpl'
My double test against |
Thank you @francesco-kriegel, @som-snytt and @Ichoran. I was going to ask if we could write a Scalacheck test, but saw there's one. Then I was going to ask why it didn't catch this bug, but I now see why. With two random bitsets, it's really unlikely to hit that case ( The diff LGTM. Good to go for everyone? |
LGTM! |
thanks @francesco-kriegel @Ichoran that was diverting. |
….13.11` since my pull request #10238 has been integrated into the main branch, see scala/scala#10238
The implementation of the method
diff
in classBitSetN
is faulty. It returns an empty BitSet for non-equal sets if the differences all occur in the first word. As an example consider the following code.While
mutableDiff
contains the correct result, namely a singleton BitSet containing 0,immutableDiff
is empty.The error is still present in Scala 3.2.1, which I use. It would be great if this bug can be resolved in all affected versions.