Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Faulty implementation of method diff in subclass BitSetN of scala.collection.immutable.BitSet #10238

Merged
merged 1 commit into from Dec 19, 2022
Merged

Conversation

francesco-kriegel
Copy link
Contributor

@francesco-kriegel francesco-kriegel commented Dec 11, 2022

The implementation of the method diff in class BitSetN is faulty. It returns an empty BitSet for non-equal sets if the differences all occur in the first word. As an example consider the following code.

    val mutableBitSet1 = scala.collection.mutable.BitSet(0, 128)
    val mutableBitSet2 = scala.collection.mutable.BitSet(128)
    val mutableDiff = mutableBitSet1 diff mutableBitSet2
    println("mutable diff: " + mutableDiff)

    val immutableBitSet1 = scala.collection.immutable.BitSet(0, 128)
    val immutableBitSet2 = scala.collection.immutable.BitSet(128)
    val immutableDiff = immutableBitSet1 diff immutableBitSet2
    println("immutable diff: " + immutableDiff)

While mutableDiff contains the correct result, namely a singleton BitSet containing 0, immutableDiff is empty.

The error is still present in Scala 3.2.1, which I use. It would be great if this bug can be resolved in all affected versions.

@scala-jenkins scala-jenkins added this to the 2.13.11 milestone Dec 11, 2022
@francesco-kriegel francesco-kriegel changed the title Update BitSet.scala Faulty implementation of method 'diff' in subclass 'BitSetN' of 'scala.collection.immutable.BitSet' Dec 11, 2022
@francesco-kriegel francesco-kriegel changed the title Faulty implementation of method 'diff' in subclass 'BitSetN' of 'scala.collection.immutable.BitSet' Faulty implementation of method diff in subclass 'BitSetN' of 'scala.collection.immutable.BitSet' Dec 11, 2022
@francesco-kriegel francesco-kriegel changed the title Faulty implementation of method diff in subclass 'BitSetN' of 'scala.collection.immutable.BitSet' Faulty implementation of method diff in subclass BitSetN of scala.collection.immutable.BitSet Dec 11, 2022
@SethTisue SethTisue requested a review from a team December 11, 2022 22:58
@SethTisue SethTisue added the library:collections PRs involving changes to the standard collection library label Dec 11, 2022
@joshlemer
Copy link
Member

Thanks! It would be good to include a unit test for this as well I think.

Copy link
Contributor

@Ichoran Ichoran left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this works, but please copy the logic from filterImpl instead (where if currentWord is nonzero, just packages that), which shortcuts all sorts of unnecessary work, or if filterImpl is actually broken (but I don't think it is!), please fix that one too!

@som-snytt
Copy link
Contributor

som-snytt commented Dec 12, 2022

Wondering if the value ought to be called wordcount instead of nword.

EDIT: because "the N-word" is a euphemism of an epithet in English. The identifier was nwords.

@@ -244,7 +244,7 @@ object BitSet extends SpecificIterableFactory[Int, BitSet] {
anyChanges ||= currentWord != oldWord
i -= 1
}
if (i < 0) {
if (i < 0 && currentWord == 0L) {
// all indices >= 0 have had result 0, so the bitset is empty
this.empty
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

            if (!anyChanges) this
            else if (currentWord == 0) empty
            else new BitSet1(currentWord)

and similarly at filterImpl.

Maybe also upgrade the comment. This was a bit tricky and off-by-one-ish.

@som-snytt
Copy link
Contributor

At the end of test/junit/scala/collection/immutable/SetTest.scala, more for documentation of the change than for testing.

  @Test def `immutable BitSet diff`: Unit = {
    assertEquals(BitSet(0), BitSet(0, 128).diff(BitSet(128)))
  }

@francesco-kriegel
Copy link
Contributor Author

Wondering if the variable ought to be wordcount instead of nword.

There is no variable wordcount. Referring to nword is correct since it denotes the number of Longs that are used to store the bits (64 bits per each Long).

@francesco-kriegel
Copy link
Contributor Author

I think this works, but please copy the logic from filterImpl instead (where if currentWord is nonzero, just packages that), which shortcuts all sorts of unnecessary work, or if filterImpl is actually broken (but I don't think it is!), please fix that one too!

filterImpl works correctly, but contains unnecessary code, see also my above answer to @som-snytt . I think it's a good idea to correct the method diff as proposed and accordingly adapt filterImpl.

@francesco-kriegel
Copy link
Contributor Author

Thanks! It would be good to include a unit test for this as well I think.

At the end of test/junit/scala/collection/immutable/SetTest.scala, more for documentation of the change than for testing.

  @Test def `immutable BitSet diff`: Unit = {
    assertEquals(BitSet(0), BitSet(0, 128).diff(BitSet(128)))
  }

Thanks, this test should be included.

@francesco-kriegel
Copy link
Contributor Author

Any further comments? Otherwise I will commit the proposed changes and update this pull request.

@francesco-kriegel
Copy link
Contributor Author

By the way: why is there no subclass BitSet0 for empty immutable.BitSets? Is it because, although it might save 64 bit for representing empty, it needs further memory for the compiled class?

@Ichoran
Copy link
Contributor

Ichoran commented Dec 12, 2022

som's logic is the best: efficient and covers all cases. Both diff and filterImpl should use that.

There's no BitSet0 because the overhead of wider multiple dispatch (more distinct implementations in subclasses) outweighs the small storage-size penalty.

@francesco-kriegel
Copy link
Contributor Author

            if (!anyChanges) this
            else if (currentWord == 0) empty
            else new BitSet1(currentWord)

and similarly at filterImpl.

Maybe also upgrade the comment. This was a bit tricky and off-by-one-ish.

Somehow my answer to the above review comment from @som-snytt is not shown. Please find it below.

That happens in the else branch of if (i < 0 && currentWord == 0L) anyway, so there is no need for this change.

Furthermore, the following code fragment in this else branch is superfluous since the if condition can never be true.

if (minimumNonZeroIndex == -1) {
    this.empty
} else

@francesco-kriegel
Copy link
Contributor Author

francesco-kriegel commented Dec 13, 2022

som's logic is the best: efficient and covers all cases. Both diff and filterImpl should use that.

There's no BitSet0 because the overhead of wider multiple dispatch (more distinct implementations in subclasses) outweighs the small storage-size penalty.

While you believe that @som-snytt's proposed change is better, I believe it's mine, namely for the following reasons.

With @som-snytt's code, there is dead code in the else branch. More specifically, the variable minimumNonZeroIndex would never have value 0 or 1 and thus the first two if statements are superfluous, dead code.

With my proposed change, the comment // all indices >= 0 have had result 0, so the bitset is empty of the original author is still valid. What's more, what would be done when the variable minimumNonZeroIndex has value 1 in the else branch is the same as in the else branch of @som-snytt's code. Also here, the variable minimumNonZeroIndex would never have value 0, i.e., the first if statement should be removed.

And if, in @som-snytt's code, the variable anyChanges has value false and thus this is returned, then this would also happen in the else branch (specifically there in the else branch of if (anyChanges)).

@som-snytt
Copy link
Contributor

I only spent a few minutes with the code, so I didn't look at code paths in the other branch.

In a couple of minutes, I understood the enclosing if just well enough to make my suggestion.

Later today, I'll a few more minutes to review the update.

Copy link
Contributor

@som-snytt som-snytt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the contribution.

Could you squash to one commit and push -f, as is usual for this repo?

@francesco-kriegel
Copy link
Contributor Author

Done. All changes are in one single commit.

@francesco-kriegel
Copy link
Contributor Author

FYI: I have just signed the Scala CLA.

@som-snytt
Copy link
Contributor

note to self:

the factory method does minor optimization (testing for zero element) but doesn't test whether one-element of zero should be empty. In any case, here we already compute the necessary elements.

Alternative edit:

          if (!anyChanges) this
          else if (minimumNonZeroIndex == 0)
            if (currentWord == 0) empty else new BitSet1(currentWord)
          else if (minimumNonZeroIndex == 1)
            new BitSet2(word(0) & ~bs.word(0), currentWord)
          else {
            val newArray = elems.take(minimumNonZeroIndex + 1)
            newArray(i + 1) = currentWord
            while (i >= 0) {
              newArray(i) = word(i) & ~bs.word(i)
              i -= 1
            }
            new BitSetN(newArray)
          }

removes some braces.

@Ichoran
Copy link
Contributor

Ichoran commented Dec 13, 2022

Given that the only reason to use BitSet is performance, and at least for most collections the number of elements is small, adding an extra variable assignment, an extra test and jump over a while loop, and then another extra test all seems a bit iffy to me.

It's great that you found and removed the dead code in the other branch--but the == 0 would be dead too if you handled that outcome of the i < 0 case!

So I don't understand the resistance to handling the one-or-zero word case the first time it's detected. It's not a big deal, but if we're already looking at it and thinking it through carefully enough to catch bugs like this, why not leave the logic more efficient than before?

(On the other hand, given that this bug escaped for so long, one could argue that BitSet isn't really used enough to care about, so anything that fixes the bug is fine.)

@som-snytt
Copy link
Contributor

I agree with the concluding parenthetical, in the spirit of minimum viable change. Also only the empty result case is affected.

Personally, I was satisfied to have briefly understood the else block before forgetting again.

@Ichoran
Copy link
Contributor

Ichoran commented Dec 13, 2022

Well, I also agree with the parenthetical, and, @francesco-kriegel, I definitely appreciate the effort to fix the bug! (Catching the dead code is also good!)

It's just that as a matter of principle, I don't think that the "this is better than the alternative fix" argument actually is a particularly strong argument, at least as presented, and I don't think it's good policy to agree with non-compelling arguments even if one could be convinced to go along with the outcome.

"This is a fix, and it clearly works, which is all that's needed" is more compelling.

@som-snytt
Copy link
Contributor

I've lodged a ticket with github to add a button to the review drop-down, "Approve as a fix."

Worth adding, soberly, that one does count bytecodes in source one knows has been closely examined and will be subject to scrutiny again. But I think there is a greater burden of proof to demonstrate one has done no harm, to mix professional metaphors.

I'm not sure all the statements in the comment with all the som-snytts are exactly right; but I also don't want to hold two variables in my head, or, in a pinch, on a napkin. I can follow up later if it is deemed advisable. I see we need not increment on assignment to minimumNonZeroIndex, as we can bump by 2 for newArray size.

When I asked for an updated code comment, I meant to clarify the bug case, that we haven't tested the final word.

// all indices > 0 have had result 0, so if the last word is also zero, the bitset is empty

Not to beat a horse that has already been converted to glue. We'll revisit this code when rewriting under Scala 3 (because inline).

@francesco-kriegel
Copy link
Contributor Author

francesco-kriegel commented Dec 14, 2022

Given that the only reason to use BitSet is performance, and at least for most collections the number of elements is small, adding an extra variable assignment, an extra test and jump over a while loop, and then another extra test all seems a bit iffy to me.

It's great that you found and removed the dead code in the other branch--but the == 0 would be dead too if you handled that outcome of the i < 0 case!

So I don't understand the resistance to handling the one-or-zero word case the first time it's detected. It's not a big deal, but if we're already looking at it and thinking it through carefully enough to catch bugs like this, why not leave the logic more efficient than before?

The size (in memory) of a BitSet depends on its maximal element, not on its cardinality (like for other set implementations). The first Long stores the 64 bits that encode whether the numbers from 0 to 63 are contained, the second Long represents containment information for numbers 64 to 127, and so on. A more space-efficient BitSet implementation would also account for the minimal element, e.g., if elements 0 to 255 are not contained and the minimal element is in the slice 256 to 319, then one can dispense with the first four Longs (which are all 0L).

In the end, I am using BitSets as my implementations are usually faster than with other set implementations. The additional efforts for indexing (one needs maps fro and to the original elements if these are not numbers) are negligible. One could even package this into a class that is based on BitSets and does the conversion between elements and indices as needed. But doing this manually is often faster since one can just work with indices as long as possible. Maybe there is a more elegant approach to this.

Regarding efficiency of the code. I do not know which optimizations the compiler or a VM apply to the code or the byte code, respectively. It might be that, in my originally proposed fix, optimization is easier since there are deterministic ways to returning an empty BitSet, a BitSet1, a BitSet2, or a BitSetN, respectively. And dead code or unnecessarily duplicated can be problematic as it increases code complexity, which seems to be bad for programmers as well as for compilers. Perhaps the compiler or an optimized VM (like GraalVM EE) is intelligent enough to know that it can jump over the while loop in certain cases or can dispense with variables (that are in code but would not be needed in optimized code). I don't know!

But, since we started this discussion regarding efficiency, why not try to find a better implementation? I propose the below replacement for the big if statement. It branches earlier if it is foreseeable that the result will be a BitSet1, a BitSet2, or a BitSetN, respectively. And chains of ifs are replaced by a match (I would expect that this is faster, but don't know for sure).

Worth adding, soberly, that one does count bytecodes in source one knows has been closely examined and will be subject to scrutiny again. But I think there is a greater burden of proof to demonstrate one has done no harm, to mix professional metaphors.

Given that this is code in the standard library, the code should be as fast as possible since it might and will be used in several software projects. I have never checked standard classes delivered with Java or Scala, but am more picky with other, external libraries. So, the below code might provide an implementation that is faster than before.

          i match {
            case -1 => {
              if (anyChanges) {
                if (currentWord == 0) {
                  empty
                } else {
                  new BitSet1(currentWord)
                }
              } else {
                this
              }
            }
            case 0 => {
              val oldFirstWord = word(0)
              val firstWord = oldFirstWord & ~bs.word(0)
              anyChanges ||= firstWord != oldFirstWord
              if (anyChanges) {
                new BitSet2(firstWord, currentWord)
              } else {
                this
              }
            }
            case _ => {
              val minimumNonZeroIndex: Int = i + 1
              while (!anyChanges && i >= 0) {
                val oldWord = word(i)
                currentWord = oldWord & ~bs.word(i)
                anyChanges ||= currentWord != oldWord
                i -= 1
              }
              if (anyChanges) {
                val newArray = elems.take(minimumNonZeroIndex + 1)
                newArray(i + 1) = currentWord
                while (i >= 0) {
                  newArray(i) = word(i) & ~bs.word(i)
                  i -= 1
                }
                new BitSetN(newArray)
              } else {
                this
              }
            }
          }

What do you think?

@som-snytt
Copy link
Contributor

That's close to my last edit, I hope Ichoran chimes in. We might have to check method size after unrolling.

@Ichoran
Copy link
Contributor

Ichoran commented Dec 14, 2022

@francesco-kriegel - That looks more efficient, and I think more in the spirit of the original code (i.e. attempting to be reasonably optimized)!

(Note--the space problems with BitSet are true (negative numbers are even worse!), but pretty much enforced by the interface...one could imagine an offset array model, and I actually worked on one myself back when the library was being redone for 2.12, but it was too hard to get it to fit the interface constraints.)

Case statements do not require braces surrounding them; it would read a little cleaner if it were just

  case -1 =>
    some
    code
  case 0 =>
    someMore
    code
  ...

(The braces around single-statement if/else are more debatable; I like som-snytt's more compact style, but the existing code is pretty brace-heavy, so it's a better style match as you wrote it.)

(Note also that Scala 2 and Scala 3 have different ideas about what a same-line case statement means, i.e. whether case x => hi; hey has the hey as part of the case, so you should always use braces on one-liners with multiple statements...but that's not the situation here!)

@francesco-kriegel
Copy link
Contributor Author

There is yet another problem with the current code. Consider the following example.

    val us = scala.collection.immutable.BitSet(39, 41, 44, 46, 256)
    val vs = scala.collection.immutable.BitSet(39, 41, 44, 46, 64, 256)
    val xs = scala.collection.immutable.BitSet.fromBitMask(us.toBitMask.take(3))
    val ys = scala.collection.immutable.BitSet.fromBitMask(vs.toBitMask.take(3))
    val diff = ys diff xs
    println("xs: " + xs)
    println("ys: " + ys)
    println("diff: " + diff)

The computed diff is empty, but it should consist of the number 64.

This results from going over the while loop and thereby setting the variable currentWord to another value that corresponds to array position 0. So it cannot be used in new BitSet2(word(0) & ~bs.word(0), currentWord).

This should also be resolved in my above proposal.

@francesco-kriegel
Copy link
Contributor Author

francesco-kriegel commented Dec 14, 2022

Case statements do not require braces surrounding them

Thanks for the hint. I used braces to preclude any issues due to wrong indentation. But the no-brace style is clearly much more readable. By the way, the Scala plugin in IntelliJ needs more CPU time the more no-brace style is used. (That might still be a reason to avoid it....)

@francesco-kriegel
Copy link
Contributor Author

This should also be resolved in my above proposal.

Indeed, with the above big match the correct result BitSet(64) is computed.

I am starting to wonder whether there are further problem cases.

@francesco-kriegel
Copy link
Contributor Author

Note--the space problems with BitSet are true (negative numbers are even worse!)

If one needs to represent negative numbers as well, then one can simply have two Array[Long]: one for the positive numbers and the other for the negative numbers. That's easy to implement as one only needs to "mirror" the code.

I do think that BitSet is a space-efficient representation of not too sparsely filled sets. Assuming a base set of 10000 elements, a subset represented as a BitSet needs approximately 5000 bits (sometimes more, sometimes fewer, depending on the maximal element). Now, for a density of, say, 10 per cent, this means 5000 bits for 1000 elements, i.e., 5 bits per element. I doubt that there is an object-based set representation that can offer to store an element using only 5 bits. With a density of 1 per cent that would be 50 bits. Do you know how much bits a HashSet needs for storing one element? Of course, one still needs conversion between indices and the referenced objects themselves, but if one creates a lot of subsets, then using BitSet seems to be the way to go.

@francesco-kriegel
Copy link
Contributor Author

I am starting to wonder whether there are further problem cases.

In order to find this out, I am currently using the below method instead of diff. If that reveals further bugs, then I will let you know.

  implicit class LocalBitSet(bs: collection.BitSet) {
    def diffOrReportBug(other: collection.BitSet): collection.BitSet = {
      val diff = bs diff other
      val expected = collection.mutable.BitSet.fromSpecific(bs).subtractAll(other)
      if (diff equals expected)
        diff
      else
        println("BitSet 1: " + bs)
        println("BitSet 2: " + other)
        println("Computed Difference: " + diff)
        println("Expected Difference: " + expected)
        throw RuntimeException()
    }
  }

@Ichoran
Copy link
Contributor

Ichoran commented Dec 14, 2022

Oh, nice catch on the two-element case!

Regarding on how to implement negatives: a single array with a negative offset is probably better...but in any case, the API assumes both input and output of a single 0-indexed bit array, so neither approach suffices. Either a significant API update would be needed, or a whole new class.

@som-snytt
Copy link
Contributor

Travis does a Scala 3 build.

[error] -- [E049] Reference Error: /home/travis/build/scala/scala/src/library/scala/collection/immutable/BitSet.scala:251:18 
[error] 251 |                  empty
[error]     |                  ^^^^^
[error]     |                  Reference to empty is ambiguous,
[error]     |                  it is both defined in object BitSet
[error]     |                  and inherited subsequently in class BitSetN
[error]     |
[error]     | longer explanation available when compiling with `-explain`
[error] -- [E049] Reference Error: /home/travis/build/scala/scala/src/library/scala/collection/immutable/BitSet.scala:330:14 
[error] 330 |              empty
[error]     |              ^^^^^
[error]     |              Reference to empty is ambiguous,
[error]     |              it is both defined in object BitSet
[error]     |              and inherited subsequently in class BitSetN
[error]     |
[error]     | longer explanation available when compiling with `-explain`

@lrytz why we need the new behavior in scala 2 sooner than later.

@som-snytt
Copy link
Contributor

That is an especially good example of the ambiguity, as I wasn't sure recently if empty needed to be dispatched through this.empty. To clarify, in Scala 3 and #10177 the inherited member no longer shadows the definition in scope.

@lrytz
Copy link
Member

lrytz commented Dec 16, 2022

@som-snytt nice! #10220 is ready, in case you're up for a review.

…Set.BitSetN' and aligned similar method 'filterImpl'
@francesco-kriegel
Copy link
Contributor Author

empty is replaced by this.empty in order to avoid ambiguity. The CI server should now be able to successfully build the snapshot. Let's wait for the result to be sure.

My double test against collection.mutable.BitSet.fromSpecific(bs).subtractAll(other) has not revealed further faulty cases. To be more precise, diff in immutable.BitSet is now as correct as subtractAll in mutable.BitSet is. Furthermore, I believe that the corrected code now calculates the correct results, but I have not proved this in a formal way.

@lrytz
Copy link
Member

lrytz commented Dec 19, 2022

Thank you @francesco-kriegel, @som-snytt and @Ichoran.

I was going to ask if we could write a Scalacheck test, but saw there's one. Then I was going to ask why it didn't catch this bug, but I now see why. With two random bitsets, it's really unlikely to hit that case (BitSet(x) diff BitSet(x, y) for pretty specific x and y -- the two bitsets are generated independently at random).

The diff LGTM. Good to go for everyone?

@Ichoran
Copy link
Contributor

Ichoran commented Dec 19, 2022

LGTM!

@som-snytt som-snytt merged commit 2be3b85 into scala:2.13.x Dec 19, 2022
@som-snytt
Copy link
Contributor

thanks @francesco-kriegel @Ichoran that was diverting.

@francesco-kriegel francesco-kriegel deleted the patch-1 branch January 26, 2023 13:24
francesco-kriegel added a commit to francesco-kriegel/efficient-axiomatization-of-owl2el-ontologies-from-data that referenced this pull request Dec 11, 2023
….13.11` since my pull request #10238 has been integrated into the main branch, see scala/scala#10238
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
library:collections PRs involving changes to the standard collection library
Projects
None yet
7 participants