
Ruby concurrent map decoding sporadically has empty results #6532

Closed · haberman opened this issue Aug 19, 2019 · 3 comments · Fixed by #6533 or #6541

@haberman (Member)
The Ruby unit tests occasionally fail with:

Failure: test_concurrent_decoding(BasicTest::MessageContainerTest)
/tmp/protobuf/protobuf/ruby/tests/basic.rb:245:in `block (3 levels) in test_concurrent_decoding'
     242:       thds = 2.times.map do
     243:         Thread.new do
     244:           100000.times do
  => 245:             assert_equal o, Outer.decode(raw)
     246:           end
     247:         end
     248:       end
/tmp/protobuf/protobuf/ruby/tests/basic.rb:244:in `times'
/tmp/protobuf/protobuf/ruby/tests/basic.rb:244:in `block (2 levels) in test_concurrent_decoding'
<<BasicTest::Outer: items: {0=><BasicTest::Inner: >}>> expected but was
<<BasicTest::Outer: items: {}>>
diff:
? <BasicTest::Outer: items: {0=><BasicTest::Inner: >}>

Here is a sample failure in Kokoro.

This was first observed on PR #5866, so it is presumed to be related. However, the bug is very sporadic and difficult to reproduce, so it's hard to say for sure when it was introduced.

Previously I tried the following strategies to make the bug more reproducible:

  • I compiled a custom Ruby that uses 1 ms timeslices instead of 100 ms
  • I increased the number of map entries to 1,000
  • I set GC.stress = 0x01 | 0x04, as in gc_test.rb
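For reference, the GC.stress setting from that last bullet looks like this when applied around the code under test (a sketch; the tight loop is a stand-in for the real decode calls, and the flag values follow Ruby's GC.stress documentation: 0x01 disables major GC, 0x04 forces a full mark after every malloc/calloc/realloc):

```ruby
# Sketch: amplify GC activity around the code under test, as in gc_test.rb.
GC.stress = 0x01 | 0x04
begin
  1_000.times { "payload".dup }  # stand-in for repeated Outer.decode(raw) calls
ensure
  GC.stress = false              # always restore normal GC behavior
end
```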
@haberman (Member, Author)

I've had the most luck reproducing this on Ruby 2.3, so that's what I've been using.

This doesn't appear to be GC-related: I can reproduce the error even with the test wrapped between GC.disable and GC.enable.

I can get this modified version of the test to fail about 1 out of 15 times on the Ruby 2.3 docker image.

    def test_concurrent_decoding
      GC.start
      GC.disable
      o = Outer.new
      o.items[0] = Inner.new
      raw = Outer.encode(o)

      thds = 2.times.map do |t|
        Thread.new do
          assert_equal o, Outer.decode(raw), "thread #{t}, raw=#{raw.length}"
        end
      end
      thds.map(&:join)
      GC.enable
    end

Run in a loop with:

while true; do bundle exec ruby tests/basic.rb -n test_concurrent_decoding; done

But oddly, adding more looping in Ruby itself doesn't make the bug reproduce more often. This suggests that some per-process state is in play, and we only get a chance to trigger the bug in a fresh Ruby process.

Perhaps it has to do with starting threads. Maybe Thread.new has to spawn a real pthread the first time it is called, and the bug is triggered only by this initial creation of the thread.
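If that hypothesis is right, pre-creating one throwaway thread should mask the bug. A sketch of such a probe (the trivial thread bodies are stand-ins for the real Outer.decode(raw) calls):

```ruby
# Hypothesis probe (sketch): if the bug only fires around the first
# Thread.new in a process, creating one throwaway thread up front should
# make the failure disappear.
Thread.new { }.join   # force the first real pthread to be created early

# ...then run the racy section; here a stand-in for the concurrent decode:
results = 2.times.map { Thread.new { :decoded } }.map(&:value)
```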

@haberman (Member, Author)

I've narrowed this down to a race condition in the lazy initialization of decoder state in Outer.decode(raw).

Specifically, if I call Outer.decode(raw) before spawning any threads, the failure goes away.
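In pure Ruby terms, the shape of such a lazy-init race looks roughly like this (an illustrative sketch, not protobuf's actual code; the names are hypothetical, and `sleep` stands in for any Ruby call that can yield the GVL mid-initialization):

```ruby
cache = {}  # lazily-built per-type state, shared across threads

build = lambda do |key|
  unless cache.key?(key)
    cache[key] = []       # publish the entry before it is filled in...
    sleep 0.001           # ...then yield the GVL mid-initialization
    cache[key] << :field_def
  end
  cache[key]              # a second thread can observe the empty entry here
end

threads = 2.times.map { Thread.new { build.call(:Outer) } }
threads.each(&:join)
```

A thread that arrives while the first builder is suspended sees an entry that exists but is still empty, which matches the `items: {}` symptom above.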

It seems that some calls into the Ruby API from C must be capable of releasing the GVL, causing us to re-enter the protobuf C extension from a different thread. I am curious to know more about which operations can do this. Can even a hash table lookup do this? If so, this call is suspect:

    def = rb_hash_aref(descriptor_pool->def_to_descriptor, key);

@samphippen can you comment? What calls into Ruby can trigger a thread switch and GVL release? Is it any call into Ruby, or just some calls?

@haberman (Member, Author)

To answer my own question: it sounds like anything that triggers a GC, or possibly evaluates some Ruby, has the potential to switch Ruby threads and release the GVL.

This is the same issue explained by @tenderlove in #3560 (comment)

I have a fix in #6541. Locally it seems to fix the problem reliably.
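The general shape of such a fix, in pure Ruby terms (a sketch; the names and the Mutex are illustrative, and the actual change in #6541 may differ): build the state completely before publishing it, and serialize the lazy initialization so concurrent callers can never observe a half-built structure.

```ruby
INIT_LOCK = Mutex.new
CACHE = {}

# Sketch: the value on the right of ||= is fully constructed before the
# assignment publishes it, and the lock keeps two threads from racing to
# initialize the same entry.
def descriptor_for(key)
  INIT_LOCK.synchronize do
    CACHE[key] ||= [:field_def]
  end
end

threads = 4.times.map { Thread.new { descriptor_for(:Outer) } }
threads.each(&:join)
```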
