Seg fault using Ruby google-protobuf v3.15.8 #8559

Closed
stanhu opened this issue May 4, 2021 · 27 comments

@stanhu
Contributor

stanhu commented May 4, 2021

What version of protobuf and what language are you using?
Version: v3.15.8
Language: Ruby

What operating system (Linux, Windows, ...) and version?

Linux

What runtime / compiler are you using (e.g., python version or gcc version)

$ docker run -it registry.gitlab.com/gitlab-org/gitlab-build-images:ruby-2.7.2.patched-golang-1.14-git-2.31-lfs-2.9-chrome-89-node-14.15-yarn-1.22-postgresql-11-graphicsmagick-1.3.36 bash
Unable to find image 'registry.gitlab.com/gitlab-org/gitlab-build-images:ruby-2.7.2.patched-golang-1.14-git-2.31-lfs-2.9-chrome-89-node-14.15-yarn-1.22-postgresql-11-graphicsmagick-1.3.36' locally
ruby-2.7.2.patched-golang-1.14-git-2.31-lfs-2.9-chrome-89-node-14.15-yarn-1.22-postgresql-11-graphicsmagick-1.3.36: Pulling from gitlab-org/gitlab-build-images
bd8f6a7501cc: Already exists
750858b04380: Pull complete
3826b530b192: Downloading [======================>                            ]  111.2MB/242.5MB
3826b530b192: Pull complete
714f4683e9a8: Pull complete
f722c9addae9: Pull complete
3b9016f50984: Pull complete
1116e939b23c: Pull complete
da229ef1ac62: Pull complete
67025ec68add: Pull complete
2b071db8eead: Pull complete
04721e1e144e: Pull complete
08081dca5877: Pull complete
e94f33cf5d42: Pull complete
Digest: sha256:eb4fdeb3196481dd022d7d166468834e40a046b46a29e3349fb3bf3657176290
Status: Downloaded newer image for registry.gitlab.com/gitlab-org/gitlab-build-images:ruby-2.7.2.patched-golang-1.14-git-2.31-lfs-2.9-chrome-89-node-14.15-yarn-1.22-postgresql-11-graphicsmagick-1.3.36
gcc --version
root@785c57d8e498:/# gcc --version
gcc (Debian 8.3.0-6) 8.3.0
Copyright (C) 2018 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
root@785c57d8e498:/# ruby --version
ruby 2.7.2p137 (2020-10-01 revision 5445e04352) [x86_64-linux]

What did you do?

Still working on reproduction steps, but we upgraded from google-protobuf v3.14.0 to v3.15.8 and started seeing intermittent seg faults in CI. It looks like an issue with encoding a protobuf in gRPC.

We may need to turn on debug symbols in the protobuf.so because we aren't able to see the function and line numbers in the backtrace.

What did you expect to see

No seg fault

What did you see instead?

Seg fault

Make sure you include information that can help us debug (full error message, exception listing, stack trace, logs).

This is the relevant information from the job.log (job.log):

/builds/gitlab-org/gitlab-foss/vendor/ruby/2.7.0/gems/grpc-1.30.2-x86_64-linux/src/ruby/lib/grpc/generic/rpc_desc.rb:35: [BUG] Segmentation fault at 0x0000000000000000
ruby 2.7.2p137 (2020-10-01 revision 5445e04352) [x86_64-linux]

-- Control frame information -----------------------------------------------
c:0204 p:---- s:1305 e:001304 CFUNC  :encode
c:0203 p:0012 s:1300 e:001299 BLOCK  /builds/gitlab-org/gitlab-foss/vendor/ruby/2.7.0/gems/grpc-1.30.2-x86_64-linux/src/ruby/lib/grpc/generic/rpc_desc.rb:35
c:0202 p:0029 s:1296 e:001294 METHOD /builds/gitlab-org/gitlab-foss/vendor/ruby/2.7.0/gems/grpc-1.30.2-x86_64-linux/src/ruby/lib/grpc/generic/active_call.rb:438
c:0201 p:0013 s:1285 e:001284 BLOCK  /builds/gitlab-org/gitlab-foss/vendor/ruby/2.7.0/gems/grpc-1.30.2-x86_64-linux/src/ruby/lib/grpc/generic/client_stub.rb:347
c:0200 p:0013 s:1282 e:001281 METHOD /builds/gitlab-org/gitlab-foss/vendor/ruby/2.7.0/gems/grpc-1.30.2-x86_64-linux/src/ruby/lib/grpc/generic/interceptors.rb:170
c:0199 p:0093 s:1275 e:001274 METHOD /builds/gitlab-org/gitlab-foss/vendor/ruby/2.7.0/gems/grpc-1.30.2-x86_64-linux/src/ruby/lib/grpc/generic/client_stub.rb:346
c:0198 p:0070 s:1256 e:001255 BLOCK  /builds/gitlab-org/gitlab-foss/vendor/ruby/2.7.0/gems/grpc-1.30.2-x86_64-linux/src/ruby/lib/grpc/generic/service.rb:181 [FINISH]
c:0197 p:0063 s:1250 e:001249 METHOD /builds/gitlab-org/gitlab-foss/lib/gitlab/gitaly_client.rb:177
c:0196 p:0034 s:1238 e:001237 BLOCK  /builds/gitlab-org/gitlab-foss/lib/gitlab/gitaly_client/call.rb:18
c:0195 p:0024 s:1235 e:001234 METHOD /builds/gitlab-org/gitlab-foss/lib/gitlab/gitaly_client/call.rb:55
c:0194 p:0004 s:1231 e:001230 METHOD /builds/gitlab-org/gitlab-foss/lib/gitlab/gitaly_client/call.rb:17
c:0193 p:0047 s:1224 e:001223 METHOD /builds/gitlab-org/gitlab-foss/lib/gitlab/gitaly_client.rb:167
c:0192 p:0270 s:1212 e:001211 METHOD /builds/gitlab-org/gitlab-foss/lib/gitlab/gitaly_client/commit_service.rb:351
c:0191 p:0008 s:1205 e:001204 BLOCK  /builds/gitlab-org/gitlab-foss/lib/gitlab/git/repository.rb:355
c:0190 p:0005 s:1202 e:001201 METHOD /builds/gitlab-org/gitlab-foss/lib/gitlab/git/wraps_gitaly_errors.rb:7
c:0189 p:0102 s:1196 E:000ee8 METHOD /builds/gitlab-org/gitlab-foss/lib/gitlab/git/repository.rb:354 [FINISH]
c:0188 p:---- s:1189 e:001188 CFUNC  :public_send
-- C level backtrace information -------------------------------------------
/usr/local/lib/libruby.so.2.7(rb_vm_bugreport+0x562) [0x7f0f0004cb72] vm_dump.c:755
/usr/local/lib/libruby.so.2.7(rb_bug_for_fatal_signal+0xef) [0x7f0effe8029f] error.c:660
/usr/local/lib/libruby.so.2.7(sigsegv+0x52) [0x7f0efffb38b2] signal.c:946
/lib/x86_64-linux-gnu/libpthread.so.0(__restore_rt+0x0) [0x7f0effb82730]
/builds/gitlab-org/gitlab-foss/vendor/ruby/2.7.0/gems/google-protobuf-3.15.8-x86_64-linux/lib/google/2.7/protobuf_c.so(0x7f0ee7f123f1) [0x7f0ee7f123f1]
/builds/gitlab-org/gitlab-foss/vendor/ruby/2.7.0/gems/google-protobuf-3.15.8-x86_64-linux/lib/google/2.7/protobuf_c.so(0x7f0ee7f12ca5) [0x7f0ee7f12ca5]
/builds/gitlab-org/gitlab-foss/vendor/ruby/2.7.0/gems/google-protobuf-3.15.8-x86_64-linux/lib/google/2.7/protobuf_c.so(0x7f0ee7f09ae7) [0x7f0ee7f09ae7]
/usr/local/lib/libruby.so.2.7(vm_call_cfunc_with_frame+0x4c) [0x7f0f0002b110] vm_insnhelper.c:2514
/usr/local/lib/libruby.so.2.7(vm_call_cfunc) vm_insnhelper.c:2539
/usr/local/lib/libruby.so.2.7(vm_call_method_each_type+0xec) [0x7f0f0004441c] vm_insnhelper.c:2925
/usr/local/lib/libruby.so.2.7(vm_call_method+0x59) [0x7f0f00044bc9] vm_insnhelper.c:3026
/usr/local/lib/libruby.so.2.7(vm_call_opt_send+0x1ba) [0x7f0f0004525a] vm_insnhelper.c:2661
/usr/local/lib/libruby.so.2.7(vm_sendish+0x21) [0x7f0f00036855] vm_insnhelper.c:4023
/usr/local/lib/libruby.so.2.7(vm_exec_core) insns.def:801
/usr/local/lib/libruby.so.2.7(rb_vm_exec+0x156) [0x7f0f0003c5e6] vm.c:1920
/usr/local/lib/libruby.so.2.7(rb_ec_vm_ptr+0x0) [0x7f0f0003d56e] vm.c:1074
/usr/local/lib/libruby.so.2.7(rb_vm_global_hooks) vm_core.h:1949
/usr/local/lib/libruby.so.2.7(invoke_bmethod) vm.c:1076
/usr/local/lib/libruby.so.2.7(invoke_iseq_block_from_c) vm.c:1119
/usr/local/lib/libruby.so.2.7(invoke_block_from_c_proc) vm.c:1216
/usr/local/lib/libruby.so.2.7(rb_vm_invoke_bmethod) vm.c:1245
/usr/local/lib/libruby.so.2.7(vm_call_bmethod+0x93) [0x7f0f000442e3] vm_insnhelper.c:2570
/usr/local/lib/libruby.so.2.7(vm_call_method_each_type+0x30c) [0x7f0f0004463c] vm_insnhelper.c:2956
/usr/local/lib/libruby.so.2.7(vm_call_method+0x59) [0x7f0f00044bc9] vm_insnhelper.c:3026
/usr/local/lib/libruby.so.2.7(vm_call_opt_send+0x1ba) [0x7f0f0004525a] vm_insnhelper.c:2661
/usr/local/lib/libruby.so.2.7(vm_sendish+0x21) [0x7f0f00036855] vm_insnhelper.c:4023
/usr/local/lib/libruby.so.2.7(vm_exec_core) insns.def:801
/usr/local/lib/libruby.so.2.7(rb_vm_exec+0x156) [0x7f0f0003c5e6] vm.c:1920
/usr/local/lib/libruby.so.2.7(vm_call0_body+0x1a4) [0x7f0f0003e154] vm_eval.c:136
/usr/local/lib/libruby.so.2.7(rb_vm_call0+0xaf) [0x7f0f0003e95f] vm_eval.c:52
/usr/local/lib/libruby.so.2.7(rb_vm_call_kw+0x66) [0x7f0f0003ec16] vm_eval.c:268
/usr/local/lib/libruby.so.2.7(send_internal+0x165) [0x7f0f0003f2f5] vm_eval.c:1135
/usr/local/lib/libruby.so.2.7(send_internal_kw+0x3e) [0x7f0f0003f4c0] vm_eval.c:1158
/usr/local/lib/libruby.so.2.7(rb_f_public_send) vm_eval.c:1210
/usr/local/lib/libruby.so.2.7(vm_call_cfunc_with_frame+0x4c) [0x7f0f0002b110] vm_insnhelper.c:2514
/usr/local/lib/libruby.so.2.7(vm_call_cfunc) vm_insnhelper.c:2539
/usr/local/lib/libruby.so.2.7(vm_sendish+0x22) [0x7f0f0003690c] vm_insnhelper.c:4023
/usr/local/lib/libruby.so.2.7(vm_exec_core) insns.def:782
/usr/local/lib/libruby.so.2.7(rb_vm_exec+0x9cd) [0x7f0f0003ce5d] vm.c:1929
/usr/local/lib/libruby.so.2.7(invoke_block_from_c_bh+0x2d8) [0x7f0f00046ee8] vm.c:1044
/usr/local/lib/libruby.so.2.7(yield_under+0x1b1) [0x7f0f000473c1] vm.c:1171
/usr/local/lib/libruby.so.2.7(vm_call_cfunc_with_frame+0x4c) [0x7f0f0002b110] vm_insnhelper.c:2514
/usr/local/lib/libruby.so.2.7(vm_call_cfunc) vm_insnhelper.c:2539
/usr/local/lib/libruby.so.2.7(vm_call_method_each_type+0xec) [0x7f0f0004441c] vm_insnhelper.c:2925
/usr/local/lib/libruby.so.2.7(vm_call_method+0x59) [0x7f0f00044bc9] vm_insnhelper.c:3026
/usr/local/lib/libruby.so.2.7(vm_call_opt_send+0x1ba) [0x7f0f0004525a] vm_insnhelper.c:2661
/usr/local/lib/libruby.so.2.7(vm_sendish+0x22) [0x7f0f0003690c] vm_insnhelper.c:4023
/usr/local/lib/libruby.so.2.7(vm_exec_core) insns.def:782
/usr/local/lib/libruby.so.2.7(rb_vm_exec+0x156) [0x7f0f0003c5e6] vm.c:1920
/usr/local/lib/libruby.so.2.7(invoke_block+0xdc) [0x7f0f0004873b] vm.c:1044
/usr/local/lib/libruby.so.2.7(invoke_iseq_block_from_c) vm.c:1116
/usr/local/lib/libruby.so.2.7(invoke_block_from_c_bh) vm.c:1134
/usr/local/lib/libruby.so.2.7(vm_yield) vm.c:1179
/usr/local/lib/libruby.so.2.7(rb_yield_0) vm_eval.c:1227
/usr/local/lib/libruby.so.2.7(rb_yield_1) vm_eval.c:1233
/usr/local/lib/libruby.so.2.7(rb_yield) vm_eval.c:1243
/usr/local/lib/libruby.so.2.7(rb_array_len+0x0) [0x7f0effdf04d4] array.c:2135
/usr/local/lib/libruby.so.2.7(rb_ary_each) array.c:2134
/usr/local/lib/libruby.so.2.7(vm_call_cfunc_with_frame+0x4c) [0x7f0f0002b110] vm_insnhelper.c:2514
/usr/local/lib/libruby.so.2.7(vm_call_cfunc) vm_insnhelper.c:2539
/usr/local/lib/libruby.so.2.7(vm_sendish+0x22) [0x7f0f0003690c] vm_insnhelper.c:4023
/usr/local/lib/libruby.so.2.7(vm_exec_core) insns.def:782
/usr/local/lib/libruby.so.2.7(rb_vm_exec+0x156) [0x7f0f0003c5e6] vm.c:1920
/usr/local/lib/libruby.so.2.7(invoke_block+0xdc) [0x7f0f00046aeb] vm.c:1044
/usr/local/lib/libruby.so.2.7(invoke_iseq_block_from_c) vm.c:1116
/usr/local/lib/libruby.so.2.7(invoke_block_from_c_bh) vm.c:1134
/usr/local/lib/libruby.so.2.7(vm_yield) vm.c:1179
/usr/local/lib/libruby.so.2.7(rb_yield_0) vm_eval.c:1227
/usr/local/lib/libruby.so.2.7(catch_i) vm_eval.c:2228
/usr/local/lib/libruby.so.2.7(vm_catch_protect+0xb6) [0x7f0f000308a6] vm_eval.c:2310
/usr/local/lib/libruby.so.2.7(rb_catch_obj+0x2e) [0x7f0f000309be] vm_eval.c:2336
/usr/local/lib/libruby.so.2.7(vm_call_cfunc_with_frame+0x4c) [0x7f0f0002b110] vm_insnhelper.c:2514
/usr/local/lib/libruby.so.2.7(vm_call_cfunc) vm_insnhelper.c:2539
/usr/local/lib/libruby.so.2.7(vm_sendish+0x22) [0x7f0f0003690c] vm_insnhelper.c:4023
/usr/local/lib/libruby.so.2.7(vm_exec_core) insns.def:782
/usr/local/lib/libruby.so.2.7(rb_vm_exec+0x156) [0x7f0f0003c5e6] vm.c:1920
/usr/local/lib/libruby.so.2.7(invoke_block+0xdc) [0x7f0f0004873b] vm.c:1044
/usr/local/lib/libruby.so.2.7(invoke_iseq_block_from_c) vm.c:1116
/usr/local/lib/libruby.so.2.7(invoke_block_from_c_bh) vm.c:1134
/usr/local/lib/libruby.so.2.7(vm_yield) vm.c:1179
/usr/local/lib/libruby.so.2.7(rb_yield_0) vm_eval.c:1227
/usr/local/lib/libruby.so.2.7(rb_yield_1) vm_eval.c:1233
/usr/local/lib/libruby.so.2.7(rb_yield) vm_eval.c:1243
/usr/local/lib/libruby.so.2.7(rb_array_len+0x0) [0x7f0effdf04d4] array.c:2135
/usr/local/lib/libruby.so.2.7(rb_ary_each) array.c:2134
/usr/local/lib/libruby.so.2.7(vm_call_cfunc_with_frame+0x4c) [0x7f0f0002b110] vm_insnhelper.c:2514
/usr/local/lib/libruby.so.2.7(vm_call_cfunc) vm_insnhelper.c:2539
/usr/local/lib/libruby.so.2.7(vm_sendish+0x22) [0x7f0f0003690c] vm_insnhelper.c:4023
/usr/local/lib/libruby.so.2.7(vm_exec_core) insns.def:782
/usr/local/lib/libruby.so.2.7(rb_vm_exec+0x156) [0x7f0f0003c5e6] vm.c:1920
/usr/local/lib/libruby.so.2.7(vm_call0_body+0x1a4) [0x7f0f0003e154] vm_eval.c:136
/usr/local/lib/libruby.so.2.7(rb_vm_call0+0xaf) [0x7f0f0003e95f] vm_eval.c:52
/usr/local/lib/libruby.so.2.7(rb_vm_call_kw+0x66) [0x7f0f0003ec16] vm_eval.c:268
/usr/local/lib/libruby.so.2.7(rb_method_call_with_block_kw+0x7a) [0x7f0efff5dbaa] proc.c:2291
/usr/local/lib/libruby.so.2.7(rb_vm_pop_frame+0x0) [0x7f0f000330a5] vm_insnhelper.c:3220
/usr/local/lib/libruby.so.2.7(vm_yield_with_cfunc) vm_insnhelper.c:3221
/usr/local/lib/libruby.so.2.7(vm_invoke_ifunc_block+0x53) [0x7f0f0003326c] vm_insnhelper.c:3381
/usr/local/lib/libruby.so.2.7(vm_invoke_block) vm_insnhelper.c:3421
/usr/local/lib/libruby.so.2.7(vm_invoke_block_opt_call) vm_insnhelper.c:2680
/usr/local/lib/libruby.so.2.7(vm_sendish+0x21) [0x7f0f00036855] vm_insnhelper.c:4023
/usr/local/lib/libruby.so.2.7(vm_exec_core) insns.def:801
/usr/local/lib/libruby.so.2.7(rb_vm_exec+0x156) [0x7f0f0003c5e6] vm.c:1920
/usr/local/lib/libruby.so.2.7(invoke_iseq_block_from_c+0x139) [0x7f0f0003de68] vm.c:1116
/usr/local/lib/libruby.so.2.7(invoke_block_from_c_proc) vm.c:1216
/usr/local/lib/libruby.so.2.7(vm_invoke_proc) vm.c:1238
/usr/local/lib/libruby.so.2.7(rb_vm_invoke_proc) vm.c:1259
/usr/local/lib/libruby.so.2.7(thread_do_start+0x19d) [0x7f0effff986d] thread.c:697
/usr/local/lib/libruby.so.2.7(thread_start_func_2+0x25f) [0x7f0effffba2f] thread.c:745
/usr/local/lib/libruby.so.2.7(rb_native_cond_initialize+0x0) [0x7f0effffbf7c] thread_pthread.c:969
/usr/local/lib/libruby.so.2.7(register_cached_thread_and_wait) thread_pthread.c:1021
/usr/local/lib/libruby.so.2.7(thread_start_func_1) thread_pthread.c:976
/lib/x86_64-linux-gnu/libpthread.so.0(0x7fa3) [0x7f0effb77fa3]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x3f) [0x7f0eff6084cf]
@stanhu
Contributor Author

stanhu commented May 4, 2021

@haberman
Member

haberman commented May 4, 2021

Thanks for the thorough report. Yes, debugging symbols or a repro would help a lot here.

@tnir

tnir commented May 12, 2021

They are using grpc-1.30.2 https://github.com/grpc/grpc/releases/tag/v1.30.2 . Why not use the latest grpc (1.35.x)?

@tnir

tnir commented May 12, 2021

@stanhu Your bug report is not consistent with the log you provided:

Version: master/v3.6.0/v3.5.0 etc.
Language: Ruby/v3.15.8

Version: v3.15.8
Language: Ruby 2.7.2

Also you can add grpc version: 1.30.2.

@stanhu
Contributor Author

stanhu commented May 12, 2021

That was just a leftover from the template. The information was in the title.

@haberman
Member

Is there any chance you could reproduce this using a build of protobuf with symbols (and ideally debug symbols)?

If you check out this repo and run cd ruby && rake compile, you should be able to point your RUBYLIB at <repo-base>/ruby/lib.

I'm also interested in seeing if we can get symbols into our binary gems, but for now this would be the easiest way of getting a better stack trace.
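
One way to confirm that the locally compiled extension, rather than the prebuilt one, is what the process actually loaded is to look at Ruby's $LOADED_FEATURES after requiring the library. This is only a minimal sketch: the helper name loaded_protobuf_extensions is made up for illustration, and the .bundle suffix is an assumption for macOS builds.

```ruby
# Hypothetical helper (not part of google-protobuf): list every loaded file
# whose basename is protobuf_c with a native-extension suffix.
# `features` defaults to $LOADED_FEATURES, Ruby's list of required files.
def loaded_protobuf_extensions(features = $LOADED_FEATURES)
  features.grep(%r{(\A|/)protobuf_c\.(so|bundle)\z})
end

# Usage, after the application has required the library:
#   require 'google/protobuf'
#   puts loaded_protobuf_extensions
```

If the printed path is still under the vendored gems directory instead of <repo-base>/ruby/lib, the RUBYLIB override probably did not take effect.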

@stanhu
Contributor Author

stanhu commented May 19, 2021

@haberman Our team is trying to get this, but it's been tricky to reproduce the issue in CI. We tried doing this:

gem uninstall google-protobuf
gem install google-protobuf -v 3.15.8 --platform=ruby -- --with-cflags=-ggdb3

I think that's enough to get debug symbols, but we didn't see anything in the Ruby backtrace:

-- C level backtrace information -------------------------------------------
/usr/local/lib/libruby.so.2.7(rb_vm_bugreport+0x562) [0x7ff97fd6db72] vm_dump.c:755
/usr/local/lib/libruby.so.2.7(rb_bug_for_fatal_signal+0xef) [0x7ff97fba129f] error.c:660
/usr/local/lib/libruby.so.2.7(sigsegv+0x52) [0x7ff97fcd48b2] signal.c:946
/lib/x86_64-linux-gnu/libpthread.so.0(__restore_rt+0x0) [0x7ff97f8a3730]
/builds/gitlab-org/security/gitlab/vendor/ruby/2.7.0/gems/google-protobuf-3.15.8-x86_64-linux/lib/google/2.7/protobuf_c.so(0x7ff967cf63f1) [0x7ff967cf63f1]
/builds/gitlab-org/security/gitlab/vendor/ruby/2.7.0/gems/google-protobuf-3.15.8-x86_64-linux/lib/google/2.7/protobuf_c.so(0x7ff967cf6ca5) [0x7ff967cf6ca5]
/builds/gitlab-org/security/gitlab/vendor/ruby/2.7.0/gems/google-protobuf-3.15.8-x86_64-linux/lib/google/2.7/protobuf_c.so(0x7ff967cedae7) [0x7ff967cedae7]
/usr/local/lib/libruby.so.2.7(vm_call_cfunc_with_frame+0x4c) [0x7ff97fd4c110] vm_insnhelper.c:2514
/usr/local/lib/libruby.so.2.7(vm_call_cfunc) vm_insnhelper.c:2539
/usr/local/lib/libruby.so.2.7(vm_call_method_each_type+0xec) [0x7ff97fd6541c] vm_insnhelper.c:2925
/usr/local/lib/libruby.so.2.7(vm_call_method+0x59) [0x7ff97fd65bc9] vm_insnhelper.c:3026
/usr/local/lib/libruby.so.2.7(vm_call_opt_send+0x1ba) [0x7ff97fd6625a] vm_insnhelper.c:2661
/usr/local/lib/libruby.so.2.7(vm_sendish+0x21) [0x7ff97fd57855] vm_insnhelper.c:4023
/usr/local/lib/libruby.so.2.7(vm_exec_core) insns.def:801
/usr/local/lib/libruby.so.2.7(rb_vm_exec+0x156) [0x7ff97fd5d5e6] vm.c:1920
/usr/local/lib/libruby.so.2.7(rb_ec_vm_ptr+0x0) [0x7ff97fd5e56e] vm.c:1074
/usr/local/lib/libruby.so.2.7(rb_vm_global_hooks) vm_core.h:1949
/usr/local/lib/libruby.so.2.7(invoke_bmethod) vm.c:1076
/usr/local/lib/libruby.so.2.7(invoke_iseq_block_from_c) vm.c:1119
/usr/local/lib/libruby.so.2.7(invoke_block_from_c_proc) vm.c:1216
/usr/local/lib/libruby.so.2.7(rb_vm_invoke_bmethod) vm.c:1245
/usr/local/lib/libruby.so.2.7(vm_call_bmethod+0x93) [0x7ff97fd652e3] vm_insnhelper.c:2570

We did get a core dump. We may have to save the protobuf_c.so as well, but when I swapped it in with a compiled version in the same environment, I got this. I'm not sure if this is an improvement, but the ruby-upb.c line is intriguing:

Thread 1 (Thread 0x7ff9221f8700 (LWP 752)):
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x00007ff97f252535 in __GI_abort () at abort.c:79
#2  0x00007ff97fb0c7a8 in die () at error.c:664
#3  rb_bug_for_fatal_signal (default_sighandler=0x0, sig=sig@entry=11, ctx=ctx@entry=0x7ff922444a80, fmt=fmt@entry=0x7ff97fdb0fcb "Segmentation fault at %p") at error.c:664
#4  0x00007ff97fcd48b2 in sigsegv (sig=11, info=0x7ff922444bb0, ctx=0x7ff922444a80) at signal.c:946
#5  <signal handler called>
#6  0x00007ff967cf63f1 in jsondec_streql (str=..., lit=0x7ff948619150 "a ") at ruby-upb.c:6776
#7  0x00007ff9221f3638 in ?? ()
#8  0x00007ff9231b7998 in ?? ()
#9  0x00007ff96c3d93a0 in ?? ()
#10 0x00007ff9220f3c10 in ?? ()
#11 0x00007ff97fd477b2 in vm_trace (ec=0x7ff96c3740a0, reg_cfp=0x3, pc=<optimized out>) at vm_insnhelper.c:4822
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
(gdb)

@stanhu
Contributor Author

stanhu commented May 20, 2021

Details from the core dump:

<snip>
(gdb) up
#6  0x00007ff967cf63f1 in jsondec_streql (str=..., lit=0x7ff948619150 "a ") at ruby-upb.c:6776
6776      return str.size == strlen(lit) && memcmp(str.data, lit, str.size) == 0;
(gdb) p lit
$1 = 0x7ff948619150 "a "
(gdb) p str.size
$2 = 140707995008080
(gdb) p strlen(lit)
You can't do that without a process to debug.
(gdb) p str.data
$3 = 0x87f8dcf44e7d0 <error: Cannot access memory at address 0x87f8dcf44e7d0>
(gdb) pt str.size
type = unsigned long
(gdb) p str
$4 = {data = 0x87f8dcf44e7d0 <error: Cannot access memory at address 0x87f8dcf44e7d0>, size = 140707995008080}

Is str.size corrupt here?
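
Printing the two fields in hex makes the question easier to judge. This is only a sketch using the values from the gdb session above. The 47-bit user-space limit is an assumption about a typical x86-64 Linux setup, not something taken from the core dump, and the helper names are made up.

```ruby
# Format an integer the way gdb prints pointers.
def hex(value)
  format('%#x', value)
end

# Assumption: user-space addresses on x86-64 Linux fit in 47 bits.
def plausible_user_address?(value, limit = 1 << 47)
  value.between?(0, limit - 1)
end

str_size = 140_707_995_008_080 # gdb: p str.size
str_data = 0x87f8dcf44e7d0     # gdb: p str.data

puts "str.size = #{hex(str_size)}"
puts "str.data in user-space range? #{plausible_user_address?(str_data)}"
```

With these inputs str.size comes out as 0x7ff9220f3c50, which sits in the same 0x7ff9... range as the pointers in the backtrace, and str.data fails the range check. Both are consistent with the struct holding garbage rather than a real length and pointer, though that alone doesn't say where the corruption came from.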

@haberman
Member

> I think [-ggdb3] is enough to get debug symbols, but we didn't see anything in the Ruby backtrace:

It seems like that should be more than enough. Any build that does not actively call strip at the end should at least get us a function-level stack trace; with -g we should also get line numbers.

If you're not getting a good stack trace with that command, it seems that either you're still getting the stripped prebuilt or the local build process is stripping the .so somehow.

> I'm not sure if this is an improvement, but the ruby-upb.c line is intriguing:

The jsondec_streql() frame is inside the JSON decoder. Unless you were decoding JSON, I don't think this is a reliable symbol, unfortunately. I thought your crash from before was inside an :encode call, i.e. encoding to binary?

@stanhu
Contributor Author

stanhu commented May 20, 2021

Yes, it is. I found that the gem uninstall and gem install didn't uninstall/install the library in the right place for a bundle install configured with a vendor path. I'm now manually tweaking this and re-running our tests.
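
For anyone hitting the same thing: the gem directory in the C-level backtrace shows which build was loaded. In the traces above, the prebuilt gem appears as google-protobuf-3.15.8-x86_64-linux. Here is a rough check, assuming a source build installs into a directory without the platform suffix (the helper names are made up for illustration):

```ruby
# Extract the google-protobuf gem directory name from a backtrace line,
# or nil if the line does not reference that gem.
def protobuf_gem_dir(line)
  line[%r{/gems/(google-protobuf-[^/]+)/}, 1]
end

# Heuristic: a prebuilt gem directory has a platform suffix after the
# version, e.g. google-protobuf-3.15.8-x86_64-linux.
def prebuilt_gem?(line)
  dir = protobuf_gem_dir(line)
  !dir.nil? && dir.match?(/\Agoogle-protobuf-\d+(\.\d+)*-.+\z/)
end

line = '/builds/gitlab-org/security/gitlab/vendor/ruby/2.7.0/gems/' \
       'google-protobuf-3.15.8-x86_64-linux/lib/google/2.7/protobuf_c.so(0x7ff967cf63f1)'
puts "#{protobuf_gem_dir(line)} prebuilt? #{prebuilt_gem?(line)}"
```

If this still reports the prebuilt directory after a gem install with --platform=ruby, the rebuilt extension is not the one the tests are loading.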

@stanhu
Contributor Author

stanhu commented May 21, 2021

Ok, we have a real stack trace now!

-- Machine register context ------------------------------------------------
 RIP: 0x00007fa1154a374e RBP: 0x000879b09d713ab0 RSP: 0x00007fa0e56f21d0
 RAX: 0x0007fa0fa33d6300 RBX: 0x00007fa0fa33d7c0 RCX: 0x00007fa1154b752c
 RDX: 0x00007fa1154a36f7 RDI: 0x00007fa1154b74e0 RSI: 0x00007fa0e5168d90
  R8: 0x0000000000000001  R9: 0x00007fa0ef790280 R10: 0xffffffffffffffa0
 R11: 0x0000000000000158 R12: 0x00007fa0fe2c1e10 R13: 0x0000000000000003
 R14: 0x00007fa111ee1dd0 R15: 0x00007fa0e56f22e0 EFL: 0x0000000000010202

-- C level backtrace information -------------------------------------------
/usr/local/lib/libruby.so.2.7(rb_vm_bugreport+0x562) [0x7fa12d503b72] vm_dump.c:755
/usr/local/lib/libruby.so.2.7(rb_bug_for_fatal_signal+0xef) [0x7fa12d33729f] error.c:660
/usr/local/lib/libruby.so.2.7(sigbus+0x52) [0x7fa12d46a912] signal.c:932
/lib/x86_64-linux-gnu/libpthread.so.0(__restore_rt+0x0) [0x7fa12d039730]
/builds/gitlab-org/security/gitlab/vendor/ruby/2.7.0/gems/google-protobuf-3.15.8/lib/google/protobuf_c.so(encode_array+0x546) [0x7fa1154a374e] ruby-upb.c:1219
/builds/gitlab-org/security/gitlab/vendor/ruby/2.7.0/gems/google-protobuf-3.15.8/lib/google/protobuf_c.so(encode_message) ruby-upb.c:1345
/builds/gitlab-org/security/gitlab/vendor/ruby/2.7.0/gems/google-protobuf-3.15.8/lib/google/protobuf_c.so(upb_encode_ex+0xa2) [0x7fa1154a3a22] ruby-upb.c:1374
/builds/gitlab-org/security/gitlab/vendor/ruby/2.7.0/gems/google-protobuf-3.15.8/lib/google/protobuf_c.so(upb_encode+0x17) [0x7fa11549b146] ruby-upb.h:1874
/builds/gitlab-org/security/gitlab/vendor/ruby/2.7.0/gems/google-protobuf-3.15.8/lib/google/protobuf_c.so(Message_encode) message.c:1013
/usr/local/lib/libruby.so.2.7(vm_call_cfunc_with_frame+0x4c) [0x7fa12d4e2110] vm_insnhelper.c:2514
/usr/local/lib/libruby.so.2.7(vm_call_cfunc) vm_insnhelper.c:2539
/usr/local/lib/libruby.so.2.7(vm_call_method_each_type+0xec) [0x7fa12d4fb41c] vm_insnhelper.c:2925
/usr/local/lib/libruby.so.2.7(vm_call_method+0x59) [0x7fa12d4fbbc9] vm_insnhelper.c:3026
/usr/local/lib/libruby.so.2.7(vm_call_opt_send+0x1ba) [0x7fa12d4fc25a] vm_insnhelper.c:2661
/usr/local/lib/libruby.so.2.7(vm_sendish+0x21) [0x7fa12d4ed855] vm_insnhelper.c:4023
/usr/local/lib/libruby.so.2.7(vm_exec_core) insns.def:801
/usr/local/lib/libruby.so.2.7(rb_vm_exec+0x156) [0x7fa12d4f35e6] vm.c:1920
/usr/local/lib/libruby.so.2.7(rb_ec_vm_ptr+0x0) [0x7fa12d4f456e] vm.c:1074
/usr/local/lib/libruby.so.2.7(rb_vm_global_hooks) vm_core.h:1949
/usr/local/lib/libruby.so.2.7(invoke_bmethod) vm.c:1076
/usr/local/lib/libruby.so.2.7(invoke_iseq_block_from_c) vm.c:1119
/usr/local/lib/libruby.so.2.7(invoke_block_from_c_proc) vm.c:1216
/usr/local/lib/libruby.so.2.7(rb_vm_invoke_bmethod) vm.c:1245
/usr/local/lib/libruby.so.2.7(vm_call_bmethod+0x93) [0x7fa12d4fb2e3] vm_insnhelper.c:2570
/usr/local/lib/libruby.so.2.7(vm_call_method_each_type+0x30c) [0x7fa12d4fb63c] vm_insnhelper.c:2956
/usr/local/lib/libruby.so.2.7(vm_call_method+0x59) [0x7fa12d4fbbc9] vm_insnhelper.c:3026
/usr/local/lib/libruby.so.2.7(vm_call_opt_send+0x1ba) [0x7fa12d4fc25a] vm_insnhelper.c:2661
/usr/local/lib/libruby.so.2.7(vm_sendish+0x21) [0x7fa12d4ed855] vm_insnhelper.c:4023

@stanhu
Contributor Author

stanhu commented May 21, 2021

More details from the core dump:

(gdb) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x00007fa12c9e8535 in __GI_abort () at abort.c:79
#2  0x00007fa12d2a27a8 in die () at error.c:664
#3  rb_bug_for_fatal_signal (default_sighandler=0x0, sig=sig@entry=7, ctx=ctx@entry=0x7fa0e9c09bc0, fmt=fmt@entry=0x7fa12d546fe8 "Bus Error at %p") at error.c:664
#4  0x00007fa12d46a912 in sigbus (sig=7, info=0x7fa0e9c09cf0, ctx=0x7fa0e9c09bc0) at signal.c:932
#5  <signal handler called>
#6  encode_array (f=0x7fa111ee1dd0, m=0x7fa111e7a120, field_mem=<optimized out>, e=0x7fa0e56f22e0) at ruby-upb.c:1220
#7  encode_message (e=e@entry=0x7fa0e56f22e0, msg=msg@entry=0x7fa0e5168798 "\022", m=m@entry=0x7fa111e7a120, size=size@entry=0x7fa0e56f2408) at ruby-upb.c:1345
#8  0x00007fa1154a3a22 in upb_encode_ex (msg=0x7fa0e5168798, l=0x7fa111e7a120, options=options@entry=0, arena=arena@entry=0x7fa0fe2c1e90, size=size@entry=0x7fa0e56f2408) at ruby-upb.c:1374
#9  0x00007fa11549b146 in upb_encode (size=0x7fa0e56f2408, arena=0x7fa0fe2c1e90, l=<optimized out>, msg=<optimized out>) at ruby-upb.h:1874
#10 Message_encode (klass=140329767403240, msg_rb=140329132679280) at message.c:1013
#11 0x00007fa12d4e2110 in vm_call_cfunc_with_frame (empty_kw_splat=<optimized out>, cd=0x7fa0e56f2620, calling=<optimized out>, reg_cfp=0x7fa0e57f2e28, ec=0x7fa109a35450) at vm_insnhelper.c:2514
#12 vm_call_cfunc (ec=0x7fa109a35450, reg_cfp=0x7fa0e57f2e28, calling=<optimized out>, cd=0x7fa0e56f2620) at vm_insnhelper.c:2539
#13 0x00007fa12d4fb41c in vm_call_method_each_type (ec=0x7fa109a35450, cfp=0x7fa0e57f2e28, calling=0x7fa0e56f2720, cd=0x7fa0e56f2620) at vm_insnhelper.c:2925
#14 0x00007fa12d4fbbc9 in vm_call_method_each_type (cd=<optimized out>, calling=<optimized out>, cfp=<optimized out>, ec=<optimized out>) at vm_insnhelper.c:3026
#15 vm_call_method (ec=0x7fa109a35450, cfp=0x7fa0e57f2e28, calling=<optimized out>, cd=<optimized out>) at vm_insnhelper.c:3053
#16 0x00007fa12d4fc25a in vm_call_opt_send (ec=0x7fa109a35450, reg_cfp=0x7fa0e57f2e28, calling=0x7fa0e56f2720, orig_cd=<optimized out>) at vm_insnhelper.c:2661
#17 0x00007fa12d4ed855 in vm_sendish (block_handler=<optimized out>, method_explorer=<optimized out>, cd=<optimized out>, reg_cfp=<optimized out>, ec=<optimized out>) at vm_insnhelper.c:4023
#18 vm_exec_core (ec=0x7fa1154b74e0, initial=140329014955408) at insns.def:801
#19 0x00007fa12d4f35e6 in rb_vm_exec (ec=0x7fa109a35450, mjit_enable_p=1) at vm.c:1920
#20 0x00007fa12d4f456e in invoke_bmethod (captured=0x8, opt_pc=<optimized out>, type=<optimized out>, me=0x7fa111f87a78, self=140329291336280, iseq=<optimized out>, ec=0x7fa109a35450) at vm.c:1074
#21 invoke_iseq_block_from_c (me=0x7fa111f87a78, is_lambda=<optimized out>, cref=0x0, passed_block_handler=0, kw_splat=<optimized out>, argv=<optimized out>, argc=<optimized out>, self=140329291336280, captured=0x8,
    ec=0x7fa109a35450) at vm.c:1119
#22 invoke_block_from_c_proc (me=0x7fa111f87a78, is_lambda=<optimized out>, passed_block_handler=0, kw_splat=<optimized out>, argv=<optimized out>, argc=<optimized out>, self=140329291336280, proc=<optimized out>,
    ec=0x7fa109a35450) at vm.c:1216
#23 rb_vm_invoke_bmethod (ec=0x7fa109a35450, proc=<optimized out>, self=140329291336280, argc=<optimized out>, argv=<optimized out>, kw_splat=<optimized out>, block_handler=0, me=0x7fa111f87a78) at vm.c:1245
#24 0x00007fa12d4fb2e3 in vm_call_bmethod_body (argv=<optimized out>, cd=0x7fa0e56f2b40, calling=0x7fa0e56f2c40, ec=0x7fa109a35450) at vm_insnhelper.c:2570
#25 vm_call_bmethod (ec=ec@entry=0x7fa109a35450, cfp=cfp@entry=0x7fa0e57f2f78, calling=calling@entry=0x7fa0e56f2c40, cd=cd@entry=0x7fa0e56f2b40) at vm_insnhelper.c:2590
#26 0x00007fa12d4fb63c in vm_call_method_each_type (ec=0x7fa109a35450, cfp=0x7fa0e57f2f78, calling=0x7fa0e56f2c40, cd=0x7fa0e56f2b40) at vm_insnhelper.c:2956
#27 0x00007fa12d4fbbc9 in vm_call_method_each_type (cd=<optimized out>, calling=<optimized out>, cfp=<optimized out>, ec=<optimized out>) at vm_insnhelper.c:3026
#28 vm_call_method (ec=0x7fa109a35450, cfp=0x7fa0e57f2f78, calling=<optimized out>, cd=<optimized out>) at vm_insnhelper.c:3053
#29 0x00007fa12d4fc25a in vm_call_opt_send (ec=0x7fa109a35450, reg_cfp=0x7fa0e57f2f78, calling=0x7fa0e56f2c40, orig_cd=<optimized out>) at vm_insnhelper.c:2661
#30 0x00007fa12d4ed855 in vm_sendish (block_handler=<optimized out>, method_explorer=<optimized out>, cd=<optimized out>, reg_cfp=<optimized out>, ec=<optimized out>) at vm_insnhelper.c:4023
#31 vm_exec_core (ec=0x7fa1154b74e0, initial=140329014955408) at insns.def:801
#32 0x00007fa12d4f35e6 in rb_vm_exec (ec=0x7fa109a35450, mjit_enable_p=1) at vm.c:1920
#33 0x00007fa12d4f5154 in vm_call0_body (ec=0x7fa109a35450, calling=0x7fa0e56f2ea0, cd=<optimized out>, argv=<optimized out>) at vm_eval.c:136
#34 0x00007fa12d4f595f in rb_vm_call0 (ec=ec@entry=0x7fa109a35450, recv=recv@entry=140329274151520, id=id@entry=22065, argc=<optimized out>, argv=<optimized out>, me=me@entry=0x7fa11282cb38, kw_splat=0)
    at vm_eval.c:52
#35 0x00007fa12d4f5c16 in rb_vm_call_kw (ec=ec@entry=0x7fa109a35450, recv=recv@entry=140329274151520, id=22065, argc=<optimized out>, argc@entry=1, argv=<optimized out>, argv@entry=0x7fa0e56fb588, me=0x7fa11282cb38,
    kw_splat=<optimized out>) at vm_eval.c:268
#36 0x00007fa12d4f606d in rb_call0 (ec=ec@entry=0x7fa109a35450, recv=recv@entry=140329274151520, mid=<optimized out>, argc=argc@entry=1, argv=argv@entry=0x7fa0e56fb588, call_scope=call_scope@entry=CALL_PUBLIC,
    self=<optimized out>) at vm_eval.c:392
#37 0x00007fa12d4f62f5 in send_internal (argc=1, argv=0x7fa0e56fb588, recv=recv@entry=140329274151520, scope=CALL_PUBLIC) at vm_eval.c:1135
#38 0x00007fa12d4f64c0 in send_internal_kw (scope=<optimized out>, recv=140329274151520, argv=<optimized out>, argc=<optimized out>) at vm_eval.c:1158
#39 rb_f_public_send (argc=<optimized out>, argv=<optimized out>, recv=140329274151520) at vm_eval.c:1210
#40 0x00007fa12d4e2110 in vm_call_cfunc_with_frame (empty_kw_splat=<optimized out>, cd=0x7fa1075890b0, calling=<optimized out>, reg_cfp=0x7fa0e57f31a8, ec=0x7fa109a35450) at vm_insnhelper.c:2514
#41 vm_call_cfunc (ec=0x7fa109a35450, reg_cfp=0x7fa0e57f31a8, calling=<optimized out>, cd=0x7fa1075890b0) at vm_insnhelper.c:2539
#42 0x00007fa12d4ed90c in vm_sendish (method_explorer=<optimized out>, block_handler=<optimized out>, cd=<optimized out>, reg_cfp=<optimized out>, ec=<optimized out>) at vm_insnhelper.c:4023
#43 vm_exec_core (ec=0x7fa1154b74e0, initial=140329014955408) at insns.def:782
#44 0x00007fa12d4f35e6 in rb_vm_exec (ec=0x7fa109a35450, mjit_enable_p=1) at vm.c:1920

If I try to look at the state:

(gdb) up
#6  encode_array (f=0x7fa111ee1dd0, m=0x7fa111e7a120, field_mem=<optimized out>, e=0x7fa0e56f22e0) at ruby-upb.c:1220
1220    ruby-upb.c: No such file or directory.
(gdb) p f
$1 = (const upb_msglayout_field *) 0x7fa111ee1dd0
(gdb) p *f
$2 = {number = 5, offset = 40, presence = 0, submsg_index = 0, descriptortype = 12 '\f', label = 3 '\003'}
(gdb) p *m
$3 = {submsgs = 0x7fa111ee1e60, fields = 0x7fa111ee1da0, size = 120, field_count = 16, extendable = false, table_mask = 0 '\000', fasttable = 0x7fa111e7a138}
(gdb) p *e
$4 = {err = {{__jmpbuf = {140329132679280, -568946320671323410, 140329012979296, 140329767403240, 1431634051, 140329021812264, -568996358774666514, -569313266418136338}, __mask_was_saved = 0, __saved_mask = {__val = {
          572653601, 140329020761168, 140330226628070, 140329901323840, 0, 8144233504, 140330226635092, 20, 0, 140329628161104, 140329819479665, 0, 140329020761168, 140329819476824, 140330224962631, 8}}}},
  alloc = 0x7fa0fe2c1e90, buf = 0x7fa0fe2c1d90 "", ptr = 0x7fa0fe2c1e0a "@\001z\002\b\001", limit = 0x7fa0fe2c1e10 "", options = 0, depth = 64, sorter = {entries = 0x0, size = 0, cap = 0}}
(gdb) p start
$5 = (const upb_strview *) 0x7fa0fa33d7c0
(gdb) p arr->len
value has been optimized out
(gdb) p *ptr
Cannot access memory at address 0x879b09d713ab0
(gdb) p arr
$13 = <optimized out>
(gdb) p *start
$14 = {data = 0x2061 <error: Cannot access memory at address 0x2061>, size = 140330075065280}
(gdb) up
#7  encode_message (e=e@entry=0x7fa0e56f22e0, msg=msg@entry=0x7fa0e5168798 "\022", m=m@entry=0x7fa111e7a120, size=size@entry=0x7fa0e56f2408) at ruby-upb.c:1345
1345    in ruby-upb.c
(gdb) p *e
$5 = {err = {{__jmpbuf = {140329132679280, -568946320671323410, 140329012979296, 140329767403240, 1431634051, 140329021812264, -568996358774666514, -569313266418136338}, __mask_was_saved = 0, __saved_mask = {__val = {
          572653601, 140329020761168, 140330226628070, 140329901323840, 0, 8144233504, 140330226635092, 20, 0, 140329628161104, 140329819479665, 0, 140329020761168, 140329819476824, 140330224962631, 8}}}},
  alloc = 0x7fa0fe2c1e90, buf = 0x7fa0fe2c1d90 "", ptr = 0x7fa0fe2c1e0a "@\001z\002\b\001", limit = 0x7fa0fe2c1e10 "", options = 0, depth = 64, sorter = {entries = 0x0, size = 0, cap = 0}}
(gdb) p *msg
$6 = 18 '\022'
(gdb) p size
$7 = (size_t *) 0x7fa0e56f2408
(gdb) p *size
$8 = 140329628161104

@stanhu
Contributor Author

stanhu commented May 21, 2021

More context: the protobuf encoding was attempted by this gRPC call:

-- Control frame information -----------------------------------------------
c:0324 p:---- s:2191 e:002190 CFUNC  :encode
c:0323 p:0012 s:2186 e:002185 BLOCK  /builds/gitlab-org/security/gitlab/vendor/ruby/2.7.0/gems/grpc-1.30.2/src/ruby/lib/grpc/generic/rpc_desc.rb:35
c:0322 p:0029 s:2182 e:002180 METHOD /builds/gitlab-org/security/gitlab/vendor/ruby/2.7.0/gems/grpc-1.30.2/src/ruby/lib/grpc/generic/active_call.rb:438
c:0321 p:0013 s:2171 e:002170 BLOCK  /builds/gitlab-org/security/gitlab/vendor/ruby/2.7.0/gems/grpc-1.30.2/src/ruby/lib/grpc/generic/client_stub.rb:347
c:0320 p:0013 s:2168 e:002167 METHOD /builds/gitlab-org/security/gitlab/vendor/ruby/2.7.0/gems/grpc-1.30.2/src/ruby/lib/grpc/generic/interceptors.rb:170
c:0319 p:0093 s:2161 e:002160 METHOD /builds/gitlab-org/security/gitlab/vendor/ruby/2.7.0/gems/grpc-1.30.2/src/ruby/lib/grpc/generic/client_stub.rb:346
c:0318 p:0070 s:2142 e:002141 BLOCK  /builds/gitlab-org/security/gitlab/vendor/ruby/2.7.0/gems/grpc-1.30.2/src/ruby/lib/grpc/generic/service.rb:181 [FINISH]
c:0317 p:0063 s:2136 e:002135 METHOD /builds/gitlab-org/security/gitlab/lib/gitlab/gitaly_client.rb:177
c:0316 p:0034 s:2124 e:002123 BLOCK  /builds/gitlab-org/security/gitlab/lib/gitlab/gitaly_client/call.rb:18
c:0315 p:0024 s:2121 e:002120 METHOD /builds/gitlab-org/security/gitlab/lib/gitlab/gitaly_client/call.rb:55
c:0314 p:0004 s:2117 e:002116 METHOD /builds/gitlab-org/security/gitlab/lib/gitlab/gitaly_client/call.rb:17
c:0313 p:0047 s:2110 e:002109 METHOD /builds/gitlab-org/security/gitlab/lib/gitlab/gitaly_client.rb:167
c:0312 p:0270 s:2098 e:002097 METHOD /builds/gitlab-org/security/gitlab/lib/gitlab/gitaly_client/commit_service.rb:351
c:0311 p:0008 s:2091 e:002090 BLOCK  /builds/gitlab-org/security/gitlab/lib/gitlab/git/repository.rb:355
c:0310 p:0005 s:2088 e:002087 METHOD /builds/gitlab-org/security/gitlab/lib/gitlab/git/wraps_gitaly_errors.rb:7
c:0309 p:0100 s:2082 E:000838 METHOD /builds/gitlab-org/security/gitlab/lib/gitlab/git/repository.rb:354 [FINISH]

I believe the protobuf definition is:

message FindCommitsRequest {
  Repository repository = 1 [(target_repository)=true];
  bytes revision = 2;
  int32 limit = 3;
  int32 offset = 4;
  repeated bytes paths = 5;
  bool follow = 6;
  bool skip_merges = 7;
  bool disable_walk = 8;
  google.protobuf.Timestamp after = 9;
  google.protobuf.Timestamp before = 10;
  // all and revision are mutually exclusive
  bool all = 11;
  bool first_parent = 12;
  bytes author = 13;
  enum Order {
    NONE = 0;
    TOPO = 1;
  }
  Order order = 14;
  GlobalOptions global_options = 15;
  bool trailers = 16;
}

With the gRPC definition:

  rpc FindCommits(FindCommitsRequest) returns (stream FindCommitsResponse) {
    option (op_type) = {
      op: ACCESSOR
    };
  }

Likely this is an encoding issue with repeated bytes paths?
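
For reference, the gdb dump above shows number = 5, descriptortype = 12 (TYPE_BYTES in descriptor.proto) and label = 3 (LABEL_REPEATED), which matches repeated bytes paths = 5. A minimal sketch of what the encoder is expected to emit for that field, written in plain Ruby. The helper names are hypothetical and this is not the upb implementation; it only illustrates the wire format (one tag, length, payload triple per element):

```ruby
# Sketch of the protobuf wire format for a `repeated bytes` field.
# Helper names are made up for illustration; this is not ruby-upb.c.

WIRE_TYPE_LENGTH_DELIMITED = 2

# Encode a non-negative integer as a base-128 varint (low 7 bits first).
def encode_varint(value)
  bytes = []
  loop do
    byte = value & 0x7f
    value >>= 7
    if value.zero?
      bytes << byte
      break
    end
    bytes << (byte | 0x80) # set the continuation bit
  end
  bytes.pack("C*")
end

# Each element of a repeated bytes field is written as its own
# tag + length + payload record.
def encode_repeated_bytes(field_number, values)
  tag = encode_varint((field_number << 3) | WIRE_TYPE_LENGTH_DELIMITED)
  values.map do |value|
    payload = value.b
    tag + encode_varint(payload.bytesize) + payload
  end.join.b
end

# Field 5 with a single path: bytes 0x2a, 0x09, then "README.md".
encoded = encode_repeated_bytes(5, ["README.md"])
```

To write these records the encoder has to read each element's data pointer and size, which is where the backtrace shows it failing (`*start` points at unreadable memory).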

@haberman
Member

Thanks for this excellent information!

Looking at the code in question, it is relatively straightforward and there is not a lot of room for a bug to live. It seems likely that the bug is not in this code per se, but that the proto was somehow corrupted before reaching this point.

Is there any chance the bug would reproduce under Valgrind? That could give some extra information to validate the status of the memory that the encoder is trying to read.

@stanhu
Contributor Author

stanhu commented May 21, 2021

Unfortunately, I don't have a concrete reproduction step yet to use Valgrind. Right now I can only retry our build process, which takes an hour to finish.

Will compiling this Ruby extension with AddressSanitizer via -fsanitize=address help? Let me find out.

@haberman
Member

I think -fsanitize=address could help, but you might have to build Ruby with this flag also to make it work. I generally have had trouble getting ASAN to work for Ruby extensions.

I'll brainstorm some more about the best way to debug this; I don't want to send you on too many wild goose chases.

Does the issue repro fairly quickly once the process has started, or is it more intermittent?

@stanhu
Contributor Author

stanhu commented May 22, 2021

Does the issue repro fairly quickly once the process has started, or is it more intermittent?

It's intermittent, but annoying enough that it happens a few times a day. It definitely started with the upgrade to v3.15.8. We reverted to v3.14.0 and never hit the error. We had to upgrade to v3.15.8 again a few weeks after the revert, and now we see the issue again. We just retry the build whenever it happens.

I'm not sure if it ever happens in production, though. I'll have to look for that.

@haberman
Member

Ah I see, if it's that intermittent then debug printf might not be the best way to tackle this one.

Yes, it totally makes sense that 3.15 would have been the point where this was introduced. There was a major change in 3.15, basically a rewrite of the data layer (#8184), which improved performance and simplified the code a lot. But like all new code, it's had some bugs to work through. Thanks for your patience while we figure them out.

Can you give a summary of how this message (the one that is crashing in #encode) is built? Is it built from scratch, or was it originally parsed from a different payload? Is there anything particularly interesting about the way that it is built? In particular, how are the values in this repeated bytes field (repeated bytes paths) populated?

@stanhu
Contributor Author

stanhu commented May 22, 2021

The message is built from scratch: https://gitlab.com/gitlab-org/gitlab/-/blob/ffed5fe4139bb4a49cf1b262dfabff16b7dd38fc/lib/gitlab/gitaly_client/commit_service.rb#L331-351

Line 349 is where the repeated call happens. In the test that hit the error above, I think the value of options[:path] should just be README.md or some other filename.

The encode_binary just forces a Ruby string to a binary encoding: https://gitlab.com/gitlab-org/gitlab/-/blob/ffed5fe4139bb4a49cf1b262dfabff16b7dd38fc/lib/gitlab/encoding_helper.rb#L89-91
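
A minimal sketch of what such a helper might look like, assuming it only duplicates the string and relabels its encoding. This is a hypothetical stand-in for illustration, not a copy of the GitLab source linked above:

```ruby
# Hypothetical stand-in for the linked helper; the real method may differ.
# It relabels the bytes as binary (ASCII-8BIT) without changing them.
def encode_binary(str)
  return "".b if str.nil?

  # dup so the caller's string keeps its original encoding
  str.dup.force_encoding(Encoding::BINARY)
end

path = "README.md"
binary = encode_binary(path)
```

The byte content is unchanged; only the encoding label differs, so the result can be handed to a bytes field.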

@haberman
Member

Thank you for the code reference; that was enough of a lead that I was able to find the bug!

See the attached PR, which I am confident will fix the crashes you are seeing. I was able to reproduce the issue under Valgrind and verify that my fix removes the Valgrind error.

This fix should be released in 3.17.1, which I expect will be released on Monday or Tuesday.

@stanhu
Contributor Author

stanhu commented May 22, 2021

@haberman Awesome, thank you very much!

@haberman
Member

@stanhu You're welcome!

Small update: this will be in 3.17.2, not 3.17.1. The 3.17.1 release was already in progress when I submitted this. But 3.17.2 should still be released in the next day or two.

@stanhu
Contributor Author

stanhu commented May 24, 2021

@haberman I think 3.17.1 has this fix. 😄

commit 367e4691d2bc97de28b422d7461e4135d556d301 (tag: v3.17.1, origin/3.17.x)
Author: Joshua Haberman <jhaberman@gmail.com>
Date:   Fri May 21 23:04:09 2021 -0700

    Fixed memory bug: properly root repeated/map field when assigning. (#8639)

    * Fixed memory bug: properly root repeated/map field when assigning.

    Previously the protobuf extension would not properly root
    memory from a repeated field or map when assigning to a
    message field (see the attached test case).  This could cause
    crashes if the repeated field is subsequently accessed.

    * Add accidentally-deleted Ruby test.

@haberman
Member

Excellent, even better. :)

@stanhu
Contributor Author

stanhu commented May 24, 2021

Now I wonder how long it takes for this release to land in Rubygems. https://rubygems.org/gems/google-protobuf

@haberman
Member

I think it is up now: https://rubygems.org/gems/google-protobuf/versions/3.17.1-x86_64-linux

@stanhu
Contributor Author

stanhu commented Jun 14, 2021

Just to follow up here, I have not seen a seg fault since we upgraded to v3.17.1, so thanks for all the work here. I should have remembered that funny things can happen during garbage collection.

stanhu added a commit to stanhu/fluent-plugin-google-cloud that referenced this issue Jun 23, 2021
google-protobuf v3.15.x and v3.16.x can seg fault if repeated fields in
a Hash are garbage collected:
protocolbuffers/protobuf#8559
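
A project that wants to guard against the affected releases could check the installed version with a small helper like the sketch below. The helper name and the exact bounds are assumptions drawn from this thread: 3.14.0 did not crash, the rewrite landed in 3.15, and the fix is tagged v3.17.1. Treating every release from 3.15.0 up to (but not including) 3.17.1 as affected is an inference, not something the maintainers stated.

```ruby
require "rubygems"

# Assumed bounds, inferred from the discussion in this issue.
FIRST_AFFECTED = Gem::Version.new("3.15.0")
FIRST_FIXED    = Gem::Version.new("3.17.1")

# Hypothetical helper: true if the given google-protobuf version falls
# in the range where this seg fault was reported.
def affected_by_issue_8559?(version_string)
  version = Gem::Version.new(version_string)
  version >= FIRST_AFFECTED && version < FIRST_FIXED
end

affected_by_issue_8559?("3.15.8") # the version from the original report
```

In a Gemfile, the equivalent constraint would be a requirement such as `gem "google-protobuf", ">= 3.17.1"`.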