Disappearing semaphore array #224

Open
viraptor opened this issue May 27, 2019 · 2 comments

@viraptor
Contributor

I've run into an issue with a semaphore array disappearing from the system while the app is running. I can't find anything in the logs that would explain it, but it started happening pretty much when we moved from Ubuntu 14.04 to 18.04. I couldn't see any other changes that would be related.

The system is running Ruby 2.5.5. The exception we get is:

Semian::SyscallError: semop() failed, errno: 22 (Invalid argument)

File .../vendor/bundle/ruby/2.5.0/gems/semian-0.8.8/lib/semian/protected_resource.rb line 50 in acquire
File .../vendor/bundle/ruby/2.5.0/gems/semian-0.8.8/lib/semian/protected_resource.rb line 50 in acquire_bulkhead
File .../vendor/bundle/ruby/2.5.0/gems/semian-0.8.8/lib/semian/protected_resource.rb line 24 in block in acquire
File .../vendor/bundle/ruby/2.5.0/gems/semian-0.8.8/lib/semian/protected_resource.rb line 38 in block in acquire_circuit_breaker
File .../vendor/bundle/ruby/2.5.0/gems/semian-0.8.8/lib/semian/circuit_breaker.rb line 141 in maybe_with_half_open_resource_timeout
File .../vendor/bundle/ruby/2.5.0/gems/semian-0.8.8/lib/semian/circuit_breaker.rb line 30 in acquire
File .../vendor/bundle/ruby/2.5.0/gems/semian-0.8.8/lib/semian/protected_resource.rb line 37 in acquire_circuit_breaker
File .../vendor/bundle/ruby/2.5.0/gems/semian-0.8.8/lib/semian/protected_resource.rb line 23 in acquire
File .../vendor/bundle/ruby/2.5.0/gems/semian-0.8.8/lib/semian/adapter.rb line 34 in acquire_semian_resource
File .../vendor/bundle/ruby/2.5.0/gems/semian-0.8.8/lib/semian/net_http.rb line 83 in connect

This is with the latest released semian.

The issue starts occurring a number of hours after the deployment, without any obvious pattern of traffic.

I tracked the call down to:

10574.300 ( 0.015 ms): ruby/21041 semtimedop(semid: 131072, tsops: 0x7ffff92379c2, nsops: 1, timeout: 0x7ffff9237aa8) = -1 EINVAL Invalid argument

where the semid: 131072 doesn't exist on the system (normally we have 2 semaphore arrays, but this system had only 1). This was validated using ipcs -s.
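
For anyone wanting to script the same check, here is a minimal Ruby sketch that pulls the semids out of `ipcs -s` text output and tests whether a given id is still present. It assumes the usual Linux (util-linux) column layout of key, semid, owner, perms, nsems. The helper name, the key and the owner in the sample are made up for illustration; only the semids come from this report.

```ruby
# Return the semids listed in the text output of `ipcs -s`.
# Assumes data rows look like: "0x<hexkey> <semid> <owner> <perms> <nsems>".
def semids_from_ipcs(output)
  output.each_line.map do |line|
    fields = line.split
    # Skip blank lines, the banner and the column header.
    next unless fields.length >= 2
    next unless fields[0].match?(/\A0x\h+\z/) && fields[1].match?(/\A\d+\z/)
    Integer(fields[1], 10)
  end.compact
end

# Illustrative sample (key and owner are invented).
sample = <<~IPCS

  ------ Semaphore Arrays --------
  key        semid      owner      perms      nsems
  0x5f3c1a2b 196608     deploy     660        4

IPCS

semids_from_ipcs(sample).include?(131072) # => false
# On a live host: semids_from_ipcs(`ipcs -s`).include?(131072)
```

A semid that a process is still passing to semtimedop() but that is missing from this list is consistent with the EINVAL seen above.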

Please let me know if there's any more debugging information I can provide.

@viraptor
Contributor Author

It turns out there was a bit of misunderstanding of what happened. Additional details:

The Rails app that was affected is normally configured for 6 tickets and runs with 9 workers. The healthy status looks like this:

Semaphore Array semid=196608
uid=1001	 gid=1001	 cuid=1001	 cgid=1001
mode=0660, access_perms=0660
nsems = 4
otime = Mon May 27 18:25:46 2019
ctime = Mon May 27 18:25:46 2019
semnum     value      ncount     zcount     pid
0          1          0          0          14442
1          6          0          0          14442
2          6          0          0          14442
3          1          0          0          14442

For the affected app, the semid it was using was incorrect, and the semaphore array actually present on the instance looked like this:

Semaphore Array semid=196608
uid=1001     gid=1001     cuid=1001     cgid=1001
mode=0660, access_perms=0660
nsems = 4
otime = Mon May 27 12:05:28 2019
ctime = Mon May 27 12:00:03 2019
semnum     value      ncount     zcount     pid
0          1          0          0          1225
1          1          0          0          1225
2          1          0          0          1225
3          0          0          0          1225

Here, pid 1225 no longer existed on the system.
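
As a rough way to automate that observation, the sketch below reads the pid column from `ipcs -s -i <semid>` style output and checks whether those processes are still alive. The helper names are hypothetical, and it assumes the five-column semnum/value/ncount/zcount/pid table shown above.

```ruby
# Extract the distinct pids from the per-semaphore rows of
# `ipcs -s -i <semid>` output (the rows after the "semnum" header).
def semaphore_pids(detail)
  rows = detail.each_line
               .drop_while { |l| !l.lstrip.start_with?("semnum") }
               .drop(1)
  rows.map(&:split)
      .select { |f| f.length == 5 && f.all? { |x| x.match?(/\A\d+\z/) } }
      .map { |f| Integer(f[4], 10) }
      .uniq
end

# True if a process with this pid currently exists.
def pid_alive?(pid)
  return false if pid <= 0 # 0 means "no process"; never signal a group
  Process.kill(0, pid)     # signal 0 only checks for existence
  true
rescue Errno::ESRCH
  false
rescue Errno::EPERM
  true # the process exists but belongs to another user
end

# Usage on a live host (semid taken from the output above):
#   stale = semaphore_pids(`ipcs -s -i 196608`).reject { |p| pid_alive?(p) }
```

A stale pid alone is not proof of a problem, since the last process to operate on a semaphore may simply have exited, but together with a mismatched semid it points at the array having been recreated or removed.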

@jacobbednarz

After a bit of investigation, it turned out this was a side effect of moving to systemd, where logind.conf has RemoveIPC=yes as the default value. The user had a UID > 1000, and some operations were performed by su-ing to that user before the app was started. When that session logged out, logind wiped the semaphores that semian was relying on, which caused the unexpected behaviour.
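
A small Ruby sketch for checking whether a host sets RemoveIPC explicitly is below. It is deliberately simplified: it ignores section headers and does not read drop-in files, so treat a nil result as "not set in this file" rather than as the effective value. The helper name is hypothetical, and /etc/systemd/logind.conf is assumed to be the config location.

```ruby
# Return the RemoveIPC value found in logind.conf-style text, or nil when
# it is not set explicitly. Commented-out lines are ignored and the last
# assignment wins.
def remove_ipc_setting(conf_text)
  value = nil
  conf_text.each_line do |line|
    stripped = line.strip
    next if stripped.empty? || stripped.start_with?("#", ";")
    key, val = stripped.split("=", 2)
    value = val.strip if val && key.strip == "RemoveIPC"
  end
  value
end

# Usage on a live host (path is an assumption about the system):
#   setting = remove_ipc_setting(File.read("/etc/systemd/logind.conf"))
#   warn "RemoveIPC not set, systemd default applies" if setting.nil?
```

If the result is nil or "yes", setting RemoveIPC=no in the [Login] section and restarting systemd-logind, or running the app as a system user, should keep logind from removing the semaphores on logout.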

There are probably some safeguards we could put in place so that semian handles it better when the semaphores are pulled out from under running processes. I'll open a PR if we find anything worthwhile doing there.
