ExaBGP restart & reload race condition #1172

Open
koef opened this issue Aug 8, 2023 · 2 comments

@koef

koef commented Aug 8, 2023

Hello ExaBGP Team,
Firstly, I'd like to express my appreciation for your exceptional product.

We utilize Ansible for installing and configuring ExaBGP in our setups. Below is our Ansible 'exabgp' role:

---
- name: Install packages
  ansible.builtin.apt:
    name:
      - exabgp
    state: present

- name: Configure ExaBGP
  ansible.builtin.template:
    src: exabgp.conf.j2
    dest: /etc/exabgp/exabgp.conf
    mode: 0644
  notify: Reload ExaBGP

- name: Enable and start ExaBGP
  ansible.builtin.systemd:
    name: exabgp
    enabled: true
    state: started

And here's the handler 'Reload ExaBGP':

---
- name: Reload ExaBGP
  ansible.builtin.systemd:
    name: exabgp
    state: reloaded
    enabled: true

Unfortunately, we've noticed an issue. Our monitoring system detected that 'exabgp.service' was restarted: "Systemd's exabgp.service restarted 1 times on node1."

This issue can be reproduced using the 'systemctl restart exabgp && systemctl reload exabgp' command. On my Ubuntu 22.04, the result is as follows:

# systemctl restart exabgp && systemctl reload exabgp
Job for exabgp.service failed because a fatal signal was delivered to the control process.
See "systemctl status exabgp.service" and "journalctl -xeu exabgp.service" for details.

The journal log provides this information:

Aug 07 13:57:38 node1 systemd[1]: Starting ExaBGP...
Aug 07 13:57:38 node1 systemd[1]: Started ExaBGP.
Aug 07 13:57:38 node1 systemd[1]: Reloading ExaBGP...
Aug 07 13:57:38 node1 systemd[1]: Reloaded ExaBGP.
Aug 07 13:57:38 node1 systemd[1]: exabgp.service: Main process exited, code=killed, status=10/USR1
Aug 07 13:57:38 node1 systemd[1]: exabgp.service: Failed with result 'signal'.
Aug 07 13:57:38 node1 systemd[1]: exabgp.service: Scheduled restart job, restart counter is at 1.
Aug 07 13:57:38 node1 systemd[1]: Stopped ExaBGP.
Aug 07 13:57:39 node1 systemd[1]: Starting ExaBGP...
Aug 07 13:57:39 node1 systemd[1]: Started ExaBGP.
...

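From the logs, it appears that systemd sends 'USR1' before exabgp has had enough time to finish starting. Presumably the signal handler is not installed yet at that point, so the signal's default action terminates the process (code=killed, status=10/USR1).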
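The suspected mechanism can be reproduced outside ExaBGP. The following is a minimal sketch, not ExaBGP code: `CHILD_SOURCE` is a hypothetical stand-in daemon whose one-second startup delay is an assumption chosen for illustration, and `signal_during_startup` is an illustrative helper name. It shows that a process receiving SIGUSR1 before it has installed a handler is terminated by the signal's default action.

```python
import signal
import subprocess
import sys
import time

# Hypothetical stand-in for a daemon: it needs a moment to start up before
# it installs its SIGUSR1 handler (the delay is an assumption, not a
# measurement of ExaBGP).
CHILD_SOURCE = """
import signal, time
time.sleep(1.0)                                  # simulated startup work
signal.signal(signal.SIGUSR1, lambda *_: None)   # "reload" handler
time.sleep(5.0)
"""


def signal_during_startup() -> int:
    """Send SIGUSR1 before the child installs a handler; return its exit code."""
    proc = subprocess.Popen([sys.executable, "-c", CHILD_SOURCE])
    time.sleep(0.1)                    # far shorter than the child's startup delay
    proc.send_signal(signal.SIGUSR1)   # plays the role of the reload signal
    return proc.wait()


if __name__ == "__main__":
    code = signal_during_startup()
    # subprocess reports "killed by signal N" as the negative value -N.
    print("killed by SIGUSR1:", code == -signal.SIGUSR1)
```

If the signal is sent only after the handler is in place, the child survives, which is consistent with the delay-based workaround helping.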

To address this, we applied a workaround by overriding the default exabgp unit:

cat /etc/systemd/system/exabgp.service.d/override.conf
[Service]
ExecStartPost=/bin/sleep 2

We'd appreciate you letting us know if there's a better solution.
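One possible alternative to a fixed sleep, offered only as a sketch: on Linux, `/proc/<pid>/status` exposes a `SigCgt` bitmask of the signals a process currently catches, so a small helper could poll until SIGUSR1 shows up there. The function names `catches_signal` and `wait_until_caught` are hypothetical, this is not something ExaBGP or systemd provides, and whether such a helper fits in the same spot as the sleep is an assumption that would need testing.

```python
import os
import signal
import time


def catches_signal(pid: int, signum: int) -> bool:
    """Return True if process `pid` has a handler installed for `signum`.

    Linux-only: reads the SigCgt hex bitmask from /proc/<pid>/status, in
    which bit (signum - 1) is set for every caught signal.
    """
    with open(f"/proc/{pid}/status") as status:
        for line in status:
            if line.startswith("SigCgt:"):
                mask = int(line.split()[1], 16)
                return bool(mask & (1 << (signum - 1)))
    return False


def wait_until_caught(pid: int, signum: int, timeout: float = 5.0) -> bool:
    """Poll until `pid` catches `signum`; give up after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if catches_signal(pid, signum):
            return True
        time.sleep(0.05)
    return False


if __name__ == "__main__":
    # Demonstrate on the current process: install a handler, then check.
    signal.signal(signal.SIGUSR1, lambda *_: None)
    print(wait_until_caught(os.getpid(), signal.SIGUSR1, timeout=1.0))
```

Compared with a fixed delay, polling returns as soon as the handler exists and fails visibly on timeout, at the cost of being tied to the Linux `/proc` layout.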

To Reproduce

Steps to reproduce the behavior:
systemctl restart exabgp && systemctl reload exabgp

Expected behavior

A reload command sent immediately after restarting the service should not cause the service to fail and be restarted a second time.

Environment:

  • OS: Ubuntu 22.04.3 LTS
  • Version 4.2.17
@thomas-mangin
Member

Thank you for reporting this issue, I will try to look into it soon.

@thomas-mangin
Member

I am not sure why you are referencing SIGUSR1. Is it not SIGHUP that is sent when reload is used?

Systemd may send a SIGHUP for reload which may lead to a stop of ExaBGP.

I believe I ended up not using SIGHUP for reload as it can be sent by terminals to indicate that it was closed. We are going back to 2009 when I made these decisions: systemd did not exist yet and standards were not as clear as now.

Systemd should issue a SIGTERM on "stop" (and "restart"), and then restart the program when the program terminates.

Perhaps the systemd file should be changed?

ExecReload=/bin/kill -SIGUSR1 $MAINPID
