Exabgp stuck and cause tcp zerowindow error #1207
Comments
Currently travelling, so a short answer: it is probably the pipe filling up, the most commonly reported issue (see #1168). |
I will read your issue properly when I can, but you did not provide the information I request in order to help. Could you please do so? Thank you. |
It is still stuck when I set api.ack to off. |
The -d output looks like this:
And the log file contains (I replaced some sensitive info with xxx):
Please let me know if you need any more info. |
I support bug fixes on master and perform partial backports to the 4.2 branch, depending on complexity. So, could you please re-check with The syntax to start exabgp on master changed slightly. I attempted to keep some backward compatibility, but please check the exabgp command line syntax if you need to leave the 4.2 branch. I will assume you used the version provided by your OS vendor. The simplest option may be to uninstall the package and re-install exabgp either via pip or as a zipapp, as explained in the README, adding exabgp to the path if required. |
For information, exabgp has no code to set the TCP window to zero, so I will attempt an educated guess that it is caused by the OS as the network inbound buffer in the kernel for the application is full. It would happen if the packets were not "consumed" by ExaBGP. For it to happen, the application would need to be stuck (the code is async). I only saw this behaviour due to the pipe being filled to or from the external process. Previously, some other users failed to realise that they had multiple versions of exabgp installed on the machine (having exabgp installed via pip and an os package is common). It led them to think they had set the pipe option when they had not. You could check by changing the server command line to print the environment to be sure it is not the issue. |
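To illustrate the "pipe filling" failure mode described above, here is a minimal sketch in Python. It is not ExaBGP code: it only shows that a pipe whose reader never drains it accepts a limited amount of data, after which a non-blocking writer gets an error (and a blocking writer would simply hang). The function name `fill_pipe` and the chunk size are made up for the illustration.

```python
import os


def fill_pipe(chunk_size=4096):
    """Write to a pipe nobody reads until the kernel refuses more data.

    Returns the number of bytes accepted before the pipe was full.
    """
    read_fd, write_fd = os.pipe()
    # Non-blocking so the demo stops instead of hanging; a blocking
    # writer would stall at this point until the reader consumed data.
    os.set_blocking(write_fd, False)
    written = 0
    try:
        while True:
            try:
                written += os.write(write_fd, b"x" * chunk_size)
            except BlockingIOError:
                # EAGAIN: the pipe buffer is full because nothing reads it
                break
    finally:
        os.close(read_fd)
        os.close(write_fd)
    return written


if __name__ == "__main__":
    print("pipe accepted", fill_pipe(), "bytes before filling up")
```

The exact capacity depends on the OS and its settings, so no particular number should be expected. The point is only that the capacity is finite, so a process that stops reading its stdin (or whose stdout is not read) can eventually stall the peer on the other end of the pipe.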
I also tried version 4.2.21 and got the same result. I will try the main branch tomorrow. |
OK, I will try this. Just to be clear: when checking the environment, I should focus on the api.ack option, right? |
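One possible way to check which settings actually reach the external process is to have that process dump its own environment. The sketch below is only an illustration: it assumes the relevant settings are visible as environment variables whose names start with "exabgp" (for example `exabgp.api.ack`), which should be verified against the installed version. The helper name `exabgp_settings` is hypothetical.

```python
import os
import sys


def exabgp_settings(environ):
    """Return the entries whose name starts with 'exabgp' (case-insensitive)."""
    return {
        name: value
        for name, value in environ.items()
        if name.lower().startswith("exabgp")
    }


if __name__ == "__main__":
    # Write to stderr only: the stdout of an API process is read by
    # ExaBGP as commands, so diagnostics must not go there.
    for name, value in sorted(exabgp_settings(os.environ).items()):
        print(f"{name}={value}", file=sys.stderr)
```

If nothing is printed, the process did not inherit any such variable, which would suggest the option was set somewhere other than where the running exabgp reads it (for instance in a second installation).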
I can not tell you what you should focus on. I can only offer guesses and I am not a very good fortune teller. I have no idea what your code is doing, as you did not share it. I can only tell you what has caused issues for past users. But yes, you should perhaps try to run the program from the command line. ExaBGP can print what it parsed from the environment and env configuration file. If that is correct, you could run it with -d from the command line and see if the issue is occurring. If it does happen, you may want to run the program with |
OK, I will give it a try. I will come back with more detailed info if the issue still happens. |
I tried with version 4.2.21 and made sure api.ack is set to false with the --fi option, and it still fails. I cannot try the master branch since I cannot upgrade Python from 3.6 to 3.7 on the live server. My process script is simple: it just gets the messages from stdin and appends them to a file. The exabgp config file is:
And the process script is:
The strace output is attached here: |
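The script itself is not reproduced in this thread. For readers who want something to experiment with, a process script of the kind described (read messages from stdin, append them to a file) could be sketched as follows. This is a hypothetical stand-in, not the reporter's code; the function name `append_messages` and the command-line handling are assumptions.

```python
import sys


def append_messages(source, path):
    """Append every line read from source to the file at path.

    Returns the number of lines written.
    """
    count = 0
    with open(path, "a") as sink:
        # readline() returns each line as soon as it is complete, and
        # returns "" at end of file, which stops the iteration.
        for line in iter(source.readline, ""):
            sink.write(line)
            sink.flush()  # make each message visible on disk immediately
            count += 1
    return count


# Hypothetical invocation: python3 this_script.py /path/to/output.log
if __name__ == "__main__" and len(sys.argv) == 2:
    append_messages(sys.stdin, sys.argv[1])
```

The important property is that the loop keeps reading stdin for as long as ExaBGP keeps the pipe open, so the pipe never fills up because of the script.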
Looking at the release notes of Python 3.7, 3.6 could still work for master. Reading the strace before it received SIGTERM:
This is ExaBGP writing to your program.
This is your program flushing to the file.
Your program is trying to read from stdin and timing out. I do not see any
ExaBGP writing something with time ... as strace is not set up to show the whole string,
It was killed. Try changing your code to have:
This may give you an idea of what is wrong, if something is wrong. But at first sight, nothing is obviously wrong. |
I ran sbin/exabgp on the master branch and it says exabgp needs Python 3.7. I will try to use release to rebuild an exabgp, and I will also check whether there is anything wrong with the script. |
I can not remember why 3.7 was picked over 3.6 ... You can try to edit the test in the code to change it. |
It turns out the master branch is working (at least it is not stuck at the first stage) after I removed the Python version check. I will do more testing and come back to close the issue if everything works fine. |
I can now remember why Python 3.6 was dropped: supporting it on GitHub is a real pain. The code on master was updated to run on 3.6 and some backward compatibility code was added, so it should work, but I need to set up an old Ubuntu 20.04 VM to find out why the testing is not working, so it is now disabled. |
OS: CentOS Linux release 7.3.1611
Version: 4.2.21
What I did:
I established an iBGP link-state neighbor with a Cisco IOS-XR device. When the topology is small, it all works fine. But when we use the same configuration on a live topology with more nodes and links (about 800 nodes and 6000 links), an error occurs.
ExaBGP only receives 3 update messages and then it cannot get any more messages from stdin. I checked the TCP queue on CentOS and it shows as full. On the peer Cisco device, all the messages are stuck in the output queue.
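For anyone wanting to repeat the receive-queue check on Linux, the kernel exposes per-socket queue sizes in /proc/net/tcp. The sketch below parses that file and keeps the sockets using the BGP port (179). It is an illustration with hypothetical function names, it only covers IPv4 sockets, and it reads the port and queue columns only.

```python
import os


def parse_proc_net_tcp_line(line):
    """Extract ports and queue sizes from one data line of /proc/net/tcp.

    Ports and queue sizes are hexadecimal; column 5 is tx_queue:rx_queue.
    """
    fields = line.split()
    tx_queue, rx_queue = (int(value, 16) for value in fields[4].split(":"))
    return {
        "local_port": int(fields[1].split(":")[1], 16),
        "remote_port": int(fields[2].split(":")[1], 16),
        "tx_queue": tx_queue,
        "rx_queue": rx_queue,  # bytes received but not yet read by the app
    }


def bgp_queues(path="/proc/net/tcp", port=179):
    """Return parsed rows for sockets with the given port on either end."""
    with open(path) as handle:
        next(handle)  # skip the header line
        rows = [parse_proc_net_tcp_line(line) for line in handle if line.strip()]
    return [row for row in rows if port in (row["local_port"], row["remote_port"])]


if __name__ == "__main__" and os.path.exists("/proc/net/tcp"):
    for row in bgp_queues():
        print(row)
```

A large `rx_queue` that does not shrink over time means the application is not reading from the socket, which is the condition under which the kernel ends up advertising a zero window to the peer.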
I captured packets on the CentOS host (sorry, I cannot share the pcap file since there is sensitive information inside). I can see the first update message arrive at 6.08 seconds.
Then all the other update messages arrive between 10.95 seconds and 11.31 seconds: about 150 update packets containing more than 300 update messages arrive within 0.4 seconds, and then CentOS sends a TCP zero-window message to the Cisco device.
I tried replacing exabgp with gobgp, and gobgp works properly. So maybe exabgp has an issue dealing with a large number of update messages.
Please let me know if you need any more info.