Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve robustness of ping function and reduce false flags #226

Open
wants to merge 1 commit into
base: dev
Choose a base branch
from

Conversation

ishanjain28
Copy link

Hey,

Currently the ping function in this project pings a host just once and considers it a failure if sent packets != received packets.

This results in a lot of false flags. ICMP echo request packets are some times one of the top most packets that'll be dropped if the router needs room to process some thing else.

This PR changes this behavior and now,

  1. It'll send 3 packets with an interval of 100ms
  2. It'll only consider it a failure if there is more than 99% packet loss.

There is the "Notify after failures" setting but that's not quite what I want since these failures are still recorded and show up on the graph.

Ping a host 3 times with an interval of 100ms and consider it a failure
only if there is more than 99% packet loss to that host
@jemand771
Copy link
Member

heya and thanks for the PR!

I like the idea but I'm not really sure what a good threshold would be. I agree that a single lost packet shouldn't mean DOWN right away, but I also think that tolerating 99% loss is way too lenient.

we could make the loss% threshold configurable, but I'm not convinced people actually need to change it if the default is reasonable. the only thing left to do is finding a reasonable threshold :b

as a heads up, we're still catching up with older issues so getting this released might take a bit longer than expected

@ishanjain28
Copy link
Author

Hey! Thank you for looking into this. I agree, 99% packet loss is quite lenient and it should be changed. Perhaps 50-60% packet loss and 4-5 packets in each test might be a better compromise ?

Another option is sending packets every second with a timeout that starts when the first packet is sent and stopping the test when we have the first reply. This should also help in reducing false alarms.

@adamboutcher
Copy link
Collaborator

we could make the loss% threshold configurable

This would be my view.

@adamboutcher adamboutcher added the Improvement An Improvement of an already implemented feature label Jul 12, 2023
@adamboutcher
Copy link
Collaborator

Currently the ping function in this project pings a host just once and considers it a failure if sent packets != received packets.

Personally this is intended behaviour, however, it depends if you're using the tool to check for availability issues, or just general uptime overall, I do think changing the functionality would be useful for others as long as previously mentioned this can be configurable and default to the current behaviour.

@ishanjain28
Copy link
Author

ICMP is some times one of the first type of traffic to be dropped if the device/router is under load. This is why, It's not a reliable indicator for uptime or availability issues. (A "TCP Ping" is perhaps more suited).

With the current approach, It causes way too many false alarms to be useful in any capacity. The packet loss threshold can be made configurable or personally, I also like the second idea.

Another option is sending packets every second with a timeout that starts when the first packet is sent and stopping the test when we have the first reply.

@adamboutcher
Copy link
Collaborator

Then surely you just use the TCP port checking service?

@ishanjain28
Copy link
Author

Yess, I do that where ever I can but not every single service I wanted to monitor exposes a tcp listener I can use for health/availability checks

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Improvement An Improvement of an already implemented feature
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants