UDP probe negative value for failure #741

Closed
Bashere1 opened this issue May 7, 2024 · 10 comments

Labels: bug (Something isn't working)

Comments

Bashere1 commented May 7, 2024

Describe the bug
I am seeing negative values for the failure counter metric for the UDP probe type.
This is reflected both in the Prometheus output for the failure counter and in the prober logs.

Linux Version
22.04.2-Ubuntu

Cloudprober Version
v0.13.3

To Reproduce

cloudprober.cfg

```
host: "redacted"
port: 8080

server {
  type: UDP
  udp_server {
    port: 85
    type: ECHO
  }
}

probe {
  name: "UDP-probes"
  type: UDP
  latency_unit: "ms"
  udp_probe {
    port: 85
  }
  targets {
    file_targets {
      file_path: "/opt/prober/vips.json"
    }
  }
  interval_msec: 1000  # 1s
  timeout_msec: 500    # 500ms

  additional_label {
    key: "dest"
    value: "@target.name@"
  }

  additional_label {
    key: "dest"
    value: "@target.label.dest@"
  }

  additional_label {
    key: "src"
    value: "@target.label.src@"
  }
}
```

vips.json

```
{
    "resource": [
        {
            "name": "redacted",
            "ip": "redacted",
            "port": 85,
            "labels": {
                "dest": "redacted",
                "src": "redacted"
            }
        }
    ]
}
```

Steps to reproduce the behavior:

I've been unable to identify a pattern that causes this behavior.
I have the same config and prober version applied to 50+ other hosts, which do not show negative values for the failure counter.
This occurs on roughly 5 of the 50 hosts we have installed the prober on.

If we restart, the counters reset and start positive, but they gradually decrement to negative values.

Bashere1 added the bug label on May 7, 2024

manugarg commented May 7, 2024

@Bashere1 is your success metric also bigger than total? Also, I am assuming you see a positive failure delta in some cycles and then a negative failure delta... is that correct?

Also, do you have just one target? That's what it looks like from the config (vips.json), but I wanted to make sure.


Bashere1 commented May 7, 2024

I did a little more digging and it appears that failure is a calculated metric.
It appears as if total isn't getting incremented, or success is being incremented twice.

I also reduced the config to just a single target dest for testing.

```
total{ptype="tcp",probe="TCP-probes",dst="redacted",dest="redacted",src="redacted"} 602 1715106857770
total{ptype="udp",probe="UDP-probes",dst="redacted",dest="redacted",src="redacted"} 3007 1715106859929
# TYPE success counter
success{ptype="tcp",probe="TCP-probes",dst="redacted",dest="redacted",src="redacted"} 602 1715106857770
success{ptype="udp",probe="UDP-probes",dst="redacted",dest="redacted",src="redacted"} 3008 1715106859929
# TYPE latency counter
latency{ptype="tcp",probe="TCP-probes",dst="redacted",dest="redacted",src="redacted"} 30515.632 1715106857770
latency{ptype="udp",probe="UDP-probes",dst="redacted",dest="redacted",src="redacted"} 138857.599 1715106859929
# TYPE failure counter
failure{ptype="tcp",probe="TCP-probes",dst="redacted",dest="redacted",src="redacted"} 0 1715106857770
failure{ptype="udp",probe="UDP-probes",dst="redacted",dest="redacted",src="redacted"} -1 1715106859929
```


manugarg commented May 7, 2024

Thanks @Bashere1 for the additional information. Does failure keep growing to be even more negative over time?

I'll take a look at the code to see how success can be larger than total.


Bashere1 commented May 7, 2024

Yes, we do see a positive failure delta in some cycles.
If I bring the dest host's UDP port down, failures will quickly climb back to positive as expected.


Bashere1 commented May 7, 2024

Yes, if all things stay consistent, the negative values keep trending down over time.


manugarg commented May 8, 2024

Okay. I've been trying to reproduce it, but have been unsuccessful so far. From your description, it seems you're able to reproduce it pretty consistently, would you say in about an hour? What's the latency between these hosts -- from your comment it seems to be in the ~50ms range, is that correct?

I'll keep trying to reproduce it. This code has not changed in a long time though, so root causing it will take time.
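(The ~50ms estimate presumably comes from the metric output above: 138857.599 ms of cumulative latency over 3007 probes works out to roughly 46 ms per probe.)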


Bashere1 commented May 8, 2024

I am able to recreate it consistently, but it only appears to happen from a single src host.
I've not seen this behavior from any other src host with the prober installed.
I should have added more detail earlier: I see this behavior even across very low-latency connections (< 0.2 ms) from this src host to a number of dest hosts.

I'm going to try reinstalling on another host within the same subnet as the problem host to see if the behavior repeats.

Is there any debugging that I can enable that would help?


manugarg commented May 8, 2024

If it's just a single host, trying on a different host will help.

A couple more questions:

  • Do "source machines" act as "destination" too, for other sources?
  • Are you probing just one destination per source? Also, are there multiple UDP probes?
  • Earlier you mentioned approx 5 out of 50 hosts exhibiting this behavior, did you mean 5 targets out of 50 targets?

(I am still running the prober to reproduce, but I may not have that scale).

Bashere1 commented

  • We have the prober installed on about 50 hosts, set up in a full-mesh configuration where each src host probes all of the other hosts.
  • Source machines do not also act as dest.
  • We only see this issue from a single src host to multiple dests; those dests, acting as sources themselves, do not see the issue toward any other host in the mesh, only toward the single problem src host.
  • I installed the prober on another machine within the same subnet and do not see this issue, except toward the known problem host acting as a dest.

This issue seems isolated to a single host, and since it's such a perplexing issue I don't want to waste your time.

manugarg commented

Thanks @Bashere1 for further testing! I'm going to close this issue then.
