UDP probe negative value for failure #741

Closed
Bashere1 opened this issue May 7, 2024 · 10 comments

Labels: bug (Something isn't working)

Comments

Bashere1 commented May 7, 2024

Describe the bug
I am seeing negative values for the failure counter metric for the UDP probe type.
This is reflected both in the Prometheus output for the failure counter and in the prober logs.

Linux Version
22.04.2-Ubuntu

Cloudprober Version
v0.13.3

To Reproduce

cloudprober.cfg

```
host: "redacted"
port: 8080

server {
  type: UDP
  udp_server {
    port: 85
    type: ECHO
  }
}

probe {
  name: "UDP-probes"
  type: UDP
  latency_unit: "ms"
  udp_probe {
    port: 85
  }
  targets {
    file_targets {
      file_path: "/opt/prober/vips.json"
    }
  }
  interval_msec: 1000  # 1s
  timeout_msec: 500    # 500ms

  additional_label {
    key: "dest"
    value: "@target.name@"
  }

  additional_label {
    key: "dest"
    value: "@target.label.dest@"
  }

  additional_label {
    key: "src"
    value: "@target.label.src@"
  }
}
```

vips.json

```
{
    "resource": [
        {
            "name": "redacted",
            "ip": "redacted",
            "port": 85,
            "labels": {
                "dest": "redacted",
                "src": "redacted"
            }
        }
    ]
}
```

Steps to reproduce the behavior:

I've been unable to identify a pattern that causes this behavior.
I have the same config and prober version applied to 50+ other hosts, which do not show negative values for the failure counter.
This occurs on roughly 5 of the 50 hosts we have installed the prober on.

If we restart, the counters reset and start positive, but they gradually decrement to negative values.

Bashere1 added the bug label on May 7, 2024

manugarg commented May 7, 2024

@Bashere1 is your success metric also bigger than total? Also, I am assuming you see a positive failure delta in some cycles and then a negative failure delta... is that correct?

Also, do you have just one target? That's what it looks like from the config (vips.json), but I wanted to make sure.


Bashere1 commented May 7, 2024

I did a little more digging and it appears that failure is a calculated metric.
It appears as if total isn't getting incremented, or success is being incremented twice.

I also reduced the config to just a single target dest for testing.

```
total{ptype="tcp",probe="TCP-probes",dst="redacted",dest="redacted",src="redacted"} 602 1715106857770
total{ptype="udp",probe="UDP-probes",dst="redacted",dest="redacted",src="redacted"} 3007 1715106859929
# TYPE success counter
success{ptype="tcp",probe="TCP-probes",dst="redacted",dest="redacted",src="redacted"} 602 1715106857770
success{ptype="udp",probe="UDP-probes",dst="redacted",dest="redacted",src="redacted"} 3008 1715106859929
# TYPE latency counter
latency{ptype="tcp",probe="TCP-probes",dst="redacted",dest="redacted",src="redacted"} 30515.632 1715106857770
latency{ptype="udp",probe="UDP-probes",dst="redacted",dest="redacted",src="redacted"} 138857.599 1715106859929
# TYPE failure counter
failure{ptype="tcp",probe="TCP-probes",dst="redacted",dest="redacted",src="redacted"} 0 1715106857770
failure{ptype="udp",probe="UDP-probes",dst="redacted",dest="redacted",src="redacted"} -1 1715106859929
```


manugarg commented May 7, 2024

Thanks @Bashere1 for the additional information. Does failure keep growing to be even more negative over time?

I'll take a look at the code to see how success can be larger than total.


Bashere1 commented May 7, 2024

Yes, we do see a positive failure delta in some cycles.
If I bring the dest host's UDP port down, failures will quickly climb back to positive as expected.


Bashere1 commented May 7, 2024

Yes, if all things stay consistent, the negative values keep trending down over time.


manugarg commented May 8, 2024

Okay. I've been trying to reproduce it, but have been unsuccessful so far. From your description, it seems you're able to reproduce it pretty consistently, would you say in about an hour? What's the latency between these hosts -- from your comment it seems to be in the ~50ms range, is that correct?

I'll keep trying to reproduce it. This code has not changed in a long time though, so root causing it will take time.
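(The ~50ms estimate presumably comes from the metric output above: 138857.599 ms of cumulative latency over 3007 probes works out to roughly 46 ms per probe.)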


Bashere1 commented May 8, 2024

I am able to recreate it consistently, but it only appears to happen from a single src host.
I've not seen this behavior from any other src host with the prober installed.
I should have added more detail earlier: I see this behavior even across very low-latency connections (< 0.2 ms) from this src host to a number of dest hosts.

I'm going to try reinstalling on another host within the same subnet as the problem host to see if the behavior repeats.

Is there any debugging that I can enable that would help?


manugarg commented May 8, 2024

If it's just a single host, trying on a different host will help.

A couple more questions:

  • Do "source machines" act as "destination" too, for other sources?
  • Are you probing just one destination per source? Also, are there multiple UDP probes?
  • Earlier you mentioned approx 5 out of 50 hosts exhibiting this behavior, did you mean 5 targets out of 50 targets?

(I am still running the prober to reproduce, but I may not have that scale).

Bashere1 commented

  • We have the prober installed on about 50 hosts, set up in a full-mesh configuration where each src host probes all of the other hosts.
  • Source machines do not also act as dest.
  • We only see this issue from a single src host to multiple dests; those dests, acting as sources themselves, do not see the issue toward any other host in the mesh, only toward the single problem src host.
  • I installed the prober on another machine within the same subnet and do not see this issue, except toward the known problem host acting as a dest.

This issue seems isolated to a single host, and since it's such a perplexing issue I don't want to waste your time.

manugarg commented

Thanks @Bashere1 for further testing! I'm going to close this issue then.
