Skip to content
This repository has been archived by the owner on Jul 31, 2023. It is now read-only.

DistributionData and view.Rows in general do not get reset between Reporting Intervals #1181

Open
marwan-at-work opened this issue Oct 30, 2019 · 3 comments · May be fixed by #1182
Open

DistributionData and view.Rows in general do not get reset between Reporting Intervals #1181

marwan-at-work opened this issue Oct 30, 2019 · 3 comments · May be fixed by #1182
Labels

Comments

@marwan-at-work
Copy link

marwan-at-work commented Oct 30, 2019

I have a program that records the latency of a request using the OCHTTP server middleware.

In my program, I am using the DataDog OpenCensus Exporter to report this metric.

Therefore, I wanted to simulate a latency spike by having my handler do the following:

  1. The first 4 minutes, the HTTP Handler takes between 200-400ms to respond.

  2. The second 6 minutes, the HTTP handler takes between 1-2seconds (latency spike)

  3. The third 4 minutes, the HTTP handler takes between 200-400ms (latency goes back to normal)

The latency gets reported every 10 seconds

In other words, OpenCensus calls DataDog's ExportView every 10 seconds.

OpenCensus collects the Min/Max latencies and passes them as DistributionData to the DataDog exporter.

My expectation is that for every 10 seconds, OpenCensus calculates the Min/Max/Mean/etc and passes them to an Exporter, but then it should reset those numbers so that they don't obscure the next bucket of time. Is this assumption incorrect?

Even if this assumption is wrong, let's continue with how this behavior has obscured the latency spike significantly:

The fact that OpenCensus does not "reset" the Min every 10 seconds, means that the Min stays the same across the entire lifetime of the Go process.

Therefore, if we had a dashboard of the "min" value only, you'll see that the min never went up, it stays as ~200ms for the entirety of the 14 minutes. Even though for 6 minutes, the Min value was at least 1000ms.

On the other hand, if you have a "max" metric, you'll see that it went up to ~2s, but then it never went back down even though the traffic spike ended after 10 minutes.

Predictably, if I "kill" the process and restart it, suddenly the dashboard gets fixed and that's because we no longer hold the "min/max" that was previously preventing the dashboard to be accurate. See the following screen shot for demonstration:

Screen Shot 2019-10-30 at 2 27 26 AM

What you see here is that the Purple line is the Max metric while the Blue line is the Min metric.

When 4 minutes passed, the Purple line went up, but the Blue line stayed down. This is because even though new incoming requests were clocking between 1-2 seconds, OpenCensus kept comparing those numbers to the Min that was initially recorded in the first 4 minutes.

However, once I kill the process and restart the server (resuming the traffic spike) you can see the Min gets updated. Similarly, when the spike finished, the Max never goes down until I forcefully reset the Max.

As far as I understand, and at the very least, restarting the process should not significantly affect what the chart looks like as seen in the screenshot above

My understanding is that OpenCensus creates a metric for a bunch of Measures per Reporting Interval (I can easily be wrong about this). Therefore, if I manually empty out the DistributionData after every interval, you can see the Dashboard now looks very correct:

ocacc

Finally, I have created a reproduction Repository here: https://github.com/marwan-at-work/rcrepro

I included a Vendor folder that fmt.Println's the incoming DistributionData and you can notice that even when the traffic starts to go up to 1-2 seconds for 6 minutes, the Min value that DataDog receives stays the same.

AFAIK, there's no need to actually set up Datadog, you can just run the program and inspect the logs.

Also, I'm not familiar with the OpenCensus codebase (until this issue), but all I had to do to fix it was add this line: marwan-at-work@1380fae#diff-2000ebb97830a1f0f1c5c4856a737f78R236

Not sure how thorough the unit tests are, but this didn't break any of them.

@marwan-at-work marwan-at-work linked a pull request Oct 30, 2019 that will close this issue
@marwan-at-work marwan-at-work changed the title DistributionData and view.Rows in general do not get reset between ReportingIntervals DistributionData and view.Rows in general do not get reset between Reporting Intervals Nov 1, 2019
@marwan-at-work
Copy link
Author

marwan-at-work commented Nov 7, 2019

@odeke-em gentle ping to see if you or anyone would be able to look at this. Thanks!

@odeke-em
Copy link
Member

odeke-em commented Nov 7, 2019 via email

@rghetia
Copy link
Contributor

rghetia commented Nov 7, 2019

@marwan-at-work I just commented on your PR.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants