Support "set" metric type #183

Open
bryanlarsen opened this issue Feb 5, 2019 · 10 comments

@bryanlarsen

In statsd, the set type counts unique occurrences between flushes. This is not supported in statsd_exporter because "between flushes" is pretty meaningless in a pull system, especially if there are multiple servers scraping the exporter.

Originally requested in #112

@matthiasr requested opening a new issue if somebody had a good idea on how to implement Set. I wouldn't say that I have a good idea, but I do have some thoughts.

Option 1: assume single scraping server. Not a great solution, but would be sufficient for us, at least at the moment.

Option 2: create a statsd plugin that sends sets as gauges on flush. Requires the use of the statsd daemon.

Option 3: add an option for the flush interval; create a ticker from it that persists and resets the set counts every tick. If the option isn't set it could have a default, or it could just mean the user doesn't need set support.

Option 3 seems like the best option, but option 2 is definitely easier and good enough for us. So we'd love to help with option 3, but if there's no interest we'd probably just go ahead and do option 2 ourselves.

@matthiasr

matthiasr commented Feb 5, 2019 via email

@bryanlarsen

Are you thinking of "reaaaally long time" being the default? At least in our case that wouldn't be an issue. To do the count you'd need a map or a hyperloglog or something to track the unique instances, and I suppose for some people that could grow excessively. Our set sizes are around 100, so a long flush time would just keep counting the same objects repeatedly, with no unbounded growth.

statsd uses a Set to count objects rather than a hyperloglog or anything fancy, so they're not worried about unbounded growth.

@bryanlarsen

Better answer: we won't have a count until after the first flush interval expires.

@matthiasr

Hmmm, I think I get what sets do now. I'll try to explain it back to make sure:

When sending

foo:123|s
foo:456|s
foo:123|s

then the next time statsd flushes to graphite, it sends foo 2 <timestamp>, and then I send

foo:123|s
foo:456|s
foo:789|s

again, then on the next flush it will send foo 3 <timestamp> to graphite?

  • In the statsd/graphite setup, how do you aggregate the result over time? Is there a way to turn "uniques per 10s" into "daily active users"?
  • a common way (that I always recommend) to deploy statsd exporter is to have many of them (one per application instance) – how could one aggregate uniques across them?
  • Is the flush interval usually aligned on some wall clock time?
  • What would happen if multiple statsd exporters are not aligned?
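To make the semantics concrete, here is a minimal sketch in Go of how the unique counting described above behaves. The parser and function names are illustrative only, not anything from statsd or the exporter; it handles just the `name:value|s` form:

```go
package main

import (
	"fmt"
	"strings"
)

// countUniques counts unique set values per metric name from raw
// statsd lines of the form "name:value|s". This is a deliberately
// simplified parser, just to illustrate the set semantics.
func countUniques(lines []string) map[string]int {
	sets := map[string]map[string]struct{}{}
	for _, line := range lines {
		name, rest, ok := strings.Cut(line, ":")
		if !ok {
			continue
		}
		value, typ, ok := strings.Cut(rest, "|")
		if !ok || typ != "s" {
			continue
		}
		if sets[name] == nil {
			sets[name] = map[string]struct{}{}
		}
		sets[name][value] = struct{}{}
	}
	counts := map[string]int{}
	for name, s := range sets {
		counts[name] = len(s)
	}
	return counts
}

func main() {
	// First flush interval: foo:123|s, foo:456|s, foo:123|s
	fmt.Println(countUniques([]string{"foo:123|s", "foo:456|s", "foo:123|s"})) // map[foo:2]
	// Second flush interval: foo:123|s, foo:456|s, foo:789|s
	fmt.Println(countUniques([]string{"foo:123|s", "foo:456|s", "foo:789|s"})) // map[foo:3]
}
```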

@matthiasr

  • Does statsd accept arbitrary strings as values, or does it have to be a number? An integer or a float?

@matthiasr

From the implementation, I don't see a problem with option 3. I don't think we need hyperloglog or anything. A map[<value type>]struct{} would probably suffice, and then when flushing we take it out, replace it with an empty map, and count the keys. If someone really needs to count billions of objects within seconds this exporter is not the tool for them, I think 😆

My concern is whether this will actually produce a metric that is useful to anyone. The obvious output would be a gauge with the set count, but then I wonder how one would actually use that. And if Prometheus scrapes less frequently than the flush interval, or the alignments are odd, you'd completely lose the information from some flush intervals. Would it make sense to also observe the count for each interval into a histogram?
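The swap-on-flush idea above could be sketched like this in Go. All names here (`setMetric`, `observe`, `flush`) are hypothetical, not the exporter's actual API; the mutex assumes observations arrive concurrently with the flush tick:

```go
package main

import (
	"fmt"
	"sync"
)

// setMetric counts unique string values between flushes: a plain
// map[string]struct{}, swapped out and counted on each flush.
type setMetric struct {
	mu     sync.Mutex
	values map[string]struct{}
}

func newSetMetric() *setMetric {
	return &setMetric{values: make(map[string]struct{})}
}

// observe records one statsd "|s" sample.
func (m *setMetric) observe(v string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.values[v] = struct{}{}
}

// flush swaps in an empty map and returns the old unique count,
// which would then be exported, e.g. as a gauge.
func (m *setMetric) flush() int {
	m.mu.Lock()
	defer m.mu.Unlock()
	n := len(m.values)
	m.values = make(map[string]struct{})
	return n
}

func main() {
	m := newSetMetric()
	m.observe("worker-1")
	m.observe("worker-2")
	m.observe("worker-1") // duplicate, counted once
	fmt.Println(m.flush()) // 2
	fmt.Println(m.flush()) // 0, the set was reset
}
```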

@bryanlarsen

We're using strings for our sets. Looking at the statsd source code, it appears that they're storing all values as strings.

As for your concern, isn't this the same issue that pretty much every gauge will have? A gauge is typically a continuous signal sampled at a regular period and pushed to statsd, which is then scraped at a different period. So if it's pushed more frequently than it's scraped, samples will be dropped. Annoying, but given that it's a continuous signal, there is an infinite number of potential samples that we're necessarily dropping.

In our case, the signal we're measuring with the set is also a continuous signal. It's the number of idle workers. Each worker periodically sends its ID to statsd while it's idle. So we have 3 periods we have to contend with! In our case, the worker reporting period is 10 seconds and the flush interval is 60 seconds, so we can tolerate up to 5 dropped packets. Our scrape interval is 30 seconds (the default; haven't found a reason to tune it yet).

The measurement we care about is the minimum. We don't want to run out of available workers. So yes, if the scrape interval becomes a significant multiple of the flush interval, then the dropped samples might be painful.

However, I think this is less of a concern for sets than it would be for other gauge users. Most gauges are easier to sample more frequently than a set is. For example, one could sample a temperature ten times per second.

So I think your concern is quite independent of sets. It'd probably be quite useful to add the ability to histogram gauges, along with sets as a specific type of gauge. But I don't think that request belongs under this specific issue.

@matthiasr

Ok, that makes sense, and that's an interesting use case! Do you feel up to implementing this? As I said above, I think a simple map to hold the set will do to start with. To keep things simple for users I would turn it on and set a reasonable default on the flush interval – 1 minute maybe? If there are no recorded sets then there won't be anything to clean up so it won't use measurable resources.

One thing to keep in mind is not to leak too many goroutines when reloading the configuration – one way to do that would be to trigger a last flush on reload and then tear down any flushing routines. (This is a suggestion – if you think there's a better way, then do that!)

@bryanlarsen

That's a good question! I don't have anything beyond a superficial exposure to Go, but I don't expect that to be much of a stumbling block. And the time I have to do this sort of thing is fairly limited. As I said, option 2 would definitely be easier and sufficient for us, but I'd definitely be interested in doing option 3 if you're willing to provide guidance.

@matthiasr

Absolutely! My Go is also … functional, at best, but we'll work through it 😄 open a PR early and I can give feedback. If you allow edits from maintainers I can also make changes to it directly, if opportune.

@bryanlarsen bryanlarsen mentioned this issue Feb 8, 2019
@matthiasr matthiasr changed the title Set type not supported Support "set" metric type Jun 12, 2019