
Add optional metrics in Stackdriver surfacer to monitor channel and cache usage #262

Open
kushal288 opened this issue Feb 1, 2023 · 8 comments
Labels
enhancement New feature or request

Comments

@kushal288
Contributor

Since the Stackdriver surfacer maintains a cache of metrics, we would also like to emit metrics that track the cache and channel lengths at regular intervals. This will be useful for high-throughput traffic, where metrics might accumulate in the cache.

@kushal288 kushal288 added the enhancement New feature or request label Feb 1, 2023
@kushal288
Contributor Author

@manugarg For implementing this, I was thinking that it should emit metrics in a similar way to how we emit them from probers.
However, I am observing that fmt.Printf("cacheLength %v", len(cache)) doesn't export this metric to surfacers.
To log metrics from inside the surfacer, we did:

// Channel usage as a percentage of the channel's capacity.
usage := float64(len(s.writeChan)) * 100.0 / float64(cap(s.writeChan))

// Build a gauge EventMetrics and push it onto the surfacer's own write channel.
em := metrics.NewEventMetrics(time.Now()).
	AddLabel("surfacer", "custom_stackdriver").
	AddMetric("channel_usage", metrics.NewFloat(usage)).
	AddMetric("cache_entries", metrics.NewFloat(float64(len(s.cache))))
em.Kind = metrics.GAUGE
s.writeChan <- em.Clone()

We could do this in the Stackdriver surfacer, but if we want to log metrics from other surfacers, how should we go about it?

@manugarg
Contributor

manugarg commented Feb 1, 2023

@kushal288

Since the cache implementation is different for each surfacer, I think we'll need to implement this in each surfacer individually. I'll give it more thought, but one quick way to get the info you need will be to log these metrics using logger and then create metrics from the logs in stackdriver.
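A minimal, stdlib-only sketch of what such a loggable line could look like. The function name and format here are illustrative (not cloudprober's API); the point is a stable, parseable line that a Stackdriver log-based metric can extract values from:

```go
package main

import (
	"fmt"
	"log"
)

// formatUsageLog builds a log line in a stable, parseable format so a
// Stackdriver log-based metric can extract the values with a regex.
// The label names follow the subsystem/subsystem_id convention
// discussed later in this thread.
func formatUsageLog(subsystem, subsystemID string, chanLen, chanCap, cacheEntries int) string {
	usage := float64(chanLen) * 100.0 / float64(chanCap)
	return fmt.Sprintf("internal_metrics subsystem=%s subsystem_id=%s channel_usage=%.2f cache_entries=%d",
		subsystem, subsystemID, usage, cacheEntries)
}

func main() {
	// Hypothetical numbers standing in for len(s.writeChan),
	// cap(s.writeChan), and len(s.cache).
	log.Print(formatUsageLog("surfacer", "stackdriver", 250, 1000, 42))
}
```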

@manugarg
Contributor

manugarg commented Feb 3, 2023

I think we can standardize internal metrics with the following labels: subsystem and subsystem_id. For example, for stackdriver surfacer:
subsystem=surfacer, subsystem_id=stackdriver
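For illustration, a self-contained sketch of how a metric line might render with these labels, in a Prometheus-style text form. This is only a demonstration of the label convention; in cloudprober itself the labels would be carried on an EventMetrics via AddLabel:

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// metricLine renders a metric name, labels, and value as a single
// text line, e.g. cache_entries{subsystem=surfacer,subsystem_id=stackdriver} 42.
func metricLine(name string, labels map[string]string, value float64) string {
	keys := make([]string, 0, len(labels))
	for k := range labels {
		keys = append(keys, k)
	}
	sort.Strings(keys) // deterministic label order
	pairs := make([]string, 0, len(keys))
	for _, k := range keys {
		pairs = append(pairs, fmt.Sprintf("%s=%s", k, labels[k]))
	}
	return fmt.Sprintf("%s{%s} %g", name, strings.Join(pairs, ","), value)
}

func main() {
	fmt.Println(metricLine("cache_entries",
		map[string]string{"subsystem": "surfacer", "subsystem_id": "stackdriver"}, 42))
}
```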

@kushal288
Contributor Author

Yes, I think these labels make sense. We can add them.
However, the issue is that currently any metric printed to stdout from a surfacer implementation (e.g. fmt.Printf("cacheLength{subsystem=surfacer, subsystem_id=stackdriver} %v", len(cache))) doesn't get exported to the surfacer component, so we are not able to use the existing metric-collection infrastructure for the surfacers themselves.

@manugarg
Contributor

manugarg commented Feb 6, 2023

@kushal288 So the way it works is slightly more complicated than that. We build an EventMetrics object and pass it on to surfacers through a channel. In this case, instead of pushing metrics to a channel, we can just use the surfacer's own Write() interface:

func (s *SDSurfacer) Write(_ context.Context, em *metrics.EventMetrics) {

So if the option is set to export internal metrics, we'll start another loop (in the New() method) and export metrics regularly like this:

func New(...) {
	...
	if c.GetWriteInternalMetrics() {
		go s.internalMetricsLoop(ctx)
	}
}

func (s *SDSurfacer) internalMetricsLoop(ctx context.Context) {
	...
	s.Write(ctx, metrics.NewEventMetrics(...))
	...
}

(For the structure of internalMetricsLoop, take a look at other context based loops in the same file and other places maybe?)

Does that make sense, or did I misunderstand your question? One thing to think about, though: if the surfacer gets jammed for some reason and you'd like insight into what's going on, you can also log these metrics, e.g. l.Info(em.String()). Please let me know if anything is unclear.

@kushal288
Contributor Author

Yes, this would work for the Stackdriver surfacer's internal metrics.
However, I was thinking of a more generic solution where we can send metrics to Stackdriver from other surfacers, e.g. the File surfacer.
For context, we are planning to create a new surfacer for BigQuery insertion, and it may have relatively high throughput.
So we want to monitor the channel usage of the BQ surfacer using Stackdriver metrics.

I am trying to figure out how we can achieve this.

@manugarg
Contributor

manugarg commented Feb 6, 2023

Ah, I see. I guess there are a couple of ways to tackle this. I am not yet sure which one makes the most sense:

(In no particular order)

  • Use log-based metrics for such use cases. I think we can make our logs more structured to make this easy.
  • Save the system-wide data channel in runconfig, and use the runconfig API to share it across modules instead of passing it through function args. That way all modules will have access to it. One con of this option is the lack of explicit ordering between creating the data channel, surfacers processing it, and probes beginning to write to it. Not a huge deal, but it becomes another thing to take care of (we already have such things, e.g. global HTTP and gRPC servers).

Can you start with log-based metrics for now, and move to a different approach if that doesn't work well?
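For the second option, a minimal sketch of a runconfig-style shared channel with a setter/getter guarded by a mutex. The function names here are hypothetical illustrations, not cloudprober's actual runconfig API, and a string channel stands in for chan *metrics.EventMetrics:

```go
package main

import (
	"fmt"
	"sync"
)

var (
	mu       sync.Mutex
	dataChan chan string // stand-in for chan *metrics.EventMetrics
)

// SetDataChannel stores the system-wide data channel. It must be called
// once at startup, before surfacers and probes are created -- this is
// exactly the implicit-ordering hazard mentioned above.
func SetDataChannel(c chan string) {
	mu.Lock()
	defer mu.Unlock()
	dataChan = c
}

// DataChannel returns the shared channel (nil if not yet set).
func DataChannel() chan string {
	mu.Lock()
	defer mu.Unlock()
	return dataChan
}

func main() {
	SetDataChannel(make(chan string, 1))
	// Any module (e.g. a BQ surfacer) can now fetch the channel and write to it.
	DataChannel() <- "bq_surfacer channel_usage=25"
	fmt.Println(<-DataChannel())
}
```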

@kushal288
Contributor Author

Yes, we will use log-based metrics for monitoring.
We will see if the need arises for a second implementation.

Thanks for your support!
