
Add optional metrics in Stackdriver surfacer to monitor channel and cache usage #262

Open
kushal288 opened this issue Feb 1, 2023 · 8 comments
Labels
enhancement New feature or request

Comments

@kushal288
Contributor

Since the Stackdriver surfacer maintains a cache of metrics, we would also like to emit metrics that track the cache and channel lengths at regular intervals. This will be useful for high-throughput traffic, where metrics might accumulate in the cache.

@kushal288 kushal288 added the enhancement New feature or request label Feb 1, 2023
@kushal288
Contributor Author

@manugarg For implementing this, I was thinking that it should emit metrics in a similar way to how we emit them from probers.
However, I am observing that fmt.Printf("cacheLength %v", len(cache)) doesn't export this metric to surfacers.
To log metrics from inside the surfacer, we did:

// Channel usage as a percentage of the channel's capacity.
usage := float64(len(s.writeChan)) * 100.0 / float64(cap(s.writeChan))

// Build a gauge EventMetrics and push it onto the surfacer's own write channel.
em := metrics.NewEventMetrics(time.Now()).
	AddLabel("surfacer", "custom_stackdriver").
	AddMetric("channel_usage", metrics.NewFloat(usage)).
	AddMetric("cache_entries", metrics.NewFloat(float64(len(s.cache))))
em.Kind = metrics.GAUGE
s.writeChan <- em.Clone()

We could do this in the Stackdriver surfacer, but if we want to log metrics from other surfacers, how should we go about it?

@manugarg
Contributor

manugarg commented Feb 1, 2023

@kushal288

Since the cache implementation is different for each surfacer, I think we'll need to implement this in each surfacer individually. I'll give it more thought, but one quick way to get the info you need will be to log these metrics using logger and then create metrics from the logs in stackdriver.
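A minimal, stdlib-only sketch of what such a loggable line could look like. The function name and format here are illustrative (not cloudprober's API); the point is a stable, parseable line that a Stackdriver log-based metric can extract values from:

```go
package main

import (
	"fmt"
	"log"
)

// formatUsageLog builds a log line in a stable, parseable format so a
// Stackdriver log-based metric can extract the values with a regex.
// The label names follow the subsystem/subsystem_id convention
// discussed later in this thread.
func formatUsageLog(subsystem, subsystemID string, chanLen, chanCap, cacheEntries int) string {
	usage := float64(chanLen) * 100.0 / float64(chanCap)
	return fmt.Sprintf("internal_metrics subsystem=%s subsystem_id=%s channel_usage=%.2f cache_entries=%d",
		subsystem, subsystemID, usage, cacheEntries)
}

func main() {
	// Hypothetical numbers standing in for len(s.writeChan),
	// cap(s.writeChan), and len(s.cache).
	log.Print(formatUsageLog("surfacer", "stackdriver", 250, 1000, 42))
}
```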

@manugarg
Contributor

manugarg commented Feb 3, 2023

I think we can standardize internal metrics with the following labels: subsystem and subsystem_id. For example, for stackdriver surfacer:
subsystem=surfacer, subsystem_id=stackdriver
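For illustration, a self-contained sketch of how a metric line might render with these labels, in a Prometheus-style text form. This is only a demonstration of the label convention; in cloudprober itself the labels would be carried on an EventMetrics via AddLabel:

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// metricLine renders a metric name, labels, and value as a single
// text line, e.g. cache_entries{subsystem=surfacer,subsystem_id=stackdriver} 42.
func metricLine(name string, labels map[string]string, value float64) string {
	keys := make([]string, 0, len(labels))
	for k := range labels {
		keys = append(keys, k)
	}
	sort.Strings(keys) // deterministic label order
	pairs := make([]string, 0, len(keys))
	for _, k := range keys {
		pairs = append(pairs, fmt.Sprintf("%s=%s", k, labels[k]))
	}
	return fmt.Sprintf("%s{%s} %g", name, strings.Join(pairs, ","), value)
}

func main() {
	fmt.Println(metricLine("cache_entries",
		map[string]string{"subsystem": "surfacer", "subsystem_id": "stackdriver"}, 42))
}
```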

@kushal288
Contributor Author

Yes, I think these labels make sense. We can add them.
However, the issue is that currently any metric printed to stdout from a surfacer implementation (e.g. fmt.Printf("cacheLength{subsystem=surfacer, subsystem_id=stackdriver} %v", len(cache))) doesn't get exported to the surfacer component, so we are not able to use the existing metric-collection infrastructure for the surfacers themselves.

@manugarg
Contributor

manugarg commented Feb 6, 2023

@kushal288 So the way it works is slightly more complicated than that. We build an EventMetrics object and pass it on to surfacers through a channel. In this case, instead of pushing metrics to a channel, we can just use the surfacer's own Write() interface:

func (s *SDSurfacer) Write(_ context.Context, em *metrics.EventMetrics) {

So if the option is set to export internal metrics, we'll start another loop (in the New() method) and export metrics regularly like this:

func New(...) {
	...
	if c.GetWriteInternalMetrics() {
		go s.internalMetricsLoop(ctx)
	}
}

func (s *SDSurfacer) internalMetricsLoop(ctx context.Context) {
	...
	s.Write(ctx, metrics.NewEventMetrics(...))
	...
}

(For the structure of internalMetricsLoop, take a look at other context based loops in the same file and other places maybe?)

Does that make sense, or did I misunderstand your question? One thing to think about, though: if the surfacer gets jammed for some reason and you'd like insight into what's going on, you can also log these metrics, e.g. l.Info(em.String()). Please let me know if anything is unclear.

@kushal288
Contributor Author

Yes, this would work for the Stackdriver surfacer's internal metrics.
However, I was thinking of a more generic solution where we can send metrics to Stackdriver from other surfacers, e.g. the File surfacer.
For context, we are planning to create a new surfacer for BigQuery insertion, and it may have relatively high throughput.
So we want to monitor the channel usage of the BQ surfacer using Stackdriver metrics.

I am trying to figure out how we can achieve this.

@manugarg
Contributor

manugarg commented Feb 6, 2023

Ah, I see. I guess there are a couple of ways to tackle this. I am not yet sure which one makes the most sense:

(In no particular order)

  • Use log-based metrics for such use cases. I think we can make our logs more structured to make this easy.
  • Save the system-wide data channel in runconfig, and use the runconfig API to share it across modules instead of passing it through function args. That way all modules will have access to it. One con of this option is the lack of explicit ordering between creating the data channel, surfacers processing it, and probes beginning to write to it. Not a huge deal, but it becomes another thing to take care of (we already have such things, e.g. global HTTP and gRPC servers).

Can you start with log-based metrics for now, and move to a different approach if that doesn't work well?
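For the second option, a minimal sketch of a runconfig-style shared channel with a setter/getter guarded by a mutex. The function names here are hypothetical illustrations, not cloudprober's actual runconfig API, and a string channel stands in for chan *metrics.EventMetrics:

```go
package main

import (
	"fmt"
	"sync"
)

var (
	mu       sync.Mutex
	dataChan chan string // stand-in for chan *metrics.EventMetrics
)

// SetDataChannel stores the system-wide data channel. It must be called
// once at startup, before surfacers and probes are created -- this is
// exactly the implicit-ordering hazard mentioned above.
func SetDataChannel(c chan string) {
	mu.Lock()
	defer mu.Unlock()
	dataChan = c
}

// DataChannel returns the shared channel (nil if not yet set).
func DataChannel() chan string {
	mu.Lock()
	defer mu.Unlock()
	return dataChan
}

func main() {
	SetDataChannel(make(chan string, 1))
	// Any module (e.g. a BQ surfacer) can now fetch the channel and write to it.
	DataChannel() <- "bq_surfacer channel_usage=25"
	fmt.Println(<-DataChannel())
}
```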

@kushal288
Contributor Author

Yes, we will use log-based metrics for monitoring.
We will see if the need arises for a second implementation.

Thanks for your support!
