
Sidecar usage for outside of GKE clusters #251

Open
mboveri opened this issue Aug 21, 2020 · 12 comments

Comments

@mboveri

mboveri commented Aug 21, 2020

Hello,

I have kubernetes clusters in multiple clouds (GCP, AWS, on-prem OpenStack) and would like to export all my prometheus metrics to Stackdriver. Right now, stackdriver-prometheus-sidecar does not have the ability to explicitly specify which service account credentials to use when communicating with the Google Cloud Monitoring (GCM) API. This means that the sidecar cannot function outside of GCE nodes, where Workload Identity normally provides authentication and authorization. It would be really nice if we were able to leverage the stackdriver-prometheus-sidecar to export metrics from our non-GCP Kubernetes clusters into GCM. Is it possible to add a configuration flag to the sidecar that specifies a location on disk where service account keys could be placed? That way, one could stash the service account keys in a kubernetes Secret object and mount them into the container, even on clusters outside of GCP.

@Dnefedkin

The sidecar uses the Google Cloud Client Library for Go, which in turn uses Application Default Credentials (ADC). ADC lets you pass credentials via the GOOGLE_APPLICATION_CREDENTIALS environment variable, see here. So you can create a Secret containing a JSON credentials file, mount it as a volume in the sidecar container, and set the GOOGLE_APPLICATION_CREDENTIALS environment variable inside the container.
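As a rough sketch of what that looks like (the Secret name, mount path, and image tag here are illustrative, not from the thread):

```yaml
# First create the Secret from a downloaded service account key, e.g.:
#   kubectl create secret generic gcm-credentials --from-file=key.json=/path/to/key.json
# Then wire it into the sidecar container spec:
containers:
  - name: sidecar
    image: gcr.io/stackdriver-prometheus/stackdriver-prometheus-sidecar:0.8.0  # illustrative tag
    env:
      - name: GOOGLE_APPLICATION_CREDENTIALS
        value: /var/secrets/google/key.json   # must match the mounted file path
    volumeMounts:
      - name: gcm-credentials
        mountPath: /var/secrets/google
        readOnly: true
volumes:
  - name: gcm-credentials
    secret:
      secretName: gcm-credentials
```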

@Dnefedkin

Also note that you might have to specify the --stackdriver.generic.location="some-location-maybe-your-datacenter-name" and --stackdriver.generic.namespace="K8S-cluster-name" parameters for the sidecar, so that metrics are created using the generic_task monitored resource.

@mboveri
Author

mboveri commented Aug 26, 2020

I have the JSON but am getting stuck on the volume-mounting bit. Do you have an example of that I could take a look at?

@mboveri
Author

mboveri commented Aug 26, 2020

I was able to get that working but am now getting the following error in the sidecar's container logs:
level=warn ts=2020-08-26T21:17:02.122Z caller=queue_manager.go:534 component=queue_manager msg="Unrecoverable error sending samples to remote storage" err="rpc error: code = InvalidArgument desc = One or more TimeSeries could not be written: Unrecognized region or location.: timeSeries[0-199]"

level=warn ts=2020-08-26T21:16:44.686Z caller=queue_manager.go:534 component=queue_manager msg="Unrecoverable error sending samples to remote storage" err="rpc error: code = InvalidArgument desc = One or more TimeSeries could not be written: Field timeSeries[10].metric.type had an invalid value of \"external.googleapis.com/prometheus/clouddriver:jvm:memory:used\": The metric type must be a URL-formatted string with a domain and non-empty path.: timeSeries[10]; Field timeSeries[11].metric.type had an invalid value of \"external.googleapis.com/prometheus/clouddriver:jvm:memory:used\": The metric type must be a URL-formatted string with a domain and non-empty path.

As well as this error:
The metric type must be a URL-formatted string with a domain and non-empty path.

@mboveri
Author

mboveri commented Aug 26, 2020

For anyone else attempting to mount volumes to Prometheus: the minimum chart version is 8.13.13, when Volume and VolumeMounts were added - helm/charts@ef0d749

@Dnefedkin

I was able to get that working but am now getting the following error in the sidecar's container logs:
level=warn ts=2020-08-26T21:17:02.122Z caller=queue_manager.go:534 component=queue_manager msg="Unrecoverable error sending samples to remote storage" err="rpc error: code = InvalidArgument desc = One or more TimeSeries could not be written: Unrecognized region or location.: timeSeries[0-199]"

As I've mentioned in the comment above, you need to pass --stackdriver.generic.location as a sidecar parameter to fill the mandatory "location" label associated with the generic_task monitored resource type.

level=warn ts=2020-08-26T21:16:44.686Z caller=queue_manager.go:534 component=queue_manager msg="Unrecoverable error sending samples to remote storage" err="rpc error: code = InvalidArgument desc = One or more TimeSeries could not be written: Field timeSeries[10].metric.type had an invalid value of "external.googleapis.com/prometheus/clouddriver:jvm:memory:used": The metric type must be a URL-formatted string with a domain and non-empty path.: timeSeries[10]; Field timeSeries[11].metric.type had an invalid value of "external.googleapis.com/prometheus/clouddriver:jvm:memory:used": The metric type must be a URL-formatted string with a domain and non-empty path.

Most probably you're not specifying an --include filter as a sidecar parameter; as a result, the sidecar attempts to send ALL Prometheus metrics to Google Cloud Monitoring. I doubt this is what you really want, as it can be costly (see Pricing). Please consider setting --include accordingly.
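For illustration (this particular matcher is made up; --include takes a Prometheus-style series selector per the sidecar README), a filter passed in the sidecar's args might look like:

```
--include='{__name__=~"clouddriver_.+"}'
```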

Note that metric names in Google Cloud Monitoring must be valid URL-formatted strings, and the sidecar generates metric names in the external.googleapis.com/prometheus/<prometheus_metric_name> format. In this case clouddriver:jvm:memory:used is a Prometheus metric name; the colons in it make the generated Google Cloud Monitoring metric name an invalid URL. If you really need to send metrics with colons in their names to Google Cloud Monitoring, you have to use the Prometheus relabeling feature to rename these metrics in Prometheus first.
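As an illustrative sketch (not from the thread), a metric_relabel_configs rule in the relevant scrape_config could rewrite a name like clouddriver:jvm:memory:used to use underscores; note this specific regex is an assumption that only matches names with exactly three colons:

```yaml
# In the Prometheus scrape_config for the affected job (illustrative):
metric_relabel_configs:
  - source_labels: [__name__]
    regex: '(.+):(.+):(.+):(.+)'      # matches names with exactly three colons
    target_label: __name__
    replacement: '${1}_${2}_${3}_${4}' # e.g. clouddriver_jvm_memory_used
```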

@mboveri
Author

mboveri commented Aug 28, 2020

Looks like setting the following did the trick. We are working on getting some better filtering, so we have stopped sending for now, but we were able to get the OpenStack cluster to connect and see metrics in the Metrics Explorer before disabling:

```
- --stackdriver.project-id={redacted}
- --prometheus.wal-directory=/prometheus/wal
- --stackdriver.kubernetes.location={redacted}
- --stackdriver.kubernetes.cluster-name={redacted}
- --stackdriver.generic.namespace={redacted}
- --stackdriver.generic.location={redacted}
```

Thanks for all your help @Dnefedkin !

@mboveri
Author

mboveri commented Aug 28, 2020

We also still need to figure out why some metrics are getting rejected.

We see errors like:
level=warn ts=2020-08-27T23:42:27.611Z caller=queue_manager.go:534 component=queue_manager msg="Unrecoverable error sending samples to remote storage" err="rpc error: code = InvalidArgument desc = One or more TimeSeries could not be written: Unrecognized region or location.: timeSeries[0-199]"

level=warn ts=2020-08-27T23:42:22.656Z caller=queue_manager.go:534 component=queue_manager msg="Unrecoverable error sending samples to remote storage" err="rpc error: code = InvalidArgument desc = Field timeSeries[0].points[0].interval.start_time had an invalid value of \"2020-08-27T16:39:43.687-07:00\": The start time must be before the end time (2020-08-27T16:39:43.687-07:00) for the non-gauge metric 'external.googleapis.com/prometheus/container_fs_sector_writes_total'."

level=warn ts=2020-08-27T23:23:07.075Z caller=queue_manager.go:534 component=queue_manager msg="Unrecoverable error sending samples to remote storage" err="rpc error: code = InvalidArgument desc = One or more TimeSeries could not be written: Field timeSeries[100].metric.type had an invalid value of \"external.googleapis.com/prometheus/gate:hystrix:isCircuitBreakerOpen\": The metric type must be a URL-formatted string with a domain and non-empty path.: timeSeries[100];

level=warn ts=2020-08-27T23:23:13.486Z caller=queue_manager.go:534 component=queue_manager msg="Unrecoverable error sending samples to remote storage" err="rpc error: code = InvalidArgument desc = One or more TimeSeries could not be written: The new labels would cause the metric external.googleapis.com/prometheus/kube_deployment_labels to have over 10 labels.: timeSeries[180]"

@mboveri
Author

mboveri commented Aug 28, 2020

I think at least:
level=warn ts=2020-08-27T23:23:07.075Z caller=queue_manager.go:534 component=queue_manager msg="Unrecoverable error sending samples to remote storage" err="rpc error: code = InvalidArgument desc = One or more TimeSeries could not be written: Field timeSeries[100].metric.type had an invalid value of \"external.googleapis.com/prometheus/gate:hystrix:isCircuitBreakerOpen\": The metric type must be a URL-formatted string with a domain and non-empty path.: timeSeries[100];

may be related to your note about invalid URLs caused by colons, though.

@Dnefedkin

level=warn ts=2020-08-27T23:42:22.656Z caller=queue_manager.go:534 component=queue_manager msg="Unrecoverable error sending samples to remote storage" err="rpc error: code = InvalidArgument desc = Field timeSeries[0].points[0].interval.start_time had an invalid value of "2020-08-27T16:39:43.687-07:00": The start time must be before the end time (2020-08-27T16:39:43.687-07:00) for the non-gauge metric 'external.googleapis.com/prometheus/container_fs_sector_writes_total'."

container_fs_sector_writes_total sounds like a counter metric, not a gauge, so it should have a start time before the end time to reflect the time interval. If you want to represent this metric as a gauge, you can use a static_metadata entry in the config file, see https://github.com/Stackdriver/stackdriver-prometheus-sidecar#file
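Per the sidecar README linked above, such an entry might look like the following (the help text here is illustrative):

```yaml
# In the sidecar's config file:
static_metadata:
  - metric: container_fs_sector_writes_total
    type: gauge   # override the inferred counter type
    help: Sector writes, exported as a gauge.  # illustrative
```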

level=warn ts=2020-08-27T23:23:13.486Z caller=queue_manager.go:534 component=queue_manager msg="Unrecoverable error sending samples to remote storage" err="rpc error: code = InvalidArgument desc = One or more TimeSeries could not be written: The new labels would cause the metric external.googleapis.com/prometheus/kube_deployment_labels to have over 10 labels.: timeSeries[180]"

This sounds like a Google Cloud Monitoring API restriction: a maximum of 10 labels per time series.

@mboveri
Author

mboveri commented Sep 2, 2020

Awesome, thanks @Dnefedkin !
