
Sidecar usage for outside of GKE clusters #251

Open
mboveri opened this issue Aug 21, 2020 · 12 comments

Comments

@mboveri

mboveri commented Aug 21, 2020

Hello,

I have kubernetes clusters in multiple clouds (GCP, AWS, on-prem OpenStack) and would like to export all my prometheus metrics to Stackdriver. Right now, stackdriver-prometheus-sidecar does not have the ability to explicitly specify which service account credentials to use when communicating with the Google Cloud Monitoring (GCM) API. This means that the sidecar cannot function outside of GCE nodes, where Workload Identity normally provides authentication and authorization. It would be really nice if we were able to leverage the stackdriver-prometheus-sidecar to export metrics from our non-GCP Kubernetes clusters into GCM. Is it possible to add a configuration flag to the sidecar that specifies a location on disk where service account keys could be placed? That way, one could stash the service account keys in a kubernetes Secret object and mount them into the container, even on clusters outside of GCP.

@Dnefedkin

The sidecar uses the Google Cloud Client Library for Go, which in turn uses Application Default Credentials (ADC). ADC lets you pass credentials via the GOOGLE_APPLICATION_CREDENTIALS environment variable, see here. So you can create a Secret containing a JSON credentials file, mount it as a volume in the sidecar container, and set the GOOGLE_APPLICATION_CREDENTIALS environment variable inside the container.
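As a rough sketch of what that looks like (the Secret name, mount path, and image tag here are illustrative, not from the thread):

```yaml
# First create the Secret from a downloaded service account key, e.g.:
#   kubectl create secret generic gcm-credentials --from-file=key.json=/path/to/key.json
# Then wire it into the sidecar container spec:
containers:
  - name: sidecar
    image: gcr.io/stackdriver-prometheus/stackdriver-prometheus-sidecar:0.8.0  # illustrative tag
    env:
      - name: GOOGLE_APPLICATION_CREDENTIALS
        value: /var/secrets/google/key.json   # must match the mounted file path
    volumeMounts:
      - name: gcm-credentials
        mountPath: /var/secrets/google
        readOnly: true
volumes:
  - name: gcm-credentials
    secret:
      secretName: gcm-credentials
```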

@Dnefedkin

Also note that you might have to specify the --stackdriver.generic.location="some-location-maybe-your-datacenter-name" and --stackdriver.generic.namespace="K8S-cluster-name" parameters for the sidecar, so that metrics are created using the generic_task monitored resource.

@mboveri
Author

mboveri commented Aug 26, 2020

I have the JSON but am getting stuck on the volume-mounting bit. Do you have an example of that I could take a look at?

@mboveri
Author

mboveri commented Aug 26, 2020

I was able to get that working but am now getting the following error in the sidecar's container logs:
level=warn ts=2020-08-26T21:17:02.122Z caller=queue_manager.go:534 component=queue_manager msg="Unrecoverable error sending samples to remote storage" err="rpc error: code = InvalidArgument desc = One or more TimeSeries could not be written: Unrecognized region or location.: timeSeries[0-199]"

level=warn ts=2020-08-26T21:16:44.686Z caller=queue_manager.go:534 component=queue_manager msg="Unrecoverable error sending samples to remote storage" err="rpc error: code = InvalidArgument desc = One or more TimeSeries could not be written: Field timeSeries[10].metric.type had an invalid value of \"external.googleapis.com/prometheus/clouddriver:jvm:memory:used\": The metric type must be a URL-formatted string with a domain and non-empty path.: timeSeries[10]; Field timeSeries[11].metric.type had an invalid value of \"external.googleapis.com/prometheus/clouddriver:jvm:memory:used\": The metric type must be a URL-formatted string with a domain and non-empty path.

As well as this error:
The metric type must be a URL-formatted string with a domain and non-empty path.

@mboveri
Author

mboveri commented Aug 26, 2020

For anyone else attempting to mount volumes to Prometheus: the minimum chart version is 8.13.13, when Volume and VolumeMounts were added - helm/charts@ef0d749

@Dnefedkin

I was able to get that working but am now getting the following error in the sidecar's container logs:
level=warn ts=2020-08-26T21:17:02.122Z caller=queue_manager.go:534 component=queue_manager msg="Unrecoverable error sending samples to remote storage" err="rpc error: code = InvalidArgument desc = One or more TimeSeries could not be written: Unrecognized region or location.: timeSeries[0-199]"

As I've mentioned in the comment above, you need to pass --stackdriver.generic.location as a sidecar parameter to fill the mandatory "location" label associated with the generic_task monitored resource type.

level=warn ts=2020-08-26T21:16:44.686Z caller=queue_manager.go:534 component=queue_manager msg="Unrecoverable error sending samples to remote storage" err="rpc error: code = InvalidArgument desc = One or more TimeSeries could not be written: Field timeSeries[10].metric.type had an invalid value of "external.googleapis.com/prometheus/clouddriver:jvm:memory:used": The metric type must be a URL-formatted string with a domain and non-empty path.: timeSeries[10]; Field timeSeries[11].metric.type had an invalid value of "external.googleapis.com/prometheus/clouddriver:jvm:memory:used": The metric type must be a URL-formatted string with a domain and non-empty path.

Most probably you're not specifying an --include filter as a sidecar parameter; as a result, the sidecar attempts to send ALL Prometheus metrics to Google Cloud Monitoring. I doubt this is what you really want, as it can be costly (see Pricing). Please consider setting --include accordingly.
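For illustration (this particular matcher is made up; --include takes a Prometheus-style series selector per the sidecar README), a filter passed in the sidecar's args might look like:

```
--include='{__name__=~"clouddriver_.+"}'
```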

Note that metric names in Google Cloud Monitoring must be valid URL-formatted strings, and the sidecar generates metric names in the external.googleapis.com/prometheus/<prometheus_metric_name> format. In this case clouddriver:jvm:memory:used is a Prometheus metric name; the colons in it make the generated Google Cloud Monitoring metric name an invalid URL. If you really need to send metrics with colons in their names to Google Cloud Monitoring, you have to use the Prometheus relabeling feature to rename these metrics in Prometheus first.
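As an illustrative sketch (not from the thread), a metric_relabel_configs rule in the relevant scrape_config could rewrite a name like clouddriver:jvm:memory:used to use underscores; note this specific regex is an assumption that only matches names with exactly three colons:

```yaml
# In the Prometheus scrape_config for the affected job (illustrative):
metric_relabel_configs:
  - source_labels: [__name__]
    regex: '(.+):(.+):(.+):(.+)'      # matches names with exactly three colons
    target_label: __name__
    replacement: '${1}_${2}_${3}_${4}' # e.g. clouddriver_jvm_memory_used
```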

@mboveri
Author

mboveri commented Aug 28, 2020

Looks like setting the following did the trick. We are working on getting some better filtering, so we have stopped sending for now, but we were able to get the OpenStack cluster to connect and see metrics in the Metrics Explorer before disabling:

```
- --stackdriver.project-id={redacted}
- --prometheus.wal-directory=/prometheus/wal
- --stackdriver.kubernetes.location={redacted}
- --stackdriver.kubernetes.cluster-name={redacted}
- --stackdriver.generic.namespace={redacted}
- --stackdriver.generic.location={redacted}
```

Thanks for all your help @Dnefedkin !

@mboveri
Author

mboveri commented Aug 28, 2020

We also still need to figure out why some metrics are getting rejected.

We see errors like:
level=warn ts=2020-08-27T23:42:27.611Z caller=queue_manager.go:534 component=queue_manager msg="Unrecoverable error sending samples to remote storage" err="rpc error: code = InvalidArgument desc = One or more TimeSeries could not be written: Unrecognized region or location.: timeSeries[0-199]"

level=warn ts=2020-08-27T23:42:22.656Z caller=queue_manager.go:534 component=queue_manager msg="Unrecoverable error sending samples to remote storage" err="rpc error: code = InvalidArgument desc = Field timeSeries[0].points[0].interval.start_time had an invalid value of \"2020-08-27T16:39:43.687-07:00\": The start time must be before the end time (2020-08-27T16:39:43.687-07:00) for the non-gauge metric 'external.googleapis.com/prometheus/container_fs_sector_writes_total'."

level=warn ts=2020-08-27T23:23:07.075Z caller=queue_manager.go:534 component=queue_manager msg="Unrecoverable error sending samples to remote storage" err="rpc error: code = InvalidArgument desc = One or more TimeSeries could not be written: Field timeSeries[100].metric.type had an invalid value of \"external.googleapis.com/prometheus/gate:hystrix:isCircuitBreakerOpen\": The metric type must be a URL-formatted string with a domain and non-empty path.: timeSeries[100];

level=warn ts=2020-08-27T23:23:13.486Z caller=queue_manager.go:534 component=queue_manager msg="Unrecoverable error sending samples to remote storage" err="rpc error: code = InvalidArgument desc = One or more TimeSeries could not be written: The new labels would cause the metric external.googleapis.com/prometheus/kube_deployment_labels to have over 10 labels.: timeSeries[180]"

@mboveri
Author

mboveri commented Aug 28, 2020

I think at least:
level=warn ts=2020-08-27T23:23:07.075Z caller=queue_manager.go:534 component=queue_manager msg="Unrecoverable error sending samples to remote storage" err="rpc error: code = InvalidArgument desc = One or more TimeSeries could not be written: Field timeSeries[100].metric.type had an invalid value of \"external.googleapis.com/prometheus/gate:hystrix:isCircuitBreakerOpen\": The metric type must be a URL-formatted string with a domain and non-empty path.: timeSeries[100];

may be related to your note about invalid URLs caused by colons, though.

@Dnefedkin

level=warn ts=2020-08-27T23:42:22.656Z caller=queue_manager.go:534 component=queue_manager msg="Unrecoverable error sending samples to remote storage" err="rpc error: code = InvalidArgument desc = Field timeSeries[0].points[0].interval.start_time had an invalid value of "2020-08-27T16:39:43.687-07:00": The start time must be before the end time (2020-08-27T16:39:43.687-07:00) for the non-gauge metric 'external.googleapis.com/prometheus/container_fs_sector_writes_total'."

container_fs_sector_writes_total sounds like a counter metric, not a gauge, so it should have a start time before the end time to reflect the time interval. If you want to represent this metric as a gauge, you can use a static_metadata entry in the config file, see https://github.com/Stackdriver/stackdriver-prometheus-sidecar#file
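Per the sidecar README linked above, such an entry might look like the following (the help text here is illustrative):

```yaml
# In the sidecar's config file:
static_metadata:
  - metric: container_fs_sector_writes_total
    type: gauge   # override the inferred counter type
    help: Sector writes, exported as a gauge.  # illustrative
```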

level=warn ts=2020-08-27T23:23:13.486Z caller=queue_manager.go:534 component=queue_manager msg="Unrecoverable error sending samples to remote storage" err="rpc error: code = InvalidArgument desc = One or more TimeSeries could not be written: The new labels would cause the metric external.googleapis.com/prometheus/kube_deployment_labels to have over 10 labels.: timeSeries[180]"

This sounds like a Google Cloud Monitoring API restriction: a maximum of 10 labels per time series.

@mboveri
Author

mboveri commented Sep 2, 2020

Awesome, thanks @Dnefedkin !
