-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Scale Set Metrics ADR #2568
Scale Set Metrics ADR #2568
Conversation
|
||
### Metrics exposed by the controller | ||
|
||
To get a better understanding about health and workings of the cluster |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i assume each metrics will be tagged with the RunnerScleSet id/name/url etc?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I used labels that you assigned in POC which are great! We can always add or delete ones. That is the reason I said I'm not sure if we actually need to include them in the ADR. Changing labels or metrics would require an update on the ADR. But I wanted to display the basic structure and then trim it down if we need or extend it if you think we should
ephemeral runner pod after multiple retries, it will set the state of the | ||
`EphemeralRunner` to failed. Since the controller can not recover from this | ||
state, it can be useful to set Prometheus alerts to catch this issue quickly. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we need metrics for EpheneralRunnerSet and AutoScalingRunnerSet level?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They are just controllers that are up. Not sure what added value will they bring but we can
EphemeralRunnerSet is exposing pending, failed and running ephemeral runners. I'd start from here and then extend them if there is a need for more metrics
service, it can expose actions service related data through metrics. In | ||
particular: | ||
|
||
- `available_jobs` - Number of jobs with `runs-on` matching the runner scale set name. Jobs are not yet assigned but are acquired by the runner scale set. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
available_jobs
is for not acquired job.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, mistake in copying...
|
||
Controller metrics belong to the `github_runner_scale_set_controller` subsystem, | ||
so the names are going to have `github_runner_scale_set_controller` prefix. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we also include a section to doc all labels we will add to each metric?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure, I kind of wanted to have them displayed here in case we need them. Maybe it should not even be in the ADR in case we want to extend metrics. I just included it for comments
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be useful as we create documentation to have the set of labels we apply for each metric.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In addition, I'd love it if we could have the following:
- An ability to configure which labels to be added to metrics via command-line flags, and
- An ability to add useful but can-be-source-of-prometheus-cardinality-issue labels (like runner id, job id, and any kind of "ids" that differ across instances of jobs = high cardinality).
Regardless of if we document every supported label in ADR, I do prefer it if the ADR clearly says that it is supposed to support dangerous-and-useful labels, and at the same time, it is going to have the ability to toggle which labels to be added. Those two would enable users to label metrics with various IDs for easier debugging and monitoring for relatively small-scale ARC deployments and to turn those ID labels off for large-scale deployments.
This old thread might provide more context about that #2176 (comment)
### Metric names | ||
|
||
Listener metrics belong to the `github_runner_scale_set` subsystem, so the names | ||
are going to have the `github_runner_scale_set_` prefix. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will we cross any label name limits with this prefix? Also why not gh_runner_scale_set
and gh_runner_scale_set_controller
to stay consistent with the chart names?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
of all listeners. This is not a big problem but is something to point out. | ||
- Managing requests/limits can be tricky. | ||
|
||
### Use a Prometheus Pushgateway |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
### Use a Prometheus Pushgateway | |
### Option 3: Use a Prometheus Pushgateway |
be applied across all `AutoscalingRunnerSets`, it is difficult to inherit this | ||
configuration by applying helm charts. | ||
|
||
### Create an aggregator service |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am personally highly in favour of this approach, especially since this is the same way it was implemented in the legacy modes
Co-authored-by: Bassem Dghaidi <568794+Link-@users.noreply.github.com>
We've done enough iterations on this, let's ship it 🚀 thank you for drafting a great doc. |
Propose a solution for exposing metrics by the
gha-runner-scale-set
andgha-runner-scale-set-controller