Add initial metrics #651

Open · wants to merge 3 commits into main
Conversation


@ronensc ronensc commented Sep 27, 2023

Issue link

What changes have been made

This is another take on #573, adding initial metrics in light of the redesign of codeflare-operator.

Verification steps

I checked that the metrics are available in the codeflare-operator by running it:

# in codeflare-operator root
go mod edit -replace github.com/project-codeflare/multi-cluster-app-dispatcher=../multi-cluster-app-dispatcher
make build
NAMESPACE=default go run ./main.go -kubeconfig <path-to-kubeconfig>

and in a new shell:

curl -s localhost:8080/metrics | grep mcad

# HELP mcad_allocatable_capacity_cpu Allocatable CPU Capacity (in millicores)
# TYPE mcad_allocatable_capacity_cpu gauge
mcad_allocatable_capacity_cpu 49665
# HELP mcad_allocatable_capacity_gpu Allocatable GPU Capacity
# TYPE mcad_allocatable_capacity_gpu gauge
mcad_allocatable_capacity_gpu 0
# HELP mcad_allocatable_capacity_memory Allocatable Memory Capacity
# TYPE mcad_allocatable_capacity_memory gauge
mcad_allocatable_capacity_memory 2.07376797696e+11

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • Testing is not required for this change

Add dependency: sigs.k8s.io/controller-runtime v0.14.6

go get sigs.k8s.io/controller-runtime@v0.14.6
go mod tidy
openshift-ci bot commented Sep 27, 2023

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign sutaakar for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Comment on lines 25 to 43
allocatableCapacityCpu := prometheus.NewGaugeFunc(prometheus.GaugeOpts{
	Subsystem: "mcad",
	Name:      "allocatable_capacity_cpu",
	Help:      "Allocatable CPU Capacity (in millicores)",
}, func() float64 { return controller.GetAllocatableCapacity().MilliCPU })
metrics.Registry.MustRegister(allocatableCapacityCpu)

allocatableCapacityMemory := prometheus.NewGaugeFunc(prometheus.GaugeOpts{
	Subsystem: "mcad",
	Name:      "allocatable_capacity_memory",
	Help:      "Allocatable Memory Capacity",
}, func() float64 { return controller.GetAllocatableCapacity().Memory })
metrics.Registry.MustRegister(allocatableCapacityMemory)

allocatableCapacityGpu := prometheus.NewGaugeFunc(prometheus.GaugeOpts{
	Subsystem: "mcad",
	Name:      "allocatable_capacity_gpu",
	Help:      "Allocatable GPU Capacity",
}, func() float64 { return float64(controller.GetAllocatableCapacity().GPU) })
metrics.Registry.MustRegister(allocatableCapacityGpu)
@ronensc (Author):
The calls to GetAllocatableCapacity() are quite expensive to execute in GaugeFuncs. Should I call it periodically and cache the values, as in the original PR?
https://github.com/project-codeflare/multi-cluster-app-dispatcher/pull/573/files#diff-cd034f141bebdabd3ed7e6118e9581ec4646707beaf8438f164ae51f382f2c1dR69-R77

Contributor:

I believe caching the values like in the original PR is the right call.

Contributor:

Right, listing all the Nodes and all the Pods clearly does not scale at all.

We need to reconsider re-activating / improving the cache, which intersects with #588. That may provide a smooth transition once we migrate over to controller-runtime.

Otherwise, we can consider a simple caching mechanism, e.g. one that rate-limits the call to allocatableCapacity using something like golang.org/x/time/rate.NewLimiter.
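
For illustration, here is a minimal sketch of that rate-limiting idea in Go (the Capacity type and function names are hypothetical stand-ins, not MCAD's actual API):

package metrics

import (
	"sync"
	"time"

	"golang.org/x/time/rate"
)

// Capacity is a hypothetical stand-in for the value returned by the
// controller's expensive GetAllocatableCapacity() call.
type Capacity struct {
	MilliCPU float64
	Memory   float64
	GPU      int64
}

var (
	mu     sync.Mutex
	cached Capacity
	// Allow at most one refresh per minute; while the limiter is
	// exhausted, Allow() returns false and the cached value is served.
	limiter = rate.NewLimiter(rate.Every(time.Minute), 1)
)

// allocatableCapacity wraps the expensive call so GaugeFuncs can invoke
// it on every scrape without listing Nodes and Pods each time.
func allocatableCapacity(refresh func() Capacity) Capacity {
	mu.Lock()
	defer mu.Unlock()
	if limiter.Allow() {
		cached = refresh() // expensive: lists all Nodes and all Pods
	}
	return cached
}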

@ronensc (Author):

I made another commit to schedule the expensive call to run once a minute. I wasn't sure how to use golang.org/x/time/rate.NewLimiter without it looking like overkill.
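
Roughly, the approach looks like this (a sketch only; names and structure are hypothetical and the actual commit may differ):

package metrics

import (
	"sync"
	"time"
)

// capacityCache sketches the once-a-minute refresh: a background
// goroutine updates the cached values, and the Prometheus GaugeFuncs
// read only the cache.
type capacityCache struct {
	mu       sync.RWMutex
	milliCPU float64
	memory   float64
	gpu      float64
}

func (c *capacityCache) start(refresh func() (cpu, mem, gpu float64)) {
	update := func() {
		cpu, mem, gpu := refresh() // expensive: lists all Nodes and all Pods
		c.mu.Lock()
		c.milliCPU, c.memory, c.gpu = cpu, mem, gpu
		c.mu.Unlock()
	}
	update() // prime the cache so the first scrape sees real values
	go func() {
		for range time.Tick(time.Minute) {
			update()
		}
	}()
}

func (c *capacityCache) MilliCPU() float64 {
	c.mu.RLock()
	defer c.mu.RUnlock()
	return c.milliCPU
}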

@ChristianZaccaria (Contributor) commented Sep 27, 2023

Nice one! I just followed the verification steps; it works and shows the CPU, GPU, and memory allocatable capacity 👍. Something to note:

  • I had to change the last bit of this line (on the operator) to v1alpha1 as I was getting the following error after running make build:

github.com/project-codeflare/multi-cluster-app-dispatcher/pkg/apis/quotaplugins/quotasubtree/v1: module github.com/project-codeflare/multi-cluster-app-dispatcher@latest found (v1.35.0, replaced by ../multi-cluster-app-dispatcher), but does not contain package github.com/project-codeflare/multi-cluster-app-dispatcher/pkg/apis/quotaplugins/quotasubtree/v1

Does this need to be changed over in the codeflare-operator?

@ChristianZaccaria (Contributor)

As commented in the mentioned PR, I'm not sure if this would require an ADR as well. cc: @dimakis @anishasthana @astefanutti

@ChristianZaccaria (Contributor)

I think we would want to expose the metrics on port 8083, as done for the original PR.

@ronensc (Author) left a comment

@ChristianZaccaria Thanks for reviewing and trying this out! :)

Nice one! I just followed the verification steps; it works and shows the CPU, GPU, and memory allocatable capacity 👍. Something to note:

  • I had to change the last bit of this line (on the operator) (https://github.com/project-codeflare/codeflare-operator/blob/c3383c8e19a8a8c94169c218f1ed8c9faa8e12d5/main.go#L30) to v1alpha1 as I was getting the following error after running make build:

github.com/project-codeflare/multi-cluster-app-dispatcher/pkg/apis/quotaplugins/quotasubtree/v1: module github.com/project-codeflare/multi-cluster-app-dispatcher@latest found (v1.35.0, replaced by ../multi-cluster-app-dispatcher), but does not contain package github.com/project-codeflare/multi-cluster-app-dispatcher/pkg/apis/quotaplugins/quotasubtree/v1

Does this need to be changed over in the codeflare-operator?

Good catch! I missed that I had to make this change too. I think this is due to #650, and once there is a new release of MCAD, this change in the operator will be mandatory.

I think we would want to expose the metrics on port 8083, as done for the original PR.

IIUC, after the redesign, the port to expose is the operator's concern. Its default value is set here:
https://github.com/project-codeflare/codeflare-operator/blob/99d2fc8b4e3bdcada5259f4417a8c2b2ca342430/main.go#L98
A different value can be set via metrics.bindAddress in the codeflare-operator-config ConfigMap.

@eranra (Contributor) commented Sep 27, 2023

@ronensc wonderful. I will be happy to close #573 --- tell me when :-)

@astefanutti (Contributor)

I think we would want to expose the metrics on port 8083, as done for the original PR.

IIUC, after the redesign, the port to expose is the operator's concern. Its default value is set here https://github.com/project-codeflare/codeflare-operator/blob/99d2fc8b4e3bdcada5259f4417a8c2b2ca342430/main.go#L98 A different value can be set via metrics.bindAddress in the codeflare-operator-config ConfigMap.

That's correct. MCAD is only responsible for the instrumentation and the registration of the metrics. Serving these metrics is the responsibility of the CodeFlare operator.
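
For context, here is a minimal sketch of the serving side with controller-runtime v0.14 (the version this PR adds); this is an illustration, not the operator's actual code:

package main

import (
	ctrl "sigs.k8s.io/controller-runtime"
)

func main() {
	// Gauges registered on the shared metrics.Registry (as this PR does in
	// MCAD) are served by the manager's metrics endpoint; the bind address
	// is what metrics.bindAddress in codeflare-operator-config controls.
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		MetricsBindAddress: ":8080", // assumption: matches the default linked above
	})
	if err != nil {
		panic(err)
	}
	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		panic(err)
	}
}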

@rachelt44
@ChristianZaccaria any conclusion on the necessity of an ADR? I started working on it.

@ronensc (Author) commented Nov 6, 2023

@ChristianZaccaria @astefanutti Is there anything I can do to make progress on this PR?
