Telemetry stats #515

Merged: 2 commits from the telemetry-stats branch merged into main on Apr 11, 2022

Conversation

@jakedt (Member) commented Apr 6, 2022

No description provided.

@jakedt requested a review from @jzelinskie on April 6, 2022 00:21
@github-actions bot added the area/CLI (Affects the command line) and area/dependencies (Affects dependencies) labels on Apr 6, 2022
@jzelinskie (Member) left a comment:

only some small things

internal/telemetry/metrics.go (resolved)
internal/telemetry/metrics.go (outdated, resolved)
internal/telemetry/metrics.go (outdated, resolved)
@jakedt force-pushed the telemetry-stats branch 2 times, most recently from 7071dbd to 892605f, on April 6, 2022 13:35
@jakedt requested a review from @ecordell on April 6, 2022 15:33
internal/graph/check.go (outdated, resolved)
internal/middleware/usagemetrics/usagemetrics.go (outdated, resolved)

## Collected metrics

### spicedb_telemetry_info (Gauge)
Contributor:

As I was reviewing, I kept coming back to this doc to make sure it matched the impl; maybe we should throw something on the backlog to automate it.
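
For readers following along, `spicedb_telemetry_info` is the kind of "info" gauge that is registered once with a constant value of 1 and carries its data in labels. A minimal sketch of that pattern with client_golang's promauto; the label names and help text here are illustrative, not necessarily what this PR ships:

```go
package telemetry

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// registerInfoGauge registers an "info"-style gauge whose value is always 1;
// the useful data lives in its labels. Label names are illustrative only.
func registerInfoGauge(reg prometheus.Registerer, version, clusterID string) {
	info := promauto.With(reg).NewGaugeVec(prometheus.GaugeOpts{
		Namespace: "spicedb",
		Subsystem: "telemetry",
		Name:      "info", // exposed as spicedb_telemetry_info
		Help:      "Static build and cluster metadata reported via telemetry.",
	}, []string{"version", "cluster_id"})
	info.WithLabelValues(version, clusterID).Set(1)
}
```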

Interval = time.Hour
)

func writeTimeSeries(ctx context.Context, client *http.Client, endpoint string, ts []*prompb.TimeSeries) error {
Contributor:

Discussed offline, but I tried to find a spec for the remote write API and failed. The closest I got was https://github.com/prometheus/prometheus/blob/main/storage/remote/client.go#L191

Maybe we could use that implementation, or at least link back to it, so that if there are changes between remote write API versions we know where to look.

@jakedt (Member, PR author):

Is there a reason we wouldn't want to use this actual client?
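
For reference, the payload a remote write endpoint expects is a snappy-compressed `prompb.WriteRequest` POSTed with a few fixed headers. A rough sketch of what `writeTimeSeries` could look like, assuming the gogo-generated `Marshal` method on the vendored prompb types; error handling is kept minimal:

```go
package telemetry

import (
	"bytes"
	"context"
	"fmt"
	"net/http"

	"github.com/golang/snappy"
	"github.com/prometheus/prometheus/prompb"
)

func writeTimeSeries(ctx context.Context, client *http.Client, endpoint string, ts []*prompb.TimeSeries) error {
	// Serialize the samples into a WriteRequest and snappy-compress the body,
	// as the Prometheus remote write protocol requires.
	data, err := (&prompb.WriteRequest{Timeseries: ts}).Marshal()
	if err != nil {
		return err
	}
	compressed := snappy.Encode(nil, data)

	req, err := http.NewRequestWithContext(ctx, http.MethodPost, endpoint, bytes.NewReader(compressed))
	if err != nil {
		return err
	}
	req.Header.Set("Content-Encoding", "snappy")
	req.Header.Set("Content-Type", "application/x-protobuf")
	req.Header.Set("X-Prometheus-Remote-Write-Version", "0.1.0")

	resp, err := client.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()

	// Remote write endpoints respond with a 2xx status on success.
	if resp.StatusCode/100 != 2 {
		return fmt.Errorf("unexpected remote write response: %s", resp.Status)
	}
	return nil
}
```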

Msg("telemetry reporter scheduled")

// Fire off the first at start-up.
if err := discoverAndWriteMetrics(ctx, endpoint); err != nil {
@ecordell (Contributor) commented Apr 7, 2022:

should there be jitter here so a large cluster starting up doesn't thunderously herd the write endpoint?

@@ -369,6 +383,8 @@ func (c *completedServerConfig) Run(ctx context.Context) error {
g.Go(c.dashboardServer.ListenAndServe)
g.Go(stopOnCancel(c.dashboardServer.Close))

g.Go(func() error { return telemetry.ReportForever(ctx, c.telemetryEndpoint) })
Contributor:

I don't have a strong opinion here, but I like the model that the grpc/http servers follow, where if they're disabled the objects get replaced with "noop" servers that just log and stop.
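
A small sketch of that pattern applied here, with `telemetryReporter` as a hypothetical helper (import paths assumed from the repo layout): when no endpoint is configured, the errgroup gets a no-op function that just logs and returns, instead of branching at the call site.

```go
package server

import (
	"context"

	"github.com/rs/zerolog/log"

	"github.com/authzed/spicedb/internal/telemetry"
)

// telemetryReporter is illustrative, not the code in this PR: it returns the
// function handed to the errgroup, swapping in a no-op when telemetry is
// disabled, mirroring the noop grpc/http server pattern.
func telemetryReporter(endpoint string) func(ctx context.Context) error {
	if endpoint == "" {
		return func(ctx context.Context) error {
			log.Info().Msg("telemetry disabled, reporter is a no-op")
			return nil
		}
	}
	return func(ctx context.Context) error {
		return telemetry.ReportForever(ctx, endpoint)
	}
}
```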

Err(err).
Str("endpoint", endpoint).
Msg("failed to push telemetry metric")
nextPush = backoffInterval.NextBackOff()
Contributor:

what happens if the backoff interval becomes longer than the initial interval?

@jakedt (Member, PR author):

I don't understand the question.

Contributor:

The normal interval is 1hr; can we exponentially back off until the interval is 2hrs?

Don't we want a base frequency of 1hr regardless of backoff?
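
The `NextBackOff()` call above suggests something like cenkalti/backoff; sketched under that assumption, one way to keep the 1hr base cadence is to cap the backoff's maximum interval at the normal reporting interval, so a failing push never waits longer than a healthy one (values are illustrative):

```go
package telemetry

import (
	"time"

	"github.com/cenkalti/backoff/v4"
)

const reportingInterval = time.Hour

// newPushBackoff builds an exponential backoff whose wait never grows past
// the normal reporting interval, so failed pushes retry sooner but the
// steady-state cadence stays at one hour. Values here are illustrative.
func newPushBackoff() *backoff.ExponentialBackOff {
	b := backoff.NewExponentialBackOff()
	b.InitialInterval = time.Minute
	b.MaxInterval = reportingInterval // cap retries at the base cadence
	b.MaxElapsedTime = 0              // retry indefinitely
	b.Reset()
	return b
}
```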

internal/telemetry/metrics.go (outdated, resolved)
})
}

switch *fam.Type {
Contributor:

I think you can just use https://pkg.go.dev/github.com/prometheus/common/expfmt#ExtractSamples to avoid this, but I didn't test it

@jzelinskie (Member) commented Apr 7, 2022:

Looked into it and this looks usable. We'd iterate over those samples and create the prompb.
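
A sketch of that approach, assuming `ExtractSamples` keeps the signature linked above (the reviewer noted it was untested): flatten the gathered families into model samples, then map each one onto a `prompb.TimeSeries`, which avoids switching on the metric family type.

```go
package telemetry

import (
	"time"

	dto "github.com/prometheus/client_model/go"
	"github.com/prometheus/common/expfmt"
	"github.com/prometheus/common/model"
	"github.com/prometheus/prometheus/prompb"
)

// convertToTimeSeries flattens gathered metric families into remote-write
// time series via expfmt.ExtractSamples. Sketch only; not the code in this PR.
func convertToTimeSeries(families []*dto.MetricFamily, now time.Time) ([]*prompb.TimeSeries, error) {
	samples, err := expfmt.ExtractSamples(&expfmt.DecodeOptions{
		Timestamp: model.TimeFromUnixNano(now.UnixNano()),
	}, families...)
	if err != nil {
		return nil, err
	}

	out := make([]*prompb.TimeSeries, 0, len(samples))
	for _, sample := range samples {
		labels := make([]prompb.Label, 0, len(sample.Metric))
		for name, value := range sample.Metric {
			labels = append(labels, prompb.Label{Name: string(name), Value: string(value)})
		}
		out = append(out, &prompb.TimeSeries{
			Labels: labels,
			Samples: []prompb.Sample{{
				Value:     float64(sample.Value),
				Timestamp: int64(sample.Timestamp),
			}},
		})
	}
	return out, nil
}
```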

internal/telemetry/metrics.go (resolved)
"github.com/golang/snappy"
dto "github.com/prometheus/client_model/go"
"github.com/prometheus/common/model"
"github.com/prometheus/prometheus/prompb"
Member:

We can drop importing all of prometheus if we use this:

go.buf.build/protocolbuffers/go/prometheus/prometheus

@ecordell previously approved these changes Apr 8, 2022

@ecordell (Contributor) left a comment:

LGTM

@ecordell (Contributor) left a comment:

LGTM

just had one thing we might want to think about for the future


return func(ctx context.Context) error {
// Smear the startup delay out over 10% of the reporting interval
startupDelay := time.Duration(rand.Int63n(int64(interval.Seconds()/10))) * time.Second
Contributor:

Since that could be up to 10m after startup, maybe we should attempt to write on shutdown if we haven't written at all yet and the server has at least successfully started up. And actually, maybe we want separate telemetry on startup error rates.

I realize this is a fairly involved change for an uncommon edge case; maybe we just make an issue and do it later.
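
Roughly what that suggestion could look like, with `reportOnShutdown` as a hypothetical helper wired into server teardown; `discoverAndWriteMetrics` is the function this PR already adds:

```go
package telemetry

import (
	"context"
	"time"

	"github.com/rs/zerolog/log"
)

// reportOnShutdown sketches the idea above: if the server is shutting down
// before the smeared first report ever fired, make one best-effort push with
// a short timeout so short-lived deployments still report. Illustrative only.
func reportOnShutdown(endpoint string, hasReported bool) {
	if hasReported {
		return
	}
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()
	if err := discoverAndWriteMetrics(ctx, endpoint); err != nil {
		log.Warn().Err(err).Msg("final telemetry push on shutdown failed")
	}
}
```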

@jakedt dismissed jzelinskie's stale review April 11, 2022 13:50: "evan is the real reviewer"

@jakedt merged commit a13d8fc into main Apr 11, 2022
@jakedt deleted the telemetry-stats branch April 11, 2022 13:51
@github-actions bot locked and limited conversation to collaborators Apr 11, 2022