
Standardized OpenSLO to Markdown Support #146

Open

Maixy opened this issue Jun 29, 2022 · 2 comments
Labels
enhancement New feature or request

Comments


Maixy commented Jun 29, 2022

Problem to solve

OpenSLO definitions are great for programmatic interfaces but aren't well suited to human reading. As an SLO adopter, I want a way to synchronize information between my SLO Documents and my OpenSLO definitions. SLO Documents often include information that isn't as useful from a programmatic point of view, such as verbose descriptions, architecture diagrams, and data workflow diagrams.

This information is still critical to the SLO lifecycle for communicating with stakeholders and gaining alignment; it's just not useful as a core part of the programmatic specification.

Proposal

This still needs a bit of brainstorming, but I'd love a tool that can look at an OpenSLO definition and generate a basic SLO Document in markdown. Additional fields (architecture diagram images, for example) could be stored in metadata; a standard naming convention would help organize and generate the resulting markdown, as in the sketch below.
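For instance, a definition might carry document-only fields as metadata annotations. This is a rough sketch only: the docs/ annotation keys and the naming convention are made up for illustration and are not part of the spec.

apiVersion: openslo/v1
kind: SLO
metadata:
  name: web-availability
  displayName: Web Availability
  annotations:
    # Hypothetical doc-generation keys: a generator would inline the
    # referenced files into the generated markdown SLO Document.
    docs/long-description: docs/web-availability-intro.md
    docs/architecture-diagram: diagrams/web-architecture.png
    docs/data-workflow-diagram: diagrams/web-data-flow.png
spec:
  description: Availability of the public website
  service: web

A generator could then render the spec fields as tables and splice in the referenced files, keeping the markdown document and the definition from drifting apart.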

Too much flexibility here gets really close to a full-blown CMS, so we'll want to brainstorm ways to keep things simple while still providing sufficient value to SLO adopters.

Further details

This was discussed a bit in the OpenSLO Slack: https://openslo.slack.com/archives/C0202J83M3R/p1656536916703469

Links / references

Maixy added the enhancement label on Jun 29, 2022
proffalken (Collaborator) commented

One of the many conversations I've had over the past few days was related to this: we were talking about using some kind of Cucumber syntax to spit out OpenSLO definitions, based on the awesome formatting in one of the talks on day 2 of Monitorama [citation needed ;)].

The idea being that the following would translate into OpenSLO:

As a customer,
When I access the front page of the website,
I should see a response time of < 10ms

This could then generate an SLI for Prometheus (or similar) that uses count_over_time(http_response_time{service="frontend"} > 100) together with the appropriate total query, and from that build out the appropriate OpenSLO code, roughly as sketched below.
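The generated definition might look something like this. It's a sketch only: the layout follows the v1 ratioMetric shape, but the names, window, and threshold are illustrative, assuming response times are recorded in seconds (so the 10ms target from the scenario becomes 0.01).

apiVersion: openslo/v1
kind: SLI
metadata:
  name: frontend-response-time
spec:
  description: Front page responds in under 10ms
  ratioMetric:
    counter: true
    bad:
      metricSource:
        type: Prometheus
        spec:
          # Samples over the threshold, written as a PromQL subquery
          query: count_over_time((http_response_time{service="frontend"} > 0.01)[5m:])
    total:
      metricSource:
        type: Prometheus
        spec:
          # All samples in the same window
          query: count_over_time(http_response_time{service="frontend"}[5m])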

This is just one way; I'm sure there are many others, especially as this doesn't use the existing MD format!


bencompton commented May 28, 2023

While the OpenSLO specification is great for defining the implementation of SLOs, a high-level, human-readable explanation of the user journeys, etc. would definitely be useful. Also missing is a way to ensure that SLOs in production continue to work as expected as the code changes over time.

Piggybacking on the Cucumber idea, I think these goals could be accomplished by using Cucumber in concert with OpenSLO without necessarily modifying OSLO itself. The OSLO spec is a declarative definition of the behavior of SLOs, and the Cucumber spec would be the human-readable, executable specification for that behavior (as well as for the behavior of the instrumented code driving the SLOs). If the OpenSLO spec is updated without also updating the Cucumber spec, the affected Cucumber scenarios will fail. If a regression is introduced into the code that causes the OpenSLO calculations, alerts, etc. to stop working as expected, the affected Cucumber scenarios will fail. Any additions to the OpenSLO spec will require corresponding additions to the Cucumber spec (just as changes to other types of code require changes to their Cucumber specs).

Imagine a Cucumber spec like so (just shooting from the hip with a quick example):

...
...

Rule: When the error budget burn rate for the Home page latency SLOs exceeds 5% over 5 hours, an alert should be triggered

  Scenario: Error budget burn rate for Home page does not exceed 5% over the last 5 hours
    When the budget burn rate for the Home page has not exceeded 5% over the last 5 hours
    Then no alerts should be triggered

  Scenario: Error budget burn rate greater than or equal to 5% over the last 5 hours
    When the budget burn rate for the Home page has exceeded 5% over the last 5 hours
    Then an alert should be triggered

...
...
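On the OpenSLO side, the rule above maps onto something like the following AlertCondition. This is a sketch: the field values simply mirror the scenario, and the names and severity are invented.

apiVersion: openslo/v1
kind: AlertCondition
metadata:
  name: home-page-burn-rate        # hypothetical name
  displayName: Home page error budget burn rate
spec:
  description: Alert when the burn rate exceeds 5% over 5 hours
  severity: page
  condition:
    kind: burnrate
    op: gt
    threshold: 5
    lookbackWindow: 5h
    alertAfter: 5m

A Cucumber step implementation would then drive telemetry through the system and assert that this condition fires (or stays quiet) exactly as the scenarios describe.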

Then imagine the app uses OpenTelemetry along with OpenSLO and is architected for testability, such that dependency injection determines whether OTel data goes into either of the following (see the sketch after this list):

  • An open OSLO implementation with an in-memory data store that can be queried via an open standard OTel query language used in the OSLO data sources
  • Or the real production o11y system (e.g., NR, DD) that also supports this open standard OTel query language to run the queries in the OSLO data sources
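In OpenSLO terms, that swap could be modeled as two interchangeable DataSource objects that the test harness selects between. Again a sketch: the in-memory type is hypothetical (no such implementation exists yet), and the connection details are illustrative.

apiVersion: openslo/v1
kind: DataSource
metadata:
  name: otel-in-memory
spec:
  type: otel-in-memory             # hypothetical in-memory test double
  connectionDetails:
    endpoint: http://localhost:4317
---
apiVersion: openslo/v1
kind: DataSource
metadata:
  name: production-o11y
spec:
  type: datadog                    # the real production o11y backend
  connectionDetails:
    endpoint: https://api.datadoghq.com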

The Cucumber tests would exercise the code to create various scenarios, with the OTel data emitted into this in-memory data store, and would then execute the OSLO specs against this open OSLO implementation, with the Telemetry Query Language queries being executed, alerts being triggered, etc. The Cucumber specs would verify that the expected alerts, calculations, etc. work as expected in each scenario.

I can imagine also having tooling to measure coverage (i.e., how much of the OpenSLO specs are tested), as well as perhaps automatic generation of Cucumber rules / scenarios from OpenSLOs. More traditional code coverage tooling could be useful for tying code changes to SLOs, such as an automated deployment quality gate that checks whether the SLOs affected by the code changes being deployed still have sufficient error budget.

This would be difficult as of now because:

  • To my knowledge, an open standard query language for querying OTel data does not exist. TQL in otel-collector was renamed because it's a transformation language rather than a query language. There have been efforts to add a standardized query language to the OTel spec that haven't gotten far, although some recent work looks promising.
  • There is no open OSLO implementation that can run against an in-memory data store supporting this (non-existent) OpenTelemetry Query Language. There would need to be such an OSS project, ideally with language-specific implementations.

And of course, some downsides of this approach that come to mind would be:

  • Not as useful for testing composite SLOs that cross multiple service boundaries
  • Limits SLI choices to those that can be accomplished within the codebase (e.g., you can't really test SLIs based on load balancer logs, and would instead use traces from the application server)
  • The Cucumber spec would duplicate the knowledge in the OSLO spec, which many would consider not DRY. Of course, duplicating the knowledge of the code it is testing is what Cucumber always does; it's just that the code we're testing in this case is a declarative specification.
  • The OSS infrastructure for accomplishing this does not yet exist, although I'm sure the basic idea here could still be accomplished.
  • Legacy code not architected for testability would be cumbersome with this type of test (it would mean creating a ton of slow, flaky E2E tests that are a pain to maintain over time). The ideal codebase is one where the high-level API can be exercised with all I/O-bound dependencies replaced by non-I/O test doubles via dependency injection (or partial application, etc.), yielding tests that are fast, stable, and deterministic. Of course, I consider this a best practice anyway. ;)

TL;DR - SLOs as code = code that should be tested like any other code.
