Document logging architecture #2170
Merged
24 commits
65d3c02
Document logging architecture
QuentinBisson 89891e2
Add why loki and which logs are stored
QuentinBisson aa6839c
Update src/content/vintage/getting-started/observability/logging/arch…
QuentinBisson 3afa04d
Update src/content/vintage/getting-started/observability/logging/arch…
QuentinBisson 323347f
Update src/content/vintage/getting-started/observability/logging/arch…
QuentinBisson 4e211e6
Update src/content/vintage/getting-started/observability/logging/arch…
QuentinBisson 6bc6a6b
Update src/content/vintage/getting-started/observability/logging/arch…
QuentinBisson aae51ec
Update src/content/vintage/getting-started/observability/logging/arch…
QuentinBisson ba7c682
Update src/content/vintage/getting-started/observability/logging/arch…
QuentinBisson f10fd28
Add architecture diagram
QuentinBisson a9692fa
Add architecture diagram explaination
QuentinBisson 1298db6
Update src/content/vintage/getting-started/observability/logging/arch…
QuentinBisson c51f772
Update src/content/vintage/getting-started/observability/logging/arch…
QuentinBisson 876c534
Update src/content/vintage/getting-started/observability/logging/arch…
QuentinBisson b05a82d
Update src/content/vintage/getting-started/observability/logging/arch…
QuentinBisson 2c869d1
Update src/content/vintage/getting-started/observability/logging/arch…
QuentinBisson 5e50bce
Update src/content/vintage/getting-started/observability/logging/arch…
QuentinBisson 14f1457
Update src/content/vintage/getting-started/observability/logging/arch…
QuentinBisson f007ec6
Update src/content/vintage/getting-started/observability/logging/arch…
QuentinBisson 91b1a46
Update src/content/vintage/getting-started/observability/logging/arch…
QuentinBisson c7b4f52
Update src/content/vintage/getting-started/observability/logging/arch…
QuentinBisson 998b987
Update src/content/vintage/getting-started/observability/logging/arch…
QuentinBisson ee68f43
Update src/content/vintage/getting-started/observability/logging/arch…
QuentinBisson 803ebaf
Update src/content/vintage/getting-started/observability/logging/arch…
QuentinBisson
2 changes: 1 addition & 1 deletion
src/content/vintage/getting-started/observability/logging/_index.md
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,16 +1,16 @@ | ||
--- | ||
linkTitle: Logging | ||
title: Logging | ||
description: A series of guides explaining how to interact with logs accessible within Giant Swarm clusters. | ||
weight: 30 | ||
menu: | ||
main: | ||
identifier: getting-started-observability-logging | ||
parent: getting-started-observability | ||
owner: | ||
- https://github.com/orgs/giantswarm/teams/team-atlas | ||
-last_review_date: 2024-02-28 | ||
+last_review_date: 2024-03-21 | ||
aliases: | ||
- /getting-started/observability/logging | ||
- /ui-api/observability/logs/ | ||
--- | ||
72 changes: 72 additions & 0 deletions
src/content/vintage/getting-started/observability/logging/architecture/index.md
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,72 @@ | ||
--- | ||
linkTitle: Logging architecture | ||
title: Logging architecture | ||
description: Documentation on the logging architecture deployed and maintained by Giant Swarm. | ||
weight: 80 | ||
menu: | ||
main: | ||
identifier: getting-started-observability-logging-architecture | ||
parent: getting-started-observability-logging | ||
user_questions: | ||
- What is the logging architecture? | ||
- Why is Giant Swarm using Loki? | ||
- Why is Giant Swarm recommending Loki? | ||
- Which logs are stored by Giant Swarm? | ||
- Where are the logs stored by Giant Swarm? | ||
aliases: | ||
- /getting-started/observability/logging/architecture | ||
owner: | ||
- https://github.com/orgs/giantswarm/teams/team-atlas | ||
last_review_date: 2024-03-21 | ||
--- | ||
|
||
Logging is an important pillar of observability and it is thus only natural that Giant Swarm provides and manages a logging solution for operational purposes. | ||
|
||
This document gives an overview of how logging is managed by Giant Swarm, including which logs are stored, which tools we use to ship and store them, as well as why we chose those tools in the first place. | ||
|
||
## Overview of the logging platform | ||
|
||
Here is an architecture diagram of our current logging platform: | ||
|
||
![Logging pipeline architecture overview](logging-architecture.png) | ||
<!-- Source: https://drive.google.com/file/d/1Gzl0mTdJcaui_zIC9QuHcgMX3QJygALo --> | ||
|
||
In this diagram, you can see that we run the following tools in each management cluster as part of our logging platform: | ||
|
||
- `Grafana Loki`, which is accessible through our managed Grafana instance. | ||
- `multi-tenant-proxy`, a proxy component used to handle multi-tenancy for Loki. | ||
- A couple of logging agents (`Grafana Promtail` and `Grafana Agent`) that run on the management cluster and your workload clusters alike. We currently need two different tools for different purposes. | ||
- Promtail is used to retrieve the container logs and the Kubernetes audit logs. | ||
- Grafana Agent is used to retrieve the Kubernetes events. | ||
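To make the agent-to-Loki hop in the diagram more concrete, here is a minimal Python sketch of the kind of request a logging agent sends. The push path `/loki/api/v1/push`, the JSON `streams` layout and the `X-Scope-OrgID` tenant header come from Loki's HTTP API. The tenant name, the label values and the helper name `build_push_request` are assumptions made for illustration, and the sketch says nothing about how `multi-tenant-proxy` is actually configured.

```python
import json
import time


def build_push_request(tenant, labels, lines, now_ns=None):
    """Return (path, headers, body) for a Loki push request.

    Hypothetical helper: it only builds the request, it does not send it.
    """
    if now_ns is None:
        now_ns = time.time_ns()
    # Loki expects each entry as [<unix epoch in nanoseconds, as a string>, <log line>].
    values = [[str(now_ns + i), line] for i, line in enumerate(lines)]
    body = {"streams": [{"stream": labels, "values": values}]}
    headers = {
        "Content-Type": "application/json",
        # Loki reads the tenant from this header when multi-tenancy is enabled.
        "X-Scope-OrgID": tenant,
    }
    return "/loki/api/v1/push", headers, json.dumps(body)


path, headers, body = build_push_request(
    tenant="example-tenant",  # assumed tenant name
    labels={"namespace": "kube-system", "pod": "coredns-abc"},  # assumed labels
    lines=["first line", "second line"],
    now_ns=1_700_000_000_000_000_000,
)
print(path, headers["X-Scope-OrgID"])
```

Keeping the tenant in a header rather than in the payload is what lets a proxy in front of Loki decide which tenant a request belongs to without parsing the log lines.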
|
||
If you want to try Loki yourself, check out our guides explaining [how to access Grafana]({{< relref "/vintage/getting-started/observability/visualization/access" >}}) and how to [explore logs with LogQL]({{< relref "/vintage/getting-started/observability/visualization/log-exploration" >}}). | ||
|
||
## Logs stored by Giant Swarm | ||
|
||
Kubernetes clusters produce a vast amount of machine and container logs. | ||
|
||
The logging agents that we have deployed on management and workload clusters currently send the following logs to Loki: | ||
|
||
- Kubernetes Pod logs from the `kube-system` and `giantswarm` namespaces. | ||
- Kubernetes Events created in the `kube-system` and `giantswarm` namespaces. | ||
- [Kubernetes audit logs]({{< relref "./audit-logs#kubernetes-audit-logs" >}}) | ||
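As a rough illustration of how the namespace scope above translates into a query, the following Python sketch builds a LogQL stream selector for the two namespaces. The label name `namespace` is an assumption (it depends on how the agents relabel logs), and `namespace_selector` is a hypothetical helper, not part of any Giant Swarm or Grafana tooling.

```python
def namespace_selector(namespaces, label="namespace"):
    """Build a LogQL stream selector matching any of the given namespaces.

    Assumes the names are plain Kubernetes namespace names, so no regex
    escaping is applied.
    """
    if not namespaces:
        raise ValueError("at least one namespace is required")
    # =~ is LogQL's regex label matcher; | separates the alternatives.
    pattern = "|".join(namespaces)
    return f'{{{label}=~"{pattern}"}}'


selector = namespace_selector(["kube-system", "giantswarm"])
print(selector)
# A line filter can be appended to narrow the result to matching lines.
print(selector + ' |= "error"')
```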
|
||
In the future, we will also store the following logs: | ||
|
||
- [Machine (Node) audit logs]({{< relref "./audit-logs#machine-audit-logs" >}}) | ||
- Teleport audit logs, tracked in https://github.com/giantswarm/roadmap/issues/3250 | ||
- Giant Swarm customer workload logs as part of our observability platform, tracked in https://github.com/giantswarm/roadmap/issues/2771 | ||
|
||
## Why we prefer Loki over its competitors | ||
|
||
There are numerous reasons to use Grafana Loki in favor of its competitors. | ||
|
||
First, we are **strong believers in Open Source**, so the full Elastic stack is obviously out of the question. | ||
|
||
Second, we are quite used to the Grafana ecosystem, where the **individual tools are made to work with one another without requiring a closed ecosystem**. Alternative logging solutions are either intended to work in isolation (like OpenDistro) or need to use a full-fledged solution (i.e. being able to collect and correlate all observability data), which is rarely open-source (coming back to the first point above). | ||
|
||
Third, we are full-fledged users of Prometheus and PromQL. **LogQL, the Loki Query Language, is a natural extension to PromQL**, which makes it easy for our platform engineers to use and love. | ||
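The similarity between the two languages can be sketched in a few lines of Python: the same aggregation template accepts either a PromQL metric or a LogQL stream selector. The metric name `http_requests_total`, the `namespace` label and the helper name `rate_by_namespace` are assumptions chosen for illustration.

```python
def rate_by_namespace(source, window="5m"):
    """Wrap a PromQL metric name or a LogQL stream selector in the same aggregation."""
    return f"sum by (namespace) (rate({source}[{window}]))"


# PromQL: per-namespace rate of a (hypothetical) counter metric.
promql = rate_by_namespace("http_requests_total")
# LogQL: per-namespace rate of log lines for an (assumed) stream selector.
logql = rate_by_namespace('{namespace="kube-system"}')

print(promql)
print(logql)
```

Only the innermost expression differs: PromQL selects a metric, LogQL selects log streams, and the range vector, `rate` and `sum by` around it are written the same way.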
|
||
The fourth reason is **cost and resource consumption**. Loki is cheaper to run than its competitors because it relies less on persistent storage and uses object storage instead, which is typically cheaper in the cloud. The index is also cheaper to store because Loki uses label-based indexing, which produces a much smaller index than the text-based indexing used by full-text search engines. | ||
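The difference in index size can be illustrated with a toy Python sketch. It is a simplified model and not how Loki or any search engine is implemented: the label index gets one entry per distinct label set, while the inverted full-text index gets one entry per distinct token. The sample log lines and labels are made up.

```python
from collections import defaultdict

logs = [
    ({"namespace": "kube-system", "app": "coredns"}, "query A example.com NOERROR"),
    ({"namespace": "kube-system", "app": "coredns"}, "query AAAA example.org NXDOMAIN"),
    ({"namespace": "giantswarm", "app": "operator"}, "reconciled cluster abc in 120ms"),
    ({"namespace": "giantswarm", "app": "operator"}, "reconciled cluster xyz in 95ms"),
]

# Label-based index: one entry per distinct label set (stream).
label_index = defaultdict(list)
# Full-text inverted index: one entry per distinct token in the log lines.
text_index = defaultdict(set)

for position, (labels, line) in enumerate(logs):
    stream = tuple(sorted(labels.items()))
    label_index[stream].append(position)
    for token in line.split():
        text_index[token].add(position)

print(len(label_index), "label index entries")
print(len(text_index), "full-text index entries")
```

In this model the label index grows with the number of streams, whereas the full-text index grows with the vocabulary of the log lines, which is usually far larger.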
|
||
Finally, the last reason comes from the history of Giant Swarm and mostly boils down to **operation and maintenance**. Before we decided to run Loki, we ran Elasticsearch as our logging solution. Elasticsearch is really hard to operate, especially at scale, and even more so on Kubernetes because it is by nature a stateful application (and for good reasons). This was an especially important factor in our decision since we do not need the full capabilities of OpenDistro, such as full-text search. |
Binary file added
BIN
+193 KB
...age/getting-started/observability/logging/architecture/logging-architecture.png
4 changes: 2 additions & 2 deletions
src/content/vintage/getting-started/observability/logging/audit-logs/index.md
This doesn't tell the customer much about why it's helpful for them. What are operational purposes? Does that mean they or only we can make use of it? I'd recommend shortening this intro, including customer-relevant points and only describing what's important to them. I assume that customers can make use of the MC Grafana instance (see the existing article "How to access Grafana", which btw has a misspelled title: it's/its) and therefore we should strongly hint here why the architecture is important for customers (e.g. long-term storage, persistence, easy access via Grafana UI, ...) but not go into extensive detail or opinions ("the full Elastic stack is obviously out of the question").
@pipo02mix I'm very unclear what was discussed around vintage docs. Shouldn't we start from scratch rather than editing existing docs? If so, we'd better revamp the whole thing and go through the "battle card" preparation before writing full articles. Since I'm unclear about this, I'd like to have it clarified first and will stop my review here for now.
I may be biased but operational purposes is quite straightforward to me. We need it to operate our platform and those 2 sentences have the same meanings.
I am not talking about the customer use yet here, only that they can access it via Grafana.
Now, maybe the why we use Loki instead of Grafana could be somewhere else, I wrote this under the after some discussions with @carillan81 and @pipo02mix .
The goal of this doc is to show what we have and show our expertise but maybe the battlecard approach was missed.
Regarding the structure, I put this under vintage for now because of the full revamp being done, and I have no issues moving it and refining the content later on with said battlecard. In the meantime, I think this type of doc is useful and needs to be there. I could also be wrong, but it's still better than having nothing at the moment.
Agreed – I was writing my direct thoughts to challenge and make everyone think about whether everything still fits what we have here. We should check from scratch which pieces we put where, and if we can make it all shorter to increase the chances of these articles being read. For example, does it fit into the introductory, instructional or rather advanced parts of our docs, etc.?
I'm always happy with any feedback :)