
Fix race condition when stopping to tail files #10631

Merged · 3 commits · Sep 19, 2023

Conversation


@thampiotr thampiotr commented Sep 18, 2023

What this PR does / why we need it:

Users are getting `PromtailFileMissing` alerts, which are defined as:

```
promtail_file_bytes_total{namespace!=""} unless promtail_read_bytes_total{namespace!=""}
```

This appears to be due to a race condition where a file can be deleted while the file target is being stopped.

Special notes for your reviewer:

Checklist

  • Reviewed the CONTRIBUTING.md guide (required)
  • Documentation added
  • Tests updated
  • CHANGELOG.md updated
    • If the change is worth mentioning in the release notes, add the add-to-release-notes label
  • Changes that require user attention or interaction to upgrade are documented in docs/sources/setup/upgrade/_index.md
  • For Helm chart changes bump the Helm chart version in production/helm/loki/Chart.yaml and update production/helm/loki/CHANGELOG.md and production/helm/loki/README.md. Example PR


@thampiotr thampiotr force-pushed the thampiotr/fix-promtail-tailer-race-condition branch from d8df742 to 91d079e on September 18, 2023 at 14:24

Verified: this commit was signed with the committer's verified signature (chaudum, Christian Haudum).
@thampiotr thampiotr force-pushed the thampiotr/fix-promtail-tailer-race-condition branch from 91d079e to 77ade31 on September 18, 2023 at 14:25
Comment on lines +303 to +316
```go
// Concurrently stop the target and remove the file
wg := sync.WaitGroup{}
wg.Add(2)
go func() {
	sleepRandomDuration(time.Millisecond * 10)
	target.Stop()
	wg.Done()
}()
go func() {
	sleepRandomDuration(time.Millisecond * 10)
	_ = os.Remove(logFile)
	wg.Done()
}()
```
thampiotr (Contributor, Author) commented:
Concurrently stopping the target and removing the file would lead to the race condition, likely the cause of users' reports.

See other comments for the details of the race condition.

Comment on lines +324 to +327
```go
requireEventually(t, func() bool {
	return testutil.CollectAndCount(registry, "promtail_file_bytes_total") == 0
}, "expected file_bytes_total metric to be cleaned up")
}
```
thampiotr (Contributor, Author) commented:

This test would fail before on this assertion - the file_bytes_total was not correctly unregistered.

@@ -149,7 +150,6 @@ func (t *tailer) readLines() {
// This function runs in a goroutine, if it exits this tailer will never do any more tailing.
// Clean everything up.
defer func() {
t.cleanupMetrics()
thampiotr (Contributor, Author) commented:

If we clean up metrics here, there is still a possibility that the updatePositions goroutine runs t.MarkPositionAndSize() and re-instates the totalBytes metric, even after the file has already been deleted.

Moving this cleanupMetrics() call into the defer of updatePositions() ensures there is no race condition.

@@ -208,12 +208,13 @@ func (t *tailer) MarkPositionAndSize() error {
}
return err
}
t.metrics.totalBytes.WithLabelValues(t.path).Set(float64(size))
thampiotr (Contributor, Author) commented:

If the file is deleted while this line of code is executed, t.tail.Tell() can fail with an error, so we end up with only the totalBytes metric set, but not the readBytes metric. As an additional measure, I propose updating both metrics only once both values have been correctly obtained.

Comment on lines +264 to +266
```go
iterations := 500
fakeHandler := make(chan fileTargetEvent, 10*iterations)
for i := 0; i < iterations; i++ {
```
thampiotr (Contributor, Author) commented:

This test takes about 10s and has a good chance of catching the issue, but to catch the problem reliably I had to run 5k+ iterations, which takes 2-3 minutes.

@thampiotr thampiotr marked this pull request as ready for review September 18, 2023 15:15
@thampiotr thampiotr requested a review from a team as a code owner September 18, 2023 15:15
@rfratto rfratto (Member) left a comment:

Makes sense to me, personally. Since these lines are sensitive to bugs if moved, I think we should add some comments explaining why they're being called where they are so nobody moves them around again without considering the race condition.

(This will still need an approval and merge from a Loki maintainer)

@MasslessParticle MasslessParticle (Contributor) left a comment:

LGTM

@tpaschalis tpaschalis (Member) left a comment:

LGTM 🙌

@dannykopping dannykopping merged commit a8d5815 into main Sep 19, 2023
@dannykopping dannykopping deleted the thampiotr/fix-promtail-tailer-race-condition branch September 19, 2023 10:41
rhnasc pushed a commit to inloco/loki that referenced this pull request Apr 12, 2024

5 participants