-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix race condition when stopping to tail files #10631
Fix race condition when stopping to tail files #10631
Conversation
d8df742
to
91d079e
Compare
91d079e
to
77ade31
Compare
// Concurrently stop the target and remove the file | ||
wg := sync.WaitGroup{} | ||
wg.Add(2) | ||
go func() { | ||
sleepRandomDuration(time.Millisecond * 10) | ||
target.Stop() | ||
wg.Done() | ||
|
||
}() | ||
go func() { | ||
sleepRandomDuration(time.Millisecond * 10) | ||
_ = os.Remove(logFile) | ||
wg.Done() | ||
}() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Concurrently stopping the target and removing the file would lead to the race condition, likely the cause of users' reports.
See other comments for the details of the race condition.
requireEventually(t, func() bool { | ||
return testutil.CollectAndCount(registry, "promtail_file_bytes_total") == 0 | ||
}, "expected file_bytes_total metric to be cleaned up") | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test would fail before on this assertion - the file_bytes_total
was not correctly unregistered.
@@ -149,7 +150,6 @@ func (t *tailer) readLines() { | |||
// This function runs in a goroutine, if it exits this tailer will never do any more tailing. | |||
// Clean everything up. | |||
defer func() { | |||
t.cleanupMetrics() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we clean up metrics here, there is still possibility that the updatePositions
goroutine is going to run the t.MarkPositionAndSize()
and re-instate the totalBytes
metric, even when the file is already deleted.
Moving this cleanupMetrics()
to defer
of the updatePositions()
should ensure there is no race condition.
@@ -208,12 +208,13 @@ func (t *tailer) MarkPositionAndSize() error { | |||
} | |||
return err | |||
} | |||
t.metrics.totalBytes.WithLabelValues(t.path).Set(float64(size)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the file is deleted as this line of code is executed, the t.tail.Tell()
can fail with an error, so we end up with only the totalBytes
metric set, but not the readBytes
metric set. As an additional measure, I propose updating both metrics only when they have both been correctly obtained.
iterations := 500 | ||
fakeHandler := make(chan fileTargetEvent, 10*iterations) | ||
for i := 0; i < iterations; i++ { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test will take about 10s and has a good chance of catching the issue, but to catch this problem reliably I had to run 5k+ iterations which takes 2-3min.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense to me, personally. Since these lines are sensitive to bugs if moved, I think we should add some comments explaining why they're being called where they are so nobody moves them around again without considering the race condition.
(This will still need an approval and merge from a Loki maintainer)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM 🙌
**What this PR does / why we need it**: Users are getting `PromtailFileMissing` alerts, which are defined as: ``` promtail_file_bytes_total{namespace!=""} unless promtail_read_bytes_total{namespace!=""} ``` This appears to be due to a race condition where a file can be deleted while the file target is being stopped.
What this PR does / why we need it:
Users are getting
PromtailFileMissing
alerts, which are defined as:This appears to be due to a race condition where a file can be deleted while the file target is being stopped.
Special notes for your reviewer:
Checklist
CONTRIBUTING.md
guide (required)CHANGELOG.md
updatedadd-to-release-notes
labeldocs/sources/setup/upgrade/_index.md
production/helm/loki/Chart.yaml
and updateproduction/helm/loki/CHANGELOG.md
andproduction/helm/loki/README.md
. Example PR