Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(agent): output buffer persistence #15221

Draft
wants to merge 5 commits into
base: master
Choose a base branch
from

Conversation

DStrand1
Copy link
Contributor

Summary

Implements the write-through buffer persistence strategy detailed in the spec added in #14928.

Currently this PR is mostly a draft of separating the output buffer into multiple implementations, as well as experimentation with a WAL file library. Largest outstanding issue is around metric serialization to []byte.

  • Using the influx parser+serializer causes a cyclical import, but also drops the metric value type field. However this is probably the cleanest option to investigate since it would make the WAL files easy to re-import
  • Another suggestion was to use encoding/gob, which is what the PR is currently doing. However this has issues with un-exported fields, so I need to look more into how to work around this

Checklist

  • No AI generated code was used in this PR

Related issues

Related to #802, #14805

@telegraf-tiger telegraf-tiger bot added the feat Improvement on an existing feature such as adding a new setting/mode to an existing plugin label Apr 24, 2024
Copy link
Contributor

@srebhan srebhan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @DStrand1 for this draft. It looks quite nice overall and I left some suggestions in the code.

What concerns me a bit is the underlying WAL implementation. Looking through their code I discovered some fundamental flaws.

  1. The WAL file is not opened with SYNC, so any issue on disk or with Telegraf will loose more metrics than expected (in the worst case with a lot of RAM) all metrics are lost...
  2. The filenames are saved with the pattern <user defined name>.<index>.<offset> and are string-sorted, so if the index or offset cross the order-boundary (i.e. a digit is added 9 -> 10) the order is messed up as lala.10.xyz follows lala.1.xyz instead of lala.9.xyz! This will in turn mess up the metric order.
  3. The WAL implementation does not care about removing files, so if you got WAL file(s) and restart Telegraf multiple time the metrics in the WAL will be written multiple times without further handling. Doing this handling outside of the WAL implementation is hard as you cannot know which file was processed.
  4. In the current form the WAL implementation is not capable of truncate the files front-to-back, so metrics are prone to be sent multiple times if the file was not completely flushed...

There are more (smaller) issues, but looking at the issues above I think you should go for another WAL library that is more mature. I looked at https://github.com/tidwall/wal in the past and think it can do what we need. Not a must, just a suggestion. ;-)

config/config.go Outdated Show resolved Hide resolved
models/buffer.go Outdated Show resolved Hide resolved
models/buffer.go Outdated Show resolved Hide resolved
models/buffer_disk.go Outdated Show resolved Hide resolved
models/buffer_disk.go Outdated Show resolved Hide resolved
models/buffer_mem.go Outdated Show resolved Hide resolved
models/buffer.go Outdated Show resolved Hide resolved
@srebhan srebhan self-assigned this Apr 25, 2024
@powersj
Copy link
Contributor

powersj commented Apr 25, 2024

@DStrand1,

Thanks for the PR, super excided to see this. To capture what we talked about in pairs; specifically, to have you address some of the initial comments and switch to the proposed WAL library.

Let's plan to talk through where you are at or any issues with the new library in Monday's pairs.

Thanks!

@DStrand1 DStrand1 requested a review from srebhan April 30, 2024 16:01
Copy link
Contributor

@srebhan srebhan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @DStrand1 for the nice update! My only concern (despite test-coverage) is the error-handling. IMO we should pass errors up to be able to log them e.g. if we are running out-of-disk. For unrecoverable errors (e.g. corrupt WAL files, non-serializable metrics, etc) we probably should panic, but everything else should propagate up to the plugin level (at least).

Comment on lines 65 to 73
func (b *DiskBuffer) addSingle(metric telegraf.Metric) bool {
err := b.walFile.Write(b.writeIndex(), b.metricToBytes(metric))
metric.Accept()
if err == nil {
b.metricAdded()
return true
}
return false
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it really worth the code duplication or should we use the batch-interface instead?

return index
}

func (b *DiskBuffer) Add(metrics ...telegraf.Metric) int {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe we should return an error here to log it or do other mitigation...

return written
}

func (b *DiskBuffer) Batch(batchSize int) []telegraf.Metric {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Return the error.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feat Improvement on an existing feature such as adding a new setting/mode to an existing plugin
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants