Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[pkg/stanza] Configurable behaviour of recombine operator when entries don't match is_first_entry #31653

Open
kkujawa-sumo opened this issue Mar 8, 2024 · 12 comments
Assignees
Labels
enhancement New feature or request pkg/stanza

Comments

@kkujawa-sumo
Copy link
Contributor

kkujawa-sumo commented Mar 8, 2024

Component(s)

pkg/stanza

Is your feature request related to a problem? Please describe.

The change introduced in #30797 causes different behaviour when entries don't match is_first_entry, please see this line.
The previous behaviour was convenient, user could have one pipeline for multiline logs and non-multiline logs, non-multiline were unchanged (with the consequences that in some circumstances multiline logs could be sent without recombiniation), now non-multiline logs are merged.

Describe the solution you'd like

Configurable behaviour of recombine operator when entries don't match is_first_entry, possible options: merged, unchanged.

Describe alternatives you've considered

Two pipelines for logs collection, one for multiline logs, another for non-multiline logs but it will require some awareness about logs format and selection of the pipeline for log file.

Additional context

No response

@kkujawa-sumo kkujawa-sumo added enhancement New feature or request needs triage New item requiring triage labels Mar 8, 2024
Copy link
Contributor

github-actions bot commented Mar 8, 2024

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@sumo-drosiek
Copy link
Member

Previous behavior was intentional: open-telemetry/opentelemetry-log-collection#415

@andrzej-stencel
Copy link
Member

andrzej-stencel commented Mar 8, 2024

This request might sound weird in the general, abstract sense of how the recombine operator should work. After all, it should be simple: given an incoming log, if it matches is_first_line, then it starts a new output log; if it does not match is_first_line, it is added to the previous one.

The world is not as abstract and simple. As linked by @sumo-drosiek in the above comment, there is a specific scenario where different behavior might be desirable. Users that run the Otel collector in Kubernetes clusters and use the Filelog receiver with the recombine operator to collect multiline logs, might want to not merge logs together if there wasn't a "first line" to trigger the merging.

I'm in favor of making this behavior configurable. It would be good to come up with a name for this configuration option that is descriptive, compelling and short enough at the same time. Brainstorming some proposals:

  • before_first_line: merge|nomerge - this nicely underlines the relation of this option with the is_first_line option
  • eager: true|false - this is short, but I'm not sure if it's unequivocal enough. Perhaps with a good description it would be fine.
  • merge_unmatched: true|false - this is a bit misleading, because the option is only about merging or not merging of the entries that came before the first "first line" match.

Happy to hear others' opinions.

@djaglowski
Copy link
Member

Apologies for overlooking open-telemetry/opentelemetry-log-collection#415. I was working to clean up this operator and couldn't find a rationale for the behavior.

Reviewing the original issue, it seems the justification was basically to handle an error scenario more gracefully.

However, I think this is an undesirable behavior in another case, where the file just happens to start with some non-matching lines (maybe was rotated partway through writing a multiline log) but still should be recombined up until the first match.

user could have one pipeline for multiline logs and non-multiline logs

This is a clever use of the behavior but not one which the operator was ever intended to support. That said, I can appreciate how most of the parsers and transformers have an if option to evaluate against each individual entry whether to be skipped, while the recombine operator would necessarily rely on statefulness to do something similar.


I'm generally in support of supporting all of the above use cases while making sure the default behavior is intuitive and safe.

What do you think about the following proposal, @kkujawa-sumo, @sumo-drosiek, @astencel-sumo?

Instead of a flag which switches between the previous and current behavior (i.e. all or nothing), we add a setting, called something like max_unmatched_batch_size. This would behave similarly to max_batch_size but would only apply when using is_first_entry until a first entry has been matched.

  • A value of 0 would recombine all lines until is_first_entry is matched (current behavior).
  • A value of 1 would force every line to emit separately until is_first_entry is matched (previous behavior).
  • A higher value would recombine but still protects against joining the entire file as described in original issue.
  • The default should IMO be something like 10 or 100 so that the operator does what it says by default, and is also safe by default.

@andrzej-stencel
Copy link
Member

Thanks @djaglowski for digging into this. Your proposal of max_unmatched_batch_size is acceptable to me, happy to hear others' thoughts. We'd just need to decide on the default value. 😄

One more question to pin down the behavior. Is the max_unmatched_batch_size property applied only once, or does it apply until the is_first_entry is matched? For example, assuming user configures max_unmatched_batch_size to 10 and the file has a match on is_first_entry expression on line 51, should the operator output first 50 lines in five 10-line entries, or should it output one 10-line entry and one 40-line entry?

@andrzej-stencel andrzej-stencel removed the needs triage New item requiring triage label Mar 12, 2024
@djaglowski
Copy link
Member

One more question to pin down the behavior. Is the max_unmatched_batch_size property applied only once, or does it apply until the is_first_entry is matched? For example, assuming user configures max_unmatched_batch_size to 10 and the file has a match on is_first_entry expression on line 51, should the operator output first 50 lines in five 10-line entries, or should it output one 10-line entry and one 40-line entry?

I was imagining it would apply repeatedly, until is_first_entry is matched.

@kkujawa-sumo
Copy link
Contributor Author

The proposition with max_unmatched_batch_size resolves the problem described in the issue 😄

If I understand correctly we will see following behavior of the max_unmatched_batch_size:

case 1

is_first_entry = test1

max_unmatched_batch_size = 0

input:


test2
test3
test4
test5

output:

test2\ntest3\ntest4\ntest5

case 2

is_first_entry = test1

max_unmatched_batch_size = 1


input:

test2
test3
test4
test5

output:

test2
test3
test4
test5

case 3

is_first_entry = test1

max_unmatched_batch_size = 2


input:

test2
test3
test4
test5
test6
test1
test7
test8

output:

test2\ntest3
test4\ntest5
test6
test1\ntest7\ntest8

@h0cheung
Copy link
Contributor

Apologies for overlooking open-telemetry/opentelemetry-log-collection#415. I was working to clean up this operator and couldn't find a rationale for the behavior.

Reviewing the original issue, it seems the justification was basically to handle an error scenario more gracefully.

However, I think this is an undesirable behavior in another case, where the file just happens to start with some non-matching lines (maybe was rotated partway through writing a multiline log) but still should be recombined up until the first match.

user could have one pipeline for multiline logs and non-multiline logs

This is a clever use of the behavior but not one which the operator was ever intended to support. That said, I can appreciate how most of the parsers and transformers have an if option to evaluate against each individual entry whether to be skipped, while the recombine operator would necessarily rely on statefulness to do something similar.

I'm generally in support of supporting all of the above use cases while making sure the default behavior is intuitive and safe.

What do you think about the following proposal, @kkujawa-sumo, @sumo-drosiek, @astencel-sumo?

Instead of a flag which switches between the previous and current behavior (i.e. all or nothing), we add a setting, called something like max_unmatched_batch_size. This would behave similarly to max_batch_size but would only apply when using is_first_entry until a first entry has been matched.

  • A value of 0 would recombine all lines until is_first_entry is matched (current behavior).
  • A value of 1 would force every line to emit separately until is_first_entry is matched (previous behavior).
  • A higher value would recombine but still protects against joining the entire file as described in original issue.
  • The default should IMO be something like 10 or 100 so that the operator does what it says by default, and is also safe by default.

When it reaches max_unmatched_batch_size but still haven't been matched, is it a good idea to output the original single-line logs separately?

@kkujawa-sumo
Copy link
Contributor Author

When it reaches max_unmatched_batch_size but still haven't been matched, is it a good idea to output the original single-line logs separately?

@h0cheung It depends, some users would like to use one pipeline for different formats of logs.
The proposed concept tries to compromise different approaches.
If somebody don't want see separated lines until the the first entry appears then somebody can use max_unmatched_batch_size=0

@kkujawa-sumo
Copy link
Contributor Author

You can assign me this issue ;)

djaglowski pushed a commit that referenced this issue Apr 3, 2024
…ize to recombine operator (#32144)

**Description:** Add a new `max_unmatched_batch_size` config parameter
to configure the maximum number of consecutive entries that will be
combined into a single entry before the match occurs

**Link to tracking Issue:**
#31653

**Testing:** unit tests, manual tests

**Documentation:** Add description of the new config option
@djaglowski
Copy link
Member

Closed by #32144

@djaglowski djaglowski reopened this Apr 3, 2024
@djaglowski
Copy link
Member

Reopened due to #32154

ycombinator pushed a commit to ycombinator/opentelemetry-collector-contrib that referenced this issue Apr 9, 2024
…ize to recombine operator (open-telemetry#32144)

**Description:** Add a new `max_unmatched_batch_size` config parameter
to configure the maximum number of consecutive entries that will be
combined into a single entry before the match occurs

**Link to tracking Issue:**
open-telemetry#31653

**Testing:** unit tests, manual tests

**Documentation:** Add description of the new config option
andrzej-stencel pushed a commit that referenced this issue Apr 11, 2024
…ize to recombine operator (#32168)

**Description:** Add a new max_unmatched_batch_size config parameter to
configure the maximum number of consecutive entries that will be
combined into a single entry before the match occurs

**Link to tracking Issue:**
#31653

**Testing:** unit tests, manual tests

**Documentation:** Add description of the new config option

Changes from
#32144
with improvements in tests
andrzej-stencel pushed a commit that referenced this issue Apr 16, 2024
…rations with max_unmatched_batch_size (#32347)

**Description:** add example configurations for recombine operator with
`max_unmatched_batch_size`

**Link to tracking Issue:** #31653
rimitchell pushed a commit to rimitchell/opentelemetry-collector-contrib that referenced this issue May 8, 2024
…ize to recombine operator (open-telemetry#32144)

**Description:** Add a new `max_unmatched_batch_size` config parameter
to configure the maximum number of consecutive entries that will be
combined into a single entry before the match occurs

**Link to tracking Issue:**
open-telemetry#31653

**Testing:** unit tests, manual tests

**Documentation:** Add description of the new config option
rimitchell pushed a commit to rimitchell/opentelemetry-collector-contrib that referenced this issue May 8, 2024
…ize to recombine operator (open-telemetry#32168)

**Description:** Add a new max_unmatched_batch_size config parameter to
configure the maximum number of consecutive entries that will be
combined into a single entry before the match occurs

**Link to tracking Issue:**
open-telemetry#31653

**Testing:** unit tests, manual tests

**Documentation:** Add description of the new config option

Changes from
open-telemetry#32144
with improvements in tests
rimitchell pushed a commit to rimitchell/opentelemetry-collector-contrib that referenced this issue May 8, 2024
…rations with max_unmatched_batch_size (open-telemetry#32347)

**Description:** add example configurations for recombine operator with
`max_unmatched_batch_size`

**Link to tracking Issue:** open-telemetry#31653
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request pkg/stanza
Projects
None yet
Development

No branches or pull requests

5 participants