
Initial failover connector PR #27641

Closed

Conversation

akats7
Contributor

@akats7 akats7 commented Oct 12, 2023

This is the initial PR for the failoverconnector, a new component that allows health-based routing between trace, metric, and log pipelines depending on the health of the target downstream exporters.

At the moment, the mechanism used to trigger failover is consumer errors (see #20766).

The logic flow for the connector is as follows: the connector takes in a list of 'levels', each of which can contain multiple pipelines. If any pipeline at the stable level fails, the level is considered unhealthy and the connector moves down one priority level, routing all data to the new level (assuming it is stable).

Once the connector has settled on a stable level, it will periodically try to reestablish a connection with the higher priority levels. The retry behavior is controlled by two config params, retry_pipeline_interval and retry_gap. retry_pipeline_interval is the frequency at which the connector iterates through all unhealthy higher priority levels, while retry_gap is how long it waits after a failed retry at one level before retrying the next level (if the retry gap is 2m, after trying to reestablish level 1 it will wait 2m before trying level 2). It will retry a maximum of one unhealthy level before returning to the current stable level. There is also a max_retry config param that tracks how many retries have occurred at each level; once the max is hit, that level is no longer retried. If a retry reestablishes a connection after multiple failures, the retry count for that level is reset.

An example config may look as follows:

connectors:
  failover:
    priority:
      - [traces/first, traces/fourth]
      - [traces/second]
      - [traces/third]
    retry_pipeline_interval: 5m
    retry_gap: 1m
    max_retry: 10

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [failover]
    traces/first:
      receivers: [failover]
      exporters: [otlp/first]
    traces/second:
      receivers: [failover]
      exporters: [otlp/second]
    traces/third:
      receivers: [failover]
      exporters: [otlp/third]
    traces/fourth:
      receivers: [failover]
      exporters: [otlp/third]
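
For illustration, here is a rough sketch of the level-selection and retry bookkeeping described above. All names, types, and the probe callback are hypothetical and simplified; this is not the actual component code.

package main

import (
	"fmt"
	"time"
)

// level groups the pipelines that share a priority.
type level struct {
	pipelines  []string
	healthy    bool
	retryCount int
}

// failoverState tracks which priority level is currently receiving data.
type failoverState struct {
	levels        []*level
	current       int           // index of the current stable level
	retryInterval time.Duration // how often to sweep the unhealthy higher-priority levels
	retryGap      time.Duration // pause between retrying two different levels
	maxRetries    int           // per-level cap; a level that hits it is no longer retried
}

// onPipelineError marks the current level unhealthy and falls through to the
// next healthy lower-priority level.
func (s *failoverState) onPipelineError() error {
	s.levels[s.current].healthy = false
	for i := s.current + 1; i < len(s.levels); i++ {
		if s.levels[i].healthy {
			s.current = i
			return nil
		}
	}
	return fmt.Errorf("no healthy levels remaining")
}

// retryHigherLevels would run once per retry interval (simplified here): it
// probes the unhealthy higher-priority levels, waits retryGap between attempts,
// promotes the first level that recovers, and resets that level's retry count.
func (s *failoverState) retryHigherLevels(probe func(pipelines []string) bool) {
	for i := 0; i < s.current; i++ {
		lvl := s.levels[i]
		if lvl.healthy || lvl.retryCount >= s.maxRetries {
			continue
		}
		if probe(lvl.pipelines) {
			lvl.healthy = true
			lvl.retryCount = 0
			s.current = i
			return
		}
		lvl.retryCount++
		time.Sleep(s.retryGap)
	}
}

func main() {
	s := &failoverState{
		levels: []*level{
			{pipelines: []string{"traces/first", "traces/fourth"}, healthy: true},
			{pipelines: []string{"traces/second"}, healthy: true},
			{pipelines: []string{"traces/third"}, healthy: true},
		},
		retryInterval: 5 * time.Minute,
		retryGap:      1 * time.Minute,
		maxRetries:    10,
	}
	_ = s.onPipelineError() // level 0 fails, so data now routes to level 1
	fmt.Println("current level:", s.current)
	s.retryHigherLevels(func([]string) bool { return true }) // level 0 recovers
	fmt.Println("current level:", s.current)
}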

Link to tracking Issue: Resolves #20766

Note: This is a working PR. I am going to make additional changes/refactor and still need to add test cases, but I wanted to socialize the initial PR sooner rather than later.

cc: @djaglowski @atoulme @sethallen

@akats7 akats7 requested a review from a team as a code owner October 12, 2023 02:19
@github-actions github-actions bot added the cmd/otelcontribcol otelcontribcol command label Oct 12, 2023
Member

@djaglowski djaglowski left a comment


Thanks for this work @akats. I'm excited to see this PR.

At a high level it looks pretty good to me. However, we typically need new components submitted as a series of PRs in order to give it a proper review. At the bottom of this section, there is a suggested sequence which has worked well for reviewers.

Comment on lines +43 to +47
	if err == nil {
		consumers = append(consumers, newConsumer)
	} else {
		return errConsumer
	}
Member

nit: prefer to return early in order to allow for less nested logic

Suggested change
-	if err == nil {
-		consumers = append(consumers, newConsumer)
-	} else {
-		return errConsumer
-	}
+	if err != nil {
+		return errConsumer
+	}
+	consumers = append(consumers, newConsumer)

@sethallen

Agreed with @djaglowski, this is great to see coming along @akats! Thank you so much!

I was looking at the configuration choices and thinking about the Circuit Breaker pattern, which fits nicely with fail-over behavior to help decide what to do with a failed pipeline. I think the "max_retry" could be replaced with the Circuit Breaker pattern:

  • Circuit is opened based on a circuit breaker policy (like exhausting a retry count)
  • Circuit is set to half-open to send a small trickle of data occasionally to see if it can successfully use that pipeline again
    • Other data continues to use the currently successful pipeline, so the half-open pipeline is treated as a canary
  • Circuit is set to closed when an attempt during the half-open state succeeds
  • Leverage the Persistent Queue of the Exporter Helper to provide options to "overflow" data when all pipelines are "open" and failing to accept data.
    • This would add a configurable Overflow option, as well as enable control of when to "toss" the data if all pipelines have failed and the overflow file is full or not configured.

This is a pretty decent write-up on the pattern:
https://learn.microsoft.com/en-us/azure/architecture/patterns/circuit-breaker
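
For reference, here is a minimal Go sketch of the closed/open/half-open transitions described above. The names and thresholds are hypothetical; this is not code from this PR.

package main

import (
	"fmt"
	"time"
)

type circuitState int

const (
	closed   circuitState = iota // healthy: data flows through the pipeline
	open                         // failed: data is routed to another pipeline
	halfOpen                     // probing: a small trickle of data is sent as a canary
)

type breaker struct {
	state       circuitState
	failures    int
	maxFailures int           // policy, e.g. an exhausted retry count
	openedAt    time.Time
	probeAfter  time.Duration // how long to wait before moving open -> half-open
}

// allow reports whether a send may be attempted, moving the circuit from open
// to half-open once enough time has passed since it opened.
func (b *breaker) allow() bool {
	if b.state == open && time.Since(b.openedAt) >= b.probeAfter {
		b.state = halfOpen
	}
	return b.state != open
}

// onResult updates the circuit based on whether the attempted send succeeded.
func (b *breaker) onResult(ok bool) {
	switch {
	case ok:
		b.state = closed // any success while half-open (or closed) closes the circuit
		b.failures = 0
	case b.state == halfOpen:
		b.state = open // the canary failed; reopen and wait again
		b.openedAt = time.Now()
	default:
		b.failures++
		if b.failures >= b.maxFailures {
			b.state = open
			b.openedAt = time.Now()
		}
	}
}

func main() {
	b := &breaker{maxFailures: 3, probeAfter: time.Minute}
	b.onResult(false)
	b.onResult(false)
	b.onResult(false) // third failure opens the circuit
	fmt.Println("allowed:", b.allow())
}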

@djaglowski
Member

@akats7 @sethallen, perhaps the right approach here, w/r/t retry or circuit breaker options, is to start with something very simple then add more sophisticated options once we have the component in place.

@akats7
Contributor Author

akats7 commented Oct 16, 2023

Thanks for this work @akats. I'm excited to see this PR.

At a high level it looks pretty good to me. However, we typically need new components submitted as a series of PRs in order to give it a proper review. At the bottom of this section, there is a suggested sequence which has worked well for reviewers.

@djaglowski Gotcha, I can break it up

@akats7
Contributor Author

akats7 commented Oct 16, 2023

Agreed with @djaglowski, this is great to see coming along @akats! Thank you so much!

I was looking at the configuration choices and thinking about the Circuit Breaker pattern, which fits nicely with fail-over behavior to help decide what to do with a failed pipeline. I think the "max_retry" could be replaced with the Circuit Breaker pattern:

  • Circuit is opened based on a circuit breaker policy (like exhausting a retry count)
  • Circuit is set to half-open to send a small trickle of data occasionally to see if it can successfully use that pipeline again
    • Other data continues to use the currently successful pipeline, so the half-open pipeline is treated as a canary
  • Circuit is set to closed when an attempt during the half-open state succeeds
  • Leverage the Persistent Queue of the Exporter Helper to provide options to "overflow" data when all pipelines are "open" and failing to accept data.
    • This would add a configurable Overflow option, as well as enable control of when to "toss" the data if all pipelines have failed and the overflow file is full or not configured.

This is a pretty decent write-up on the pattern: https://learn.microsoft.com/en-us/azure/architecture/patterns/circuit-breaker

@sethallen Yep, thanks for providing this. I'll look into it further; to @djaglowski's point, once we have a decent enough MVP we can enhance it.

@sethallen

sethallen commented Oct 16, 2023

@akats7 @sethallen, perhaps the right approach here, w/r/t retry or circuit breaker options, is to start with something very simple then add more sophisticated options once we have the component in place.

I totally agree on simplifying and iterating, and the current PR has a lot of value in its current form. I'm looking to discuss terminology for this PR, rather than trying to get those features added right now.

Should the internal code, and possibly config settings, be changed now so they can remain stable as the Overflow and Circuit Breaker features are introduced? For example, if the concept of "pipeline stability" were changed to "circuit state", using values:

  • closed (indicates events flow through it due to stable pipeline)
  • open (indicates events no longer flow through it due to unstable pipeline)

This would make it easier to add the "halfOpen" circuit state in the future and introduce logic for determining how to retry a circuit and get it back to "closed" state.
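
A tiny sketch of what that rename might look like internally (hypothetical names, just to illustrate the terminology; not code from this PR):

package failover

// circuitState would replace the boolean "stable" notion, so that more states
// can be added later without reshaping the config or the internals.
type circuitState string

const (
	circuitClosed circuitState = "closed" // events flow through a stable pipeline
	circuitOpen   circuitState = "open"   // events no longer flow through an unstable pipeline
	// circuitHalfOpen would be added later for canary-style retries.
)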

@djaglowski
Member

I totally agree on simplifying and iterating, and the current PR has a lot of value in its current form. I'm looking to discuss terminology for this PR, rather than trying to get those features added right now.

Should the internal code, and possibly config settings, be changed now so they can remain stable as the Overflow and Circuit Breaker features are introduced? For example, if the concept of "pipeline stability" were changed to "circuit state", using values:

  • closed (indicates events flow through it due to stable pipeline)
  • open (indicates events no longer flow through it due to unstable pipeline)

This would make it easier to add the "halfOpen" circuit state in the future and introduce logic for determining how to retry a circuit and get it back to "closed" state.

All of this can be addressed later. I suggest we get the basics in place before debating further.

@sethallen

I totally agree on simplifying and iterating, and the current PR has a lot of value in its current form. I'm looking to discuss terminology for this PR, rather than trying to get those features added right now.
Should the internal code, and possibly config settings, be changed now so they can remain stable as the Overflow and Circuit Breaker features are introduced? For example, if the concept of "pipeline stability" were changed to "circuit state", using values:

  • closed (indicates events flow through it due to stable pipeline)
  • open (indicates events no longer flow through it due to unstable pipeline)

This would make it easier to add the "halfOpen" circuit state in the future and introduce logic for determining how to retry a circuit and get it back to "closed" state.

All of this can be addressed later. I suggest we get the basics in place before debating further.

Sounds good. I'm all for getting the basics in place!

Contributor

This file shouldn't be deleted.

@MovieStoreGuy
Contributor

Just following up on the conversation, @akats7 are you planning to split up this PR?

@akats7
Contributor Author

akats7 commented Oct 27, 2023

Just following up on the conversation, @akats7 are you planning to split up this PR?

Hi @MovieStoreGuy, yep, I'll split it. Sorry for the pause; I'll continue making the changes.

Contributor

This PR was marked stale due to lack of activity. It will be closed in 14 days.

@github-actions github-actions bot added the Stale label Nov 11, 2023
djaglowski pushed a commit that referenced this pull request Nov 15, 2023
This is the Part 1 PR for the Failover Connector (split according to the
CONTRIBUTING.md doc)

Link to tracking Issue: #20766 

Testing: Added factory test

Note: Full functionality PR exists
[here](#27641)
and will likely be refactored to serve as the part 2 PR

cc: @djaglowski @sethallen @MovieStoreGuy
RoryCrispin pushed a commit to ClickHouse/opentelemetry-collector-contrib that referenced this pull request Nov 24, 2023
This is the Part 1 PR for the Failover Connector (split according to the
CONTRIBUTING.md doc)

Link to tracking Issue: open-telemetry#20766 

Testing: Added factory test

Note: Full functionality PR exists
[here](open-telemetry#27641)
and will likely be refactored to serve as the part 2 PR

cc: @djaglowski @sethallen @MovieStoreGuy
Contributor

Closed as inactive. Feel free to reopen if this PR is still being worked on.

@github-actions github-actions bot closed this Nov 26, 2023
Labels
cmd/otelcontribcol otelcontribcol command Stale
Development

Successfully merging this pull request may close these issues.

New component: Failover Connector
4 participants