
Initial failover connector PR #27641

Closed

Conversation

akats7
Contributor

@akats7 akats7 commented Oct 12, 2023

This is the initial PR for the failoverconnector, a new component that allows health-based routing between trace, metric, and log pipelines depending on the health of the target downstream exporters.

At the moment, the mechanism used to trigger failover is consumer errors (see #20766).

The logic flow for the connector is as follows: the connector takes in a list of 'levels', each of which can contain multiple pipelines. If any pipeline at the stable level fails, the level is considered unhealthy and the connector moves down one priority level, routing all data to the new level (assuming it is stable).

Once the connector has settled on a stable level, it will periodically try to reestablish a connection with the higher priority levels. The retry behavior is controlled by two config params, retry_pipeline_interval and retry_gap. retry_pipeline_interval is the frequency at which the connector iterates through all unhealthy higher priority levels, while retry_gap is how long it waits after a failed retry at one level before retrying the next level (if the retry gap is 2m, after trying to reestablish level 1 it will wait 2m before trying level 2). It will retry a maximum of one unhealthy level before returning to the current stable level. There is also a max_retry config param that tracks how many retries have occurred at each level; once the max is hit, that level is no longer retried. If a retry reestablishes a connection after multiple failures, the retry count for that level is reset.

An example config may look as follows:

connectors:
  failover:
    priority:
      - [traces/first, traces/fourth]
      - [traces/second]
      - [traces/third]
    retry_pipeline_interval: 5m
    retry_gap: 1m
    max_retry: 10

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [failover]
    traces/first:
      receivers: [failover]
      exporters: [otlp/first]
    traces/second:
      receivers: [failover]
      exporters: [otlp/second]
    traces/third:
      receivers: [failover]
      exporters: [otlp/third]
    traces/fourth:
      receivers: [failover]
      exporters: [otlp/third]
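
For illustration, here is a rough sketch of the level-selection and retry bookkeeping described above. All names, types, and the probe callback are hypothetical and simplified; this is not the actual component code.

package main

import (
	"fmt"
	"time"
)

// level groups the pipelines that share a priority.
type level struct {
	pipelines  []string
	healthy    bool
	retryCount int
}

// failoverState tracks which priority level is currently receiving data.
type failoverState struct {
	levels        []*level
	current       int           // index of the current stable level
	retryInterval time.Duration // how often to sweep the unhealthy higher-priority levels
	retryGap      time.Duration // pause between retrying two different levels
	maxRetries    int           // per-level cap; a level that hits it is no longer retried
}

// onPipelineError marks the current level unhealthy and falls through to the
// next healthy lower-priority level.
func (s *failoverState) onPipelineError() error {
	s.levels[s.current].healthy = false
	for i := s.current + 1; i < len(s.levels); i++ {
		if s.levels[i].healthy {
			s.current = i
			return nil
		}
	}
	return fmt.Errorf("no healthy levels remaining")
}

// retryHigherLevels would run once per retry interval (simplified here): it
// probes the unhealthy higher-priority levels, waits retryGap between attempts,
// promotes the first level that recovers, and resets that level's retry count.
func (s *failoverState) retryHigherLevels(probe func(pipelines []string) bool) {
	for i := 0; i < s.current; i++ {
		lvl := s.levels[i]
		if lvl.healthy || lvl.retryCount >= s.maxRetries {
			continue
		}
		if probe(lvl.pipelines) {
			lvl.healthy = true
			lvl.retryCount = 0
			s.current = i
			return
		}
		lvl.retryCount++
		time.Sleep(s.retryGap)
	}
}

func main() {
	s := &failoverState{
		levels: []*level{
			{pipelines: []string{"traces/first", "traces/fourth"}, healthy: true},
			{pipelines: []string{"traces/second"}, healthy: true},
			{pipelines: []string{"traces/third"}, healthy: true},
		},
		retryInterval: 5 * time.Minute,
		retryGap:      1 * time.Minute,
		maxRetries:    10,
	}
	_ = s.onPipelineError() // level 0 fails, so data now routes to level 1
	fmt.Println("current level:", s.current)
	s.retryHigherLevels(func([]string) bool { return true }) // level 0 recovers
	fmt.Println("current level:", s.current)
}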

Link to tracking Issue: Resolves #20766

Note: This is a working PR. I am going to make additional changes/refactor and still need to add test cases, but I wanted to socialize the initial PR sooner rather than later.

cc: @djaglowski @atoulme @sethallen

@akats7 akats7 requested a review from a team as a code owner October 12, 2023 02:19
@github-actions github-actions bot added the cmd/otelcontribcol otelcontribcol command label Oct 12, 2023
Member

@djaglowski djaglowski left a comment


Thanks for this work @akats. I'm excited to see this PR.

At a high level it looks pretty good to me. However, we typically need new components submitted as a series of PRs in order to give it a proper review. At the bottom of this section, there is a suggested sequence which has worked well for reviewers.

Comment on lines +43 to +47
	if err == nil {
		consumers = append(consumers, newConsumer)
	} else {
		return errConsumer
	}
Member

nit: prefer to return early in order to allow for less nested logic

Suggested change
-	if err == nil {
-		consumers = append(consumers, newConsumer)
-	} else {
-		return errConsumer
-	}
+	if err != nil {
+		return errConsumer
+	}
+	consumers = append(consumers, newConsumer)

@sethallen

Agreed with @djaglowski, this is great to see coming along @akats! Thank you so much!

I was looking at the configuration choices and thinking about the Circuit Breaker pattern, which fits nicely with fail-over behavior to help decide what to do with a failed pipeline. I think the "max_retry" could be replaced with the Circuit Breaker pattern:

  • Circuit is opened based on a circuit breaker policy (like exhausting a retry count)
  • Circuit is set to half-open to send a small trickle of data occasionally to see if it can successfully use that pipeline again
    • Other data continues to use the currently successful pipeline, so the half-open pipeline is treated as a canary
  • Circuit is set to closed when an attempt during the half-open state succeeds
  • Leverage the Persistent Queue of the Exporter Helper to provide options to "overflow" data when all pipelines are "open" and failing to accept data.
    • This would add a configurable Overflow option, as well as enable control of when to "toss" the data if all pipelines have failed and the overflow file is full or not configured.

This is a pretty decent write-up on the pattern:
https://learn.microsoft.com/en-us/azure/architecture/patterns/circuit-breaker
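
For reference, here is a minimal Go sketch of the closed/open/half-open transitions described above. The names and thresholds are hypothetical; this is not code from this PR.

package main

import (
	"fmt"
	"time"
)

type circuitState int

const (
	closed   circuitState = iota // healthy: data flows through the pipeline
	open                         // failed: data is routed to another pipeline
	halfOpen                     // probing: a small trickle of data is sent as a canary
)

type breaker struct {
	state       circuitState
	failures    int
	maxFailures int           // policy, e.g. an exhausted retry count
	openedAt    time.Time
	probeAfter  time.Duration // how long to wait before moving open -> half-open
}

// allow reports whether a send may be attempted, moving the circuit from open
// to half-open once enough time has passed since it opened.
func (b *breaker) allow() bool {
	if b.state == open && time.Since(b.openedAt) >= b.probeAfter {
		b.state = halfOpen
	}
	return b.state != open
}

// onResult updates the circuit based on whether the attempted send succeeded.
func (b *breaker) onResult(ok bool) {
	switch {
	case ok:
		b.state = closed // any success while half-open (or closed) closes the circuit
		b.failures = 0
	case b.state == halfOpen:
		b.state = open // the canary failed; reopen and wait again
		b.openedAt = time.Now()
	default:
		b.failures++
		if b.failures >= b.maxFailures {
			b.state = open
			b.openedAt = time.Now()
		}
	}
}

func main() {
	b := &breaker{maxFailures: 3, probeAfter: time.Minute}
	b.onResult(false)
	b.onResult(false)
	b.onResult(false) // third failure opens the circuit
	fmt.Println("allowed:", b.allow())
}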

@djaglowski
Member

@akats7 @sethallen, perhaps the right approach here, w/r/t retry or circuit breaker options, is to start with something very simple then add more sophisticated options once we have the component in place.

@akats7
Contributor Author

akats7 commented Oct 16, 2023

Thanks for this work @akats. I'm excited to see this PR.

At a high level it looks pretty good to me. However, we typically need new components submitted as a series of PRs in order to give it a proper review. At the bottom of this section, there is a suggested sequence which has worked well for reviewers.

@djaglowski Gotcha, I can break it up

@akats7
Contributor Author

akats7 commented Oct 16, 2023

Agreed with @djaglowski, this is great to see coming along @akats! Thank you so much!

I was looking at the configuration choices and thinking about the Circuit Breaker pattern, which fits nicely with fail-over behavior to help decide what to do with a failed pipeline. I think the "max_retry" could be replaced with the Circuit Breaker pattern:

  • Circuit is opened based on a circuit breaker policy (like exhausting a retry count)
  • Circuit is set to half-open to send a small trickle of data occasionally to see if it can successfully use that pipeline again
    • Other data continues to use the currently successful pipeline, so the half-open pipeline is treated as a canary
  • Circuit is set to closed when an attempt during the half-open state succeeds
  • Leverage the Persistent Queue of the Exporter Helper to provide options to "overflow" data when all pipelines are "open" and failing to accept data.
    • This would add a configurable Overflow option, as well as enable control of when to "toss" the data if all pipelines have failed and the overflow file is full or not configured.

This is a pretty decent write-up on the pattern: https://learn.microsoft.com/en-us/azure/architecture/patterns/circuit-breaker

@sethallen Yep, thanks for providing this. I'll look into it further; to @djaglowski's point, once we have a decent enough MVP we can enhance it.

@sethallen

sethallen commented Oct 16, 2023

@akats7 @sethallen, perhaps the right approach here, w/r/t retry or circuit breaker options, is to start with something very simple then add more sophisticated options once we have the component in place.

I totally agree on simplifying and iterating, and the current PR has a lot of value in its current form. I'm looking to discuss terminology for this PR, rather than trying to get those features added right now.

Should the internal code, and possibly config settings, be changed now so they can remain stable as the Overflow and Circuit Breaker features are introduced? For example, if the concept of "pipeline stability" were changed to "circuit state", using values:

  • closed (indicates events flow through it due to stable pipeline)
  • open (indicates events no longer flow through it due to unstable pipeline)

This would make it easier to add the "halfOpen" circuit state in the future and introduce logic for determining how to retry a circuit and get it back to "closed" state.
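
A tiny sketch of what that rename might look like internally (hypothetical names, just to illustrate the terminology; not code from this PR):

package failover

// circuitState would replace the boolean "stable" notion, so that more states
// can be added later without reshaping the config or the internals.
type circuitState string

const (
	circuitClosed circuitState = "closed" // events flow through a stable pipeline
	circuitOpen   circuitState = "open"   // events no longer flow through an unstable pipeline
	// circuitHalfOpen would be added later for canary-style retries.
)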

@djaglowski
Member

I totally agree on simplifying and iterating, and the current PR has a lot of value in its current form. I'm looking to discuss terminology for this PR, rather than trying to get those features added right now.

Should the internal code, and possibly config settings, be changed now so they can remain stable as the Overflow and Circuit Breaker features are introduced? For example, if the concept of "pipeline stability" were changed to "circuit state", using values:

  • closed (indicates events flow through it due to stable pipeline)
  • open (indicates events no longer flow through it due to unstable pipeline)

This would make it easier to add the "halfOpen" circuit state in the future and introduce logic for determining how to retry a circuit and get it back to "closed" state.

All of this can be addressed later. I suggest we get the basics in place before debating further.

@sethallen

I totally agree on simplifying and iterating, and the current PR has a lot of value in its current form. I'm looking to discuss terminology for this PR, rather than trying to get those features added right now.
Should the internal code, and possibly config settings, be changed now so they can remain stable as the Overflow and Circuit Breaker features are introduced? For example, if the concept of "pipeline stability" were changed to "circuit state", using values:

  • closed (indicates events flow through it due to stable pipeline)
  • open (indicates events no longer flow through it due to unstable pipeline)

This would make it easier to add the "halfOpen" circuit state in the future and introduce logic for determining how to retry a circuit and get it back to "closed" state.

All of this can be addressed later. I suggest we get the basics in place before debating further.

Sounds good. I'm all for getting the basics in place!

Contributor

This file shouldn't be deleted.

@MovieStoreGuy
Contributor

Just following up on the conversation, @akats7 are you planning to split up this PR?

@akats7
Contributor Author

akats7 commented Oct 27, 2023

Just following up on the conversation, @akats7 are you planning to split up this PR?

Hi @MovieStoreGuy, yep, I'll split it. Sorry for the pause; I'll continue making the changes.

Contributor

This PR was marked stale due to lack of activity. It will be closed in 14 days.

@github-actions github-actions bot added the Stale label Nov 11, 2023
djaglowski pushed a commit that referenced this pull request Nov 15, 2023
This is the Part 1 PR for the Failover Connector (split according to the
CONTRIBUTING.md doc)

Link to tracking Issue: #20766 

Testing: Added factory test

Note: Full functionality PR exists
[here](#27641)
and will likely be refactored to serve as the part 2 PR

cc: @djaglowski @sethallen @MovieStoreGuy
RoryCrispin pushed a commit to ClickHouse/opentelemetry-collector-contrib that referenced this pull request Nov 24, 2023
This is the Part 1 PR for the Failover Connector (split according to the
CONTRIBUTING.md doc)

Link to tracking Issue: open-telemetry#20766 

Testing: Added factory test

Note: Full functionality PR exists
[here](open-telemetry#27641)
and will likely be refactored to serve as the part 2 PR

cc: @djaglowski @sethallen @MovieStoreGuy
Contributor

Closed as inactive. Feel free to reopen if this PR is still being worked on.

@github-actions github-actions bot closed this Nov 26, 2023
Labels
cmd/otelcontribcol otelcontribcol command Stale
Development

Successfully merging this pull request may close these issues.

New component: Failover Connector
4 participants