Initial failover connector PR #27641
Conversation
Force-pushed from b5e161b to 769724f
Thanks for this work @akats7. I'm excited to see this PR.
At a high level it looks pretty good to me. However, we typically need new components submitted as a series of PRs in order to give them a proper review. At the bottom of this section, there is a suggested sequence which has worked well for reviewers.
```go
if err == nil {
	consumers = append(consumers, newConsumer)
} else {
	return errConsumer
}
```
nit: prefer to return early in order to allow for less nested logic
Suggested change:
```go
if err != nil {
	return errConsumer
}
consumers = append(consumers, newConsumer)
```
Agreed with @djaglowski, this is great to see coming along @akats7! Thank you so much! I was looking at the configuration choices and thinking about the Circuit Breaker pattern, which fits nicely with failover behavior to help decide what to do with a failed pipeline. I think "max_retry" could be replaced with the Circuit Breaker pattern.
This is a pretty decent write-up on the pattern:
@akats7 @sethallen, perhaps the right approach here, w/r/t retry or circuit breaker options, is to start with something very simple, then add more sophisticated options once we have the component in place.
@djaglowski Gotcha, I can break it up.
@sethallen Yep, thanks for providing this, I'll look into it further. To @djaglowski's point, once we have a decent enough MVP we can enhance it.
I totally agree on simplifying and iterating, and the current PR has a lot of value in its current form. I'm looking to discuss terminology for this PR rather than trying to get those features added right now. Should the internal code, and possibly config settings, be changed now so they can remain stable as the Overflow and Circuit Breaker features are introduced? For example, the concept of "pipeline stability" could be changed to "circuit state", using values:

- "closed" (the pipeline is healthy and carrying data)
- "open" (the pipeline has failed and data is routed away from it)

This would make it easier to add the "halfOpen" circuit state in the future and introduce logic for determining how to retry a circuit and get it back to the "closed" state.
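As a concrete illustration of the proposed rename, here is a minimal Go sketch; the type and constant names are hypothetical, not code from this PR:

```go
package failoverconnector

// circuitState is a hypothetical enum for the proposed terminology,
// replacing the current "pipeline stability" concept.
type circuitState int

const (
	stateClosed   circuitState = iota // pipeline healthy; data flows through it
	stateOpen                         // pipeline failed; data routed to a lower level
	stateHalfOpen                     // future: probing whether the pipeline has recovered
)
```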
All of this can be addressed later. I suggest we get the basics in place before debating further.
Sounds good. I'm all for getting the basics in place!
This file shouldn't be deleted.
Just following up on the conversation, @akats7, are you planning to split up this PR?
Hi @MovieStoreGuy, yep, I'll split it. Sorry for the pause, I'll continue making the changes.
This PR was marked stale due to lack of activity. It will be closed in 14 days.
This is the Part 1 PR for the Failover Connector (split according to the CONTRIBUTING.md doc).

Link to tracking issue: #20766

Testing: Added factory test

Note: The full-functionality PR exists [here](#27641) and will likely be refactored to serve as the Part 2 PR.

cc: @djaglowski @sethallen @MovieStoreGuy
Closed as inactive. Feel free to reopen if this PR is still being worked on.
This is the initial PR for the failoverconnector, a new component that allows health-based routing between trace, metric, and log pipelines depending on the health of target downstream exporters.

At the moment, the mechanism for detecting failure is consumer errors (see #20766).

The logic flow for the connector is as follows: the connector takes in a list of 'levels', each of which can contain multiple pipelines. If any pipeline at a stable level fails, the level is considered unhealthy, and the connector moves down one priority level, routing all data to the new level (assuming it is stable).

Once the connector has a set stable level, it periodically tries to reestablish a stable connection with the higher-priority levels. The retry behavior is governed by two config params, retryInterval and retryGap. retryInterval is the frequency at which the connector tries to iterate through all unhealthy higher-priority levels, while retryGap is how long it waits after a failed retry at one level before retrying the next level (if retryGap is 2m, then after trying to reestablish level 1, it will wait 2m before trying level 2). It retries a maximum of one unhealthy level at a time before returning to the current stable level. There is also a max_retries config param that tracks how many retries have occurred at each level; once the max is hit, that level is no longer retried. If a retry reestablishes a connection after multiple failures, the retry count for that level is reset.
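The retry flow is easier to see in code. Below is a minimal, hypothetical Go sketch of how such a loop could be structured; none of these names (failoverRouter, retryConfig, tryLevel) come from this PR, and the real implementation may differ:

```go
package failoverconnector

import (
	"context"
	"time"
)

// Hypothetical config and router types, named for illustration only.
type retryConfig struct {
	RetryInterval time.Duration // how often to start a retry pass
	RetryGap      time.Duration // wait between probing successive levels
	MaxRetries    int           // per-level retry cap
}

type failoverRouter struct {
	cfg         retryConfig
	stableLevel int                  // index of the level currently receiving data
	retries     []int                // retry count per level
	tryLevel    func(level int) bool // probes a level; true if it is healthy again
}

// retryLoop sketches the behavior described above: every RetryInterval it
// walks the unhealthy higher-priority levels in order, waiting RetryGap after
// each failed probe. A successful probe promotes that level back to stable
// and resets its retry count; a level that has hit MaxRetries is skipped.
func (f *failoverRouter) retryLoop(ctx context.Context) {
	ticker := time.NewTicker(f.cfg.RetryInterval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			for level := 0; level < f.stableLevel; level++ {
				if f.retries[level] >= f.cfg.MaxRetries {
					continue // max retries hit; this level is no longer retried
				}
				if f.tryLevel(level) {
					f.retries[level] = 0 // reestablished; reset the retry count
					f.stableLevel = level
					break
				}
				f.retries[level]++
				time.Sleep(f.cfg.RetryGap) // pause before probing the next level
			}
		}
	}
}
```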
An example config may look as follows:
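(The config block from the original description isn't preserved here; the YAML below is a hypothetical sketch that assumes key names mirroring the parameters above, rather than the PR's actual schema.)

```yaml
connectors:
  failover:
    # hypothetical keys, mirroring the parameters described above
    priority_levels:
      - [traces/primary]   # level 1: highest priority
      - [traces/backup]    # level 2: used while level 1 is unhealthy
    retry_interval: 5m     # how often to re-probe unhealthy higher levels
    retry_gap: 1m          # wait between retrying successive levels
    max_retries: 10        # per-level cap; after this, stop retrying the level

service:
  pipelines:
    traces/in:
      receivers: [otlp]
      exporters: [failover]
    traces/primary:
      receivers: [failover]
      exporters: [otlp/primary]
    traces/backup:
      receivers: [failover]
      exporters: [otlp/backup]
```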
Link to tracking Issue: Resolves #20766
Note: This is a working PR; I am going to make additional changes/refactor and still need to add test cases, but wanted to socialize the initial PR sooner rather than later.
cc: @djaglowski @atoulme @sethallen