New component: Failover Connector #20766

Open
2 tasks
djaglowski opened this issue Apr 8, 2023 · 20 comments
Assignees
Labels
Accepted Component New component has been sponsored

Comments

@djaglowski
Member

The purpose and use-cases of the new component

A connector that routes data based on the current health status of a downstream component, typically an exporter.

I have heard several users ask for the ability to send data to a backup exporter if a primary exporter fails. I believe this could be implemented as a routing connector.

The user would specify at least one pipeline to which data would typically be routed. Additionally, the user must specify at least one backup pipeline, which would be used when an error is encountered.

Initially, I think the trigger for routing to a backup pipeline could be based on backpropagated errors, though this is not yet very robust (see open-telemetry/opentelemetry-collector#7460). At a later time, I imagine this could be based on the health status of an exporter (see open-telemetry/opentelemetry-collector#6344).

Example configuration for the component

receivers:
  foo:

exporters:
  bar/main:
  bar/backup:

connectors:
  failover:
    primary: logs/main
    secondary: logs/backup

service:
  pipelines:
    logs/in:
      receivers: [foo]
      exporters: [failover]
    logs/main:
      receivers: [failover]
      exporters: [bar/main]
    logs/backup:
      receivers: [failover]
      exporters: [bar/backup]

Telemetry data types supported

traces->traces
metrics->metrics
logs->logs

Is this a vendor-specific component?

  • This is a vendor-specific component
  • If this is a vendor-specific component, I am proposing to contribute this as a representative of the vendor.

Sponsor (optional)

No response

Additional context

No response

@djaglowski added the "Sponsor Needed" and "needs triage" labels on Apr 8, 2023
@sethallen

sethallen commented Apr 9, 2023

I'm glad you made this @djaglowski! I was just chatting with @atoulme about adding failover and circuit breaker support for exporters a couple days ago. The connector seems like a great method to add broad failover support.

How about tweaking this slightly to support 1..N entries as a YAML flow sequence? It would reduce complexity in the failover connector by removing the need for keys (primary, secondary, etc.) to choose the next pipeline to fail over to.

Example:

receivers:
  foo:

exporters:
  bar/main:
  bar/backup:
  bar/backup2:

connectors:
  failover: [logs/main, logs/backup, logs/backup2]  # ... up to n entries
#    primary: logs/main
#    secondary: logs/backup

service:
  pipelines:
    logs/in:
      receivers: [foo]
      exporters: [failover]
    logs/main:
      receivers: [failover]
      exporters: [bar/main]
    logs/backup:
      receivers: [failover]
      exporters: [bar/backup]
    logs/backup2:
      receivers: [failover]
      exporters: [bar/backup2]

@fatsheep9146
Contributor

I'd like to sponsor this.

@djaglowski
Member Author

@sethallen, I like the idea of allowing a priority list, but I think we should leave room for other parameters as well. I also think we need to allow multiple pipelines per "level".

connectors:
  failover: 
    priority:
      - [logs/main]
      - [logs/backup, logs/backup2]
      - [logs/backup/3]
    min_failover_interval: 2m # Possibly would add this in future

@djaglowski added the "Accepted Component" label and removed the "Sponsor Needed" and "needs triage" labels on Apr 10, 2023
@cparkins
Contributor

@djaglowski How would the multiple pipelines be used? As a fan-out, or in sequence (Priority 1, Priority 2-1, Priority 2-2, ... Priority N)?

@djaglowski
Member Author

@cparkins, when there are multiple pipelines at the same priority level, it would fan out data to those pipelines.

@github-actions
Contributor

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

@akats7
Contributor

akats7 commented Jul 12, 2023

@djaglowski I also had this feature request. I'd be happy to work on it or support it in any way I can.

@djaglowski
Member Author

@akats7, any help moving this forward would be great. I'll be happy to review any PRs.

@akats7
Contributor

akats7 commented Jul 12, 2023

@djaglowski sounds good, can I please be assigned this issue?

@sethallen

@djaglowski / @akats7 / @atoulme - Perhaps this work effort can be merged with what @cparkins has been working on internally for us over the last few months. He added resiliency features (Failover, Circuit Breaker) to the Splunk HEC Exporter for the OTel Collector and submitted them in the PR below:

@djaglowski
Member Author

@sethallen, I'm supportive of the idea. In my opinion, failover at least should be implemented as a connector because in many cases it may be appropriate to fail over to a different type of exporter. If I recall correctly, you and/or @cparkins looked into the idea of implementing other resiliency features in a connector. Do you still see that as a viable path? Either way, I think the failover connector should move forward and we can add additional capabilities based on a proposal.


@github-actions github-actions bot added the Stale label Sep 11, 2023
@akats7
Contributor

akats7 commented Sep 11, 2023

^ I was able to begin looking into this recently and will open a first pass PR for this shortly.

@github-actions github-actions bot removed the Stale label Sep 11, 2023
@sethallen

That's exciting @akats7. We've been maintaining an internal fork of resiliency features added to the Splunk HEC Exporter and would love to get these features somewhere into the mainline collector. Your PR for a Connector will be great to see and hopefully help with. Cheers!


@github-actions github-actions bot added the Stale label Nov 13, 2023
djaglowski pushed a commit that referenced this issue Nov 15, 2023
This is the Part 1 PR for the Failover Connector (split according to the
CONTRIBUTING.md doc)

Link to tracking Issue: #20766 

Testing: Added factory test

Note: Full functionality PR exists
[here](#27641)
and will likely be refactored to serve as the part 2 PR

cc: @djaglowski @sethallen @MovieStoreGuy
RoryCrispin pushed a commit to ClickHouse/opentelemetry-collector-contrib that referenced this issue Nov 24, 2023
This is the Part 1 PR for the Failover Connector (split according to the
CONTRIBUTING.md doc)

Link to tracking Issue: open-telemetry#20766 

Testing: Added factory test

Note: Full functionality PR exists
[here](open-telemetry#27641)
and will likely be refactored to serve as the part 2 PR

cc: @djaglowski @sethallen @MovieStoreGuy
djaglowski pushed a commit that referenced this issue Dec 12, 2023
This is the 2nd PR for the failover connector that implements the core
failover functionality. It is currently in place for Traces and once
solidified will be repeated for metrics and logs

Link to tracking Issue: #20766

Note: Will add traces tests today but pushing up to begin review

cc: @djaglowski @fatsheep9146
djaglowski pushed a commit that referenced this issue Jan 8, 2024
This is the 3rd PR for the failover connector. This PR adds support for
metric and log pipelines

Link to tracking Issue:
#20766

cc: @djaglowski @fatsheep9146
cparkins pushed a commit to AmadeusITGroup/opentelemetry-collector-contrib that referenced this issue Jan 10, 2024
This is the 3rd PR for the failover connector. This PR adds support for
metric and log pipelines

Link to tracking Issue:
open-telemetry#20766

cc: @djaglowski @fatsheep9146

@github-actions github-actions bot added the Stale label Jan 15, 2024
@djaglowski added and removed the "Accepted Component" label on Jan 16, 2024

@github-actions github-actions bot added the Stale label Mar 18, 2024
@verejoel

Would love to revive this, definitely interested in this topic.

@djaglowski
Member Author

Thanks for pinging this @verejoel.

An implementation is in place but stability is still marked as development. @akats7, do you recall what is left to do? If we have a minimally functional component then I think we should move it to alpha status, close this issue and open issues for any additional functionality we would like.

@akats7
Contributor

akats7 commented Mar 25, 2024

Hey @djaglowski @verejoel,

Yep, the MVP functionality is in place. I did have one more change I've been planning to push, so I'll push that along with the update to Alpha.

@github-actions github-actions bot removed the Stale label Mar 26, 2024