Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Send EDS response in ADS mode when Envoy reconnect and asks with Empty version #127

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

lukidzi
Copy link
Contributor

@lukidzi lukidzi commented Jan 30, 2020

We found that Envoy has a cluster in warming state after reconnect to new control-plane.
So let's assume Envoy is connected to Control-plane1. Control-plane1 receive information about the new cluster[Cluster1] and send CDS response to Envoy using ADS. Control-plane1 die and at the same time cluster[Cluster1] has been removed. Envoy connects to Control-plane2 and send[EDS,LDS,RDS] requests. Because Envoy didn't received response EDS from Control-plane1 this cluster is in warming state. EDS request from Envoy to Control-plane2contains resource name of removed cluster[Cluster1] and Control-plane2 doesn't know about the cluster which was sent by Control-plane1 to Envoy. Because ADS implementation requires to send only response when all requested resources in snapshot Control-plane2 won’t respond. In this situation Envoy stay in warming phase for ever.

@codecov-io
Copy link

Codecov Report

Merging #127 into master will increase coverage by 0.28%.
The diff coverage is 100%.

Impacted file tree graph

@@             Coverage Diff              @@
##             master     #127      +/-   ##
============================================
+ Coverage      89.5%   89.78%   +0.28%     
- Complexity      193      198       +5     
============================================
  Files            22       22              
  Lines           600      617      +17     
  Branches         48       53       +5     
============================================
+ Hits            537      554      +17     
  Misses           48       48              
  Partials         15       15
Impacted Files Coverage Δ Complexity Δ
.../io/envoyproxy/controlplane/cache/SimpleCache.java 84.84% <100%> (+2.23%) 35 <0> (+5) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 046ab2b...6b1a5f1. Read the comment docs.

…y version

Signed-off-by: Lukasz Dziedziak <lukasz.dziedziak@allegro.pl>
@slonka
Copy link
Member

slonka commented Sep 22, 2020

@lukidzi is this still relevant?

@mpuncel
Copy link
Contributor

mpuncel commented Sep 28, 2020

This does seem like a race that could get Envoy stuck, so it probably is worth merging in one form or another.

I think probably the logic could be:

  1. If there is a "missing" ClusterLoadAssignment that Envoy asked for, check if there is a corresponding cluster in the Snapshot. If there is not, respond with an empty EDS message (because probably it's this race, where the cluster was deleted but Envoy didn't see a CDS update removing it).

  2. If the cluster DOES exist in the snapshot, then either 1) the caller generated an inconsistent Snapshot as a bug or 2) the caller will provide a subsequent Snapshot that has the endpoints. I think in either of these cases, we could respond with an empty EDS if the acked version is "", otherwise we should keep the current behavior of not responding. I could also be convinced to just keep the current behavior of not responding, although it's probably better to return something to envoy so other clusters/endpoints don't get stuck on a single bad/missing resource.

Base automatically changed from master to main January 16, 2021 21:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants