Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix e2e EnvoyShutdown #3283

Closed
wants to merge 6 commits into from
Closed

Conversation

shawnh2
Copy link
Contributor

@shawnh2 shawnh2 commented Apr 26, 2024

What type of PR is this?

What this PR does / why we need it:

  • add test into codecov ignore path
  • trying to fix flaky: e2e TestEGUpgrade/EnvoyShutdown #3262 by increasing the HTTPRoute Duration and move load generation after the deployment has been restarted
  • fix the order of merge-gateways e2e test in makefile

Which issue(s) this PR fixes:

Fixes #3262

Signed-off-by: shawnh2 <shawnhxh@outlook.com>
Signed-off-by: shawnh2 <shawnhxh@outlook.com>
@shawnh2 shawnh2 requested a review from a team as a code owner April 26, 2024 08:47
@shawnh2 shawnh2 marked this pull request as draft April 26, 2024 09:12
@shawnh2 shawnh2 marked this pull request as ready for review April 28, 2024 09:59
@shawnh2
Copy link
Contributor Author

shawnh2 commented Apr 28, 2024

/retest

@guydc
Copy link
Contributor

guydc commented Apr 29, 2024

move load generation after the deployment has been restarted

Hi @shawnh2! The purpose of the test is to run load during the restart and assert that there's no loss of requests... Running it after restart means that we're not really testing for hitless shutdown anymore..

@shawnh2
Copy link
Contributor Author

shawnh2 commented Apr 30, 2024

move load generation after the deployment has been restarted

Hi @shawnh2! The purpose of the test is to run load during the restart and assert that there's no loss of requests... Running it after restart means that we're not really testing for hitless shutdown anymore..

Thanks for the clarification, by root causing this e2e test case, sometimes after the restart process, the url that one request send to may be a 404 path, meaning there are some loss requests.

Maybe by sending another request to detect the url path right after the restart process can avoid this situation.

@alexwo
Copy link
Contributor

alexwo commented Apr 30, 2024

move load generation after the deployment has been restarted

Hi @shawnh2! The purpose of the test is to run load during the restart and assert that there's no loss of requests... Running it after restart means that we're not really testing for hitless shutdown anymore..

Thanks for the clarification, by root causing this e2e test case, sometimes after the restart process, the url that one request send to may be a 404 path, meaning there are some loss requests.

Maybe by sending another request to detect the url path right after the restart process can avoid this situation.

Maybe the upgrade process isn't hitless anymore, I hadn't noticed the 404 issue before, so perhaps that's a new thing. Perhaps it's worth implementing a temporary workaround to prevent instability, but we should also address the underlying issue that causes some requests to result in a 404 error during an upgrade.

@guydc
Copy link
Contributor

guydc commented Apr 30, 2024

Hi @shawnh2 , @alexwo - that's interesting input. Yes, I assume that if failures are 404s, then we have a problem with new envoy proxies receiving traffic before they are programmed.

In that case, I think we can split this activity to two parts:

  • Run load and log load stats, but only assert on success of converging single-user requests. Basically, only make sure that restarted proxies are eventually functional and stop this test from being flaky.
  • Investigate how to fix the underlying issue causing 404s.

WDYT?

@alexwo
Copy link
Contributor

alexwo commented Apr 30, 2024

Hi @shawnh2 , @alexwo - that's interesting input. Yes, I assume that if failures are 404s, then we have a problem with new envoy proxies receiving traffic before they are programmed.

In that case, I think we can split this activity to two parts:

  • Run load and log load stats, but only assert on success of converging single-user requests. Basically, only make sure that restarted proxies are eventually functional and stop this test from being flaky.
  • Investigate how to fix the underlying issue causing 404s.

WDYT?

It's also possible that existing proxies fail, as they may refresh their cache when attempting to obtain a new cache snapshot version from newly started EGs. Maybe something as initialDelaySeconds suggested here can help:
#2810 (comment)

Or potentially:
#2918

I'm not sure that EG infra controller, will re-deploy proxies in every EG upgrade test, perhaps only if there are changes that require re-deployment?

@zirain
Copy link
Contributor

zirain commented Apr 30, 2024

this test skiped in #3306, can you remove this SkipTests?

tests.EnvoyShutdownTest.ShortName, // https://github.com/envoyproxy/gateway/issues/3262

@arkodg
Copy link
Contributor

arkodg commented May 16, 2024

hey @shawnh2 is this PR still needed ?

@shawnh2
Copy link
Contributor Author

shawnh2 commented May 17, 2024

close this for now, will raise a fix if we have a clue.

@shawnh2 shawnh2 closed this May 17, 2024
@shawnh2 shawnh2 deleted the fix-e2e-envoyshutdown branch May 17, 2024 00:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

flaky: e2e TestEGUpgrade/EnvoyShutdown
5 participants