Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Multiple fixes in actors package #7119

Merged
merged 40 commits into from Nov 21, 2023

Conversation

ItalyPaleAle
Copy link
Contributor

@ItalyPaleAle ItalyPaleAle commented Oct 30, 2023

This PR contains some fixes for actors. In order from the most "serious" one to the least:

  1. If the connection to Placement goes down, which can happen because of a Placement failure (initiated by the server) or if the app becomes unhealthy (initiated by the client), deactivate all actors hosted by the current runtime/app. This is because when the placement service doesn't see the app as connected, it will direct actors to other hosts. Without telling the app that the actor has been deactivated, it can cause an actor to be active in 2 separate services.
    • This is now verified through both a unit test and an integration test. In the IT, we shut down the placement service and verify that all hosted actors receive a deactivation signal.
  2. Placement now waits for the first dissemination of tables to report readiness status. This helps actor clients as a follow-up from [release-1.12] Fix race condition in app health and actors/workflow initialization #6972: with that PR, we made all calls to actor methods blocking if the actor runtime wasn't ready. However, if the runtime is ready, but placement tables haven't been received yet, attempts to look up actors will fail, causing actor invocation to fail.
    • This is verified through unit tests. Sadly, writing an IT for this isn't possible at this stage, as it would require creating a "mock" placement where we can control when tables are disseminated.
  3. Related to the above, if the connection to Placement drops (because of a network failure or because the app becomes unhealthy), the placement tables are purged from in-memory. This is because otherwise when the service comes back online, actor invocations would use stale placement data. Now, they too have to wait for the first dissemination to happen.
    • Covered by the same tests as above
  4. Fixed an issue in the unit tests of workflow HTTP APIs that caused them to take a very long time because they were blocked on waiting for the actor runtime's initialization (issue impacted the tests only)
  5. Improved the shutdown sequence of the actors object, to better report errors in the Close method.

1. Deactivate all actors in the app if the placement disconnects
2. Actor object now records the idle time rather than the last activation time
3. Remove actors from the active actors table even if deactivation fails
4. Improved shutdown of actors object

Signed-off-by: ItalyPaleAle <43508+ItalyPaleAle@users.noreply.github.com>
…ting readiness

Signed-off-by: ItalyPaleAle <43508+ItalyPaleAle@users.noreply.github.com>
Signed-off-by: ItalyPaleAle <43508+ItalyPaleAle@users.noreply.github.com>
Signed-off-by: ItalyPaleAle <43508+ItalyPaleAle@users.noreply.github.com>
Signed-off-by: ItalyPaleAle <43508+ItalyPaleAle@users.noreply.github.com>
Signed-off-by: ItalyPaleAle <43508+ItalyPaleAle@users.noreply.github.com>
@ItalyPaleAle ItalyPaleAle requested review from a team as code owners October 30, 2023 23:41
@ItalyPaleAle
Copy link
Contributor Author

/ok-to-test

@ItalyPaleAle
Copy link
Contributor Author

/test-sdk-all

@dapr-bot
Copy link
Collaborator

dapr-bot commented Oct 30, 2023

Dapr E2E test

🔗 Link to Action run

Commit ref: 1488faa

✅ Build succeeded for linux/amd64

  • Image tag: dapre2e822ca69582l
  • Test image tag: dapre2e822ca69582l

✅ Infrastructure deployed

Cluster Resource group name Azure region
Linux Dapr-E2E-dapre2e822ca69582l westus3
Windows Dapr-E2E-dapre2e822ca69582w westus3
Linux/arm64 Dapr-E2E-dapre2e822ca69582la eastus

✅ Build succeeded for windows/amd64

  • Image tag: dapre2e822ca69582w
  • Test image tag: dapre2e822ca69582w

✅ Tests succeeded on windows/amd64

  • Image tag: dapre2e822ca69582w
  • Test image tag: dapre2e822ca69582w

✅ Tests succeeded on linux/amd64

  • Image tag: dapre2e822ca69582l
  • Test image tag: dapre2e822ca69582l

@dapr-bot
Copy link
Collaborator

dapr-bot commented Oct 30, 2023

Dapr SDK Python test

🔗 Link to Action run

Commit ref: 1488faa

✅ Python SDK tests passed

@dapr-bot
Copy link
Collaborator

dapr-bot commented Oct 30, 2023

Dapr SDK Go test

🔗 Link to Action run

Commit ref: 1488faa

✅ Go SDK tests passed

@dapr-bot
Copy link
Collaborator

dapr-bot commented Oct 30, 2023

Dapr SDK Java test

🔗 Link to Action run

Commit ref: 1488faa

❌ Java SDK tests failed

Please check the logs for details on the error.

@dapr-bot
Copy link
Collaborator

dapr-bot commented Oct 30, 2023

Dapr SDK JS test

🔗 Link to Action run

Commit ref: 1488faa

❌ JS SDK tests failed

Please check the logs for details on the error.

Signed-off-by: ItalyPaleAle <43508+ItalyPaleAle@users.noreply.github.com>
Signed-off-by: Alessandro (Ale) Segala <43508+ItalyPaleAle@users.noreply.github.com>
@codecov
Copy link

codecov bot commented Oct 31, 2023

Codecov Report

Attention: 63 lines in your changes are missing coverage. Please review.

Comparison is base (70f5fd6) 65.00% compared to head (4f2ebdb) 64.86%.

Files Patch % Lines
pkg/actors/actors.go 40.00% 40 Missing and 2 partials ⚠️
pkg/actors/placement/placement.go 64.40% 17 Missing and 4 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #7119      +/-   ##
==========================================
- Coverage   65.00%   64.86%   -0.15%     
==========================================
  Files         221      221              
  Lines       20924    21003      +79     
==========================================
+ Hits        13602    13623      +21     
- Misses       6177     6225      +48     
- Partials     1145     1155      +10     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Contributor

@JoshVanL JoshVanL left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First pass from me 🙂

pkg/actors/actors.go Show resolved Hide resolved
pkg/actors/actors.go Show resolved Hide resolved
pkg/actors/actors.go Show resolved Hide resolved
pkg/actors/actors.go Show resolved Hide resolved
pkg/actors/placement/placement.go Show resolved Hide resolved
Signed-off-by: ItalyPaleAle <43508+ItalyPaleAle@users.noreply.github.com>
@ItalyPaleAle ItalyPaleAle added the autoupdate DaprBot will keep the Pull Request up to date with master branch label Nov 1, 2023
@yaron2
Copy link
Member

yaron2 commented Nov 16, 2023

@JoshVanL please review again :)

@JoshVanL
Copy link
Contributor

@yaron2 along with https://github.com/dapr/dapr/pull/7119/files#diff-7ec21ca80e96c8603d65ab85ff72d4e3ef479f9cff11eb09f993182e1e91ce3a I think we need another int tests which checks that an actor is deactivated when the app reports unhealthy. This is contingent on #7022 and #6216

…ctivation-refactorings

Signed-off-by: ItalyPaleAle <43508+ItalyPaleAle@users.noreply.github.com>
@ItalyPaleAle
Copy link
Contributor Author

I have fixed the merge conflicts. Is there anything else blocking this PR?

Feedback on adding a new test is valid but not actionable at this stage.

@JoshVanL
Copy link
Contributor

Yes, I think we should block until we have full test coverage for this, i.e. the app unhealthy actor deregister.

@ItalyPaleAle
Copy link
Contributor Author

Can it be done in a subsequent PR?

@yaron2 yaron2 merged commit 6d000be into dapr:master Nov 21, 2023
19 of 22 checks passed
@yaron2
Copy link
Member

yaron2 commented Nov 21, 2023

Can it be done in a subsequent PR?

Yes, please follow up on this @ItalyPaleAle

@ItalyPaleAle ItalyPaleAle deleted the actors-deactivation-refactorings branch November 21, 2023 18:54
@ItalyPaleAle
Copy link
Contributor Author

@JoshVanL I was looking into adding a test based on your PR. Your PR reduces the healthcheck interval, but still requires 4 failed healthchecks, each on a 5s interval, for an app to be unhealthy. That makes the test still take at least 30s. Is there a way to make that faster?

@JoshVanL
Copy link
Contributor

@ItalyPaleAle #7119 (comment)

Yes, I think the current settings don't make much sense for even a "real" scenario and should be changed.

@ItalyPaleAle
Copy link
Contributor Author

@ItalyPaleAle #7119 (comment)

Yes, I think the current settings don't make much sense for even a "real" scenario and should be changed.

Aside from "real" scenarios, perhaps there could be a way to tweak that (not sure how) so tests can set even lower values

@JoshVanL
Copy link
Contributor

@ItalyPaleAle Unifying the health checks would make them configurable with the same CLI flags and therefore more testable.

@ItalyPaleAle
Copy link
Contributor Author

@ItalyPaleAle Unifying the health checks would make them configurable with the same CLI flags and therefore more testable.

Makes sense. Will wait for that then

@ItalyPaleAle ItalyPaleAle added this to the v1.13 milestone Jan 31, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
autoupdate DaprBot will keep the Pull Request up to date with master branch
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants