
Error during Dapr Shutdown and Message Polling Behavior Issue (AWS) #3156

Closed
tomatbay opened this issue Sep 27, 2023 · 6 comments · Fixed by #3174
Labels: kind/bug Something isn't working

Comments

@tomatbay

Expected Behavior

As the number of subscriptions in Dapr increases, I expect the following behavior:

Dapr should only poll messages from subscriptions that have completed their registration process, ensuring no errors related to incomplete registration occur.
During shutdown, Dapr should gracefully stop polling messages and complete the shutdown process without errors.

Actual Behavior

The actual behavior observed is as follows:

When Dapr initializes during startup with multiple subscriptions, it begins polling messages starting from the first subscription in the list. However, this polling can pick up messages for subscriptions that have not yet completed their registration process, which leads to errors.

Occasionally, during shutdown, Dapr continues attempting to poll messages even after the associated topics have been unregistered. This leads to the same message-handling error, with the following error message:

error while handling received message. error is: handler for topic (sanitized):

Steps to Reproduce the Problem

To reproduce this problem, follow these steps:

  1. Set up Dapr with multiple subscriptions.

  2. Populate the SQS (Simple Queue Service) with a variety of messages.

  3. Launch Dapr and monitor its startup behavior. Note that Dapr may begin polling messages for subscriptions that have not finished their registration, resulting in errors.

  4. On occasion, during the shutdown process, observe that Dapr may persist in attempting to poll messages even after the relevant topics have been unregistered; this can trigger the aforementioned error message.

  • Please note that the issue is intermittent and may not occur consistently during every run.
@tomatbay tomatbay added the kind/bug Something isn't working label Sep 27, 2023
@tomatbay
Author

/assign @amimimor

@tomatbay tomatbay reopened this Sep 27, 2023
@tomatbay tomatbay removed their assignment Sep 27, 2023
@tomatbay
Author

@maintainers
could you please assign @amimimor to this bug?

@amimimor
Contributor

I'd like to highlight a potential additional issue with the current implementation of the subscription feature. Currently, the mapping between topics and handlers is kept in a standard library map protected against race conditions by a shared lock (a sync.RWMutex): readers take the RLock when messages arrive from SQS and need to be dispatched to the appropriate handler, while writers take the full Lock when new subscriptions are created within the component, to prevent stale reads.

However, this approach has an unintended consequence: the subscription consumption logic, which holds the RLock while dispatching, effectively blocks the logic responsible for registering new subscriptions and their handlers. This becomes particularly noticeable during system boot-up, especially when the SQS queue is already filled with messages for topics that have yet to be mapped. You can visualize this race condition by checking out this link.

To address this, we propose using a thread-safe, concurrent Go map. This would keep write operations from being blocked by ongoing subscription consumption, reducing the time between system boot-up and correct message handling. You can visualize the suggested solution by checking out this link.
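
For illustration, here is a minimal sketch in Go of the suggested change, using hypothetical names (subscriptions, addSubscription, dispatch) rather than the component's actual types: the topic-to-handler mapping lives in a sync.Map, so registering a new subscription never waits on in-flight dispatch that would otherwise hold a read lock.

```go
// Minimal sketch only: names and structure are illustrative, not the
// component's real code. A sync.Map stands in for the proposed concurrent map.
package main

import (
	"fmt"
	"sync"
)

type handler func(msg string) error

type subscriptions struct {
	handlers sync.Map // topic (string) -> handler
}

// addSubscription registers a handler for a topic; it never blocks on readers.
func (s *subscriptions) addSubscription(topic string, h handler) {
	s.handlers.Store(topic, h)
}

// dispatch routes an incoming SQS message to the handler registered for its topic.
func (s *subscriptions) dispatch(topic, msg string) error {
	v, ok := s.handlers.Load(topic)
	if !ok {
		// This is the situation described in the issue: a message arrives for a
		// topic whose handler is not (yet, or no longer) registered.
		return fmt.Errorf("handler for topic %s not found", topic)
	}
	return v.(handler)(msg)
}

func main() {
	subs := &subscriptions{}
	subs.addSubscription("orders", func(msg string) error {
		fmt.Println("handled:", msg)
		return nil
	})

	_ = subs.dispatch("orders", "order-created")               // handled
	fmt.Println(subs.dispatch("payments", "payment-created")) // error: topic not mapped yet
}
```

A lock-free read path like this is one option; keeping the map but shortening the critical section (for example, releasing the lock before invoking the handler) would have a similar effect.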

@yaron2
Member

yaron2 commented Sep 27, 2023

A PR for this change would be greatly appreciated

@amimimor
Contributor

amimimor commented Oct 8, 2023

/assign @amimimor

@amimimor
Contributor

amimimor commented Oct 8, 2023

Before submitting the PR for this issue, I'd like to highlight a few key points:

  • The issue mentioned by @tomatbay is partly due to the sns-sqs topology in AWS exhibiting a fan-in pattern.
  • This implies that multiple publisher topics can be mapped to a single SQS queue.
  • Consequently, regardless of the disableEntityManagement setting, Subscribe calls to the component can result in receiving messages for topics that haven't yet been "registered" with the component by the control plane (component subscription vs. AWS infrastructure subscription). The SQS consumers (i.e., subscribers) may then process messages for a topic the component isn't prepared to handle (a race condition where the infrastructure prevails over the component).
  • Similarly, when unsubscribing from a topic (as is done in the component's shutdown procedure), removal of the infrastructure subscription might occur after the component has already removed the associated topic handler (a race condition where the component prevails over the infrastructure).

Given the intrinsic nature of the sns-sqs topology, I suggest the following steps:

  • Detail these scenarios in the documentation.
  • Permit, particularly during the component's initialization, an option (active only when disableEntityManagement is set to true) that ensures SQS consumption starts only once the number of topics with registered handlers matches the number of infrastructure subscriptions.
  • This can be achieved by having the component count the infrastructure subscriptions attached to its designated queue, then wait until the control plane has issued the Subscribe call for each of those identified subscriptions (see the sketch after this list).
  • Clearly, this calls for further detail, which would be provided in the follow-up issue we'd open.
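
As a rough illustration of the gating idea (not the component's actual code), here is a sketch in Go. The expected count is assumed to come from listing the SNS subscriptions attached to the component's queue; polling is simply held back until the control plane has called Subscribe for every one of them.

```go
// Minimal sketch of gating SQS consumption until all expected subscriptions
// have handlers. Names (gate, subscribe, waitAndPoll) and the way the expected
// count is obtained are illustrative assumptions, not the component's real API.
package main

import (
	"fmt"
	"sync"
)

type gate struct {
	mu       sync.Mutex
	expected int                               // infrastructure subscriptions counted on the queue
	ready    chan struct{}                     // closed once all handlers are registered
	handlers map[string]func(msg string) error // topic -> handler
}

func newGate(expected int) *gate {
	return &gate{
		expected: expected,
		ready:    make(chan struct{}),
		handlers: make(map[string]func(msg string) error),
	}
}

// subscribe registers a topic handler; once the number of registered handlers
// matches the expected number of infrastructure subscriptions, polling is released.
func (g *gate) subscribe(topic string, h func(msg string) error) {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.handlers[topic] = h
	if len(g.handlers) == g.expected {
		close(g.ready) // open the gate exactly once
	}
}

// waitAndPoll blocks until the gate is open; only then would SQS consumption start.
func (g *gate) waitAndPoll() {
	<-g.ready
	fmt.Println("all handlers registered, starting SQS consumption")
}

func main() {
	// Example: two SNS topics fan in to a single SQS queue.
	g := newGate(2)

	var wg sync.WaitGroup
	for _, topic := range []string{"orders", "payments"} {
		wg.Add(1)
		go func(t string) {
			defer wg.Done()
			g.subscribe(t, func(msg string) error { return nil })
		}(topic)
	}
	wg.Wait()

	g.waitAndPoll() // returns immediately because both Subscribe calls have arrived
}
```

A complete fix would presumably also need to address the shutdown race described above (stopping polling before handlers are removed), which is part of the detail deferred to the follow-up issue.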
