calico-kube-controllers fix concurrent map writes issue #8706

Draft
wants to merge 4 commits into
base: release-v3.26

Conversation

zamog

@zamog zamog commented Apr 9, 2024

Description

Add a mutex lock to the Set function to prevent a race condition that causes the process to panic with "fatal error: concurrent map read and map write".
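For readers unfamiliar with the pattern, here is a minimal sketch of a Set method guarded by a mutex. The `lockedCache` type, its fields, and the `fillConcurrently` helper are hypothetical and only illustrate the idea; they are not the actual calicoCache code from this PR.

```go
package main

import (
	"fmt"
	"sync"
)

// lockedCache is a hypothetical stand-in for the cache touched by this PR.
type lockedCache struct {
	mu    sync.Mutex
	items map[string]interface{}
}

func newLockedCache() *lockedCache {
	return &lockedCache{items: make(map[string]interface{})}
}

// Set stores newObj under key while holding the mutex, so two
// goroutines can never write to the underlying map at the same time.
func (c *lockedCache) Set(key string, newObj interface{}) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.items[key] = newObj
}

// Len reads the map under the same mutex; reads need the lock too.
func (c *lockedCache) Len() int {
	c.mu.Lock()
	defer c.mu.Unlock()
	return len(c.items)
}

// fillConcurrently calls Set from n goroutines and returns the final size.
func fillConcurrently(c *lockedCache, n int) int {
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			c.Set(fmt.Sprintf("key-%d", i), i)
		}(i)
	}
	wg.Wait()
	return c.Len()
}

func main() {
	fmt.Println(fillConcurrently(newLockedCache(), 100))
}
```

Note that the lock only helps if every reader and writer of the map takes it; code that touches the same map without the mutex can still race.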

Related issues/PRs

Fixes #8705

Todos

  • Tests
  • Documentation
  • Release note

Release Note

TBD

Reminder for the reviewer

Make sure that this PR has the correct labels and milestone set.

Every PR needs one docs-* label.

  • docs-pr-required: This change requires a change to the documentation that has not been completed yet.
  • docs-completed: This change has all necessary documentation completed.
  • docs-not-required: This change has no user-facing impact and requires no docs.

Every PR needs one release-note-* label.

  • release-note-required: This PR has user-facing changes. Most PRs should have this label.
  • release-note-not-required: This PR has no user-facing changes.

Other optional labels:

  • cherry-pick-candidate: This PR should be cherry-picked to an earlier release. For bug fixes only.
  • needs-operator-pr: This PR is related to install and requires a corresponding change to the operator.

@zamog zamog requested a review from a team as a code owner April 9, 2024 18:53
@marvin-tigera marvin-tigera added this to the Calico v3.25.3 milestone Apr 9, 2024
@marvin-tigera marvin-tigera added release-note-required Change has user-facing impact (no matter how small) docs-pr-required Change is not yet documented labels Apr 9, 2024
@CLAassistant

CLAassistant commented Apr 9, 2024

CLA assistant check
All committers have signed the CLA.


netlify bot commented Apr 9, 2024

Deploy Preview for calico-v3-25 canceled.

Name Link
🔨 Latest commit a726a60
🔍 Latest deploy log https://app.netlify.com/sites/calico-v3-25/deploys/66158e9e3486240008f42c37

@zamog zamog changed the base branch from release-v3.25 to release-v3.26 April 9, 2024 18:57
@zamog zamog marked this pull request as draft April 9, 2024 19:13
@zamog zamog marked this pull request as ready for review April 9, 2024 19:18
@lwr20
Member

lwr20 commented Apr 10, 2024

/sem-approve

@aaaaaaaalex
Contributor

aaaaaaaalex commented Apr 10, 2024

Thanks for the PR.

Could you show where the conflicting Write is happening on the resource?

If I understand the issue & panic correctly, the panic is generated by the reflect package performing a read, during a concurrent write, on the WorkloadEndpointData item (specifically the struct's labels map, I think).

If the concurrent write is also occurring in the same Set method, your lock should work.

Do we know what component is writing to that labels map while reflect is reading it? And can we confidently say that synchronising calls of the Set method removes that concurrency?
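To make the question above concrete: a lock removes the race only if the reflect-based read and the write to the labels map are both serialized by the same mutex. The sketch below shows that discipline under stated assumptions; the `endpoint` and `guarded` types and the `exercise` helper are hypothetical and do not reflect the real WorkloadEndpointData or cache code.

```go
package main

import (
	"fmt"
	"reflect"
	"sync"
)

// endpoint is a hypothetical stand-in for the cached resource;
// only its labels map matters for this illustration.
type endpoint struct {
	Labels map[string]string
}

// guarded pairs one object with the single mutex that protects it.
type guarded struct {
	mu  sync.Mutex
	obj endpoint
}

// Changed compares the stored object with candidate. reflect.DeepEqual
// walks the labels map, so the comparison runs under the mutex.
func (g *guarded) Changed(candidate endpoint) bool {
	g.mu.Lock()
	defer g.mu.Unlock()
	return !reflect.DeepEqual(g.obj, candidate)
}

// SetLabel writes to the labels map under the same mutex as Changed.
func (g *guarded) SetLabel(k, v string) {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.obj.Labels[k] = v
}

// exercise runs n writers and n readers concurrently and returns
// the number of labels stored at the end.
func exercise(n int) int {
	g := &guarded{obj: endpoint{Labels: map[string]string{}}}
	candidate := endpoint{Labels: map[string]string{"app": "demo"}}
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(2)
		go func(i int) {
			defer wg.Done()
			g.SetLabel(fmt.Sprintf("k%d", i), "v")
		}(i)
		go func() {
			defer wg.Done()
			g.Changed(candidate)
		}()
	}
	wg.Wait()
	g.mu.Lock()
	defer g.mu.Unlock()
	return len(g.obj.Labels)
}

func main() {
	fmt.Println(exercise(100))
}
```

If the writer lives in another component that never takes this mutex, locking inside Set alone would not remove the concurrency; running the reproduction with Go's race detector (`go test -race`) is one way to locate that writer.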

On a side note, could you submit the fix as a PR against the master branch rather than directly against the release branch, so that we don't regress in future releases? We can then backport the master patch to 3.26.

@zamog zamog marked this pull request as draft April 10, 2024 17:17
@@ -126,7 +126,9 @@ func (c *calicoCache) Set(key string, newObj interface{}) {
	if reflect.TypeOf(newObj) != c.ObjectType {
		c.log.Fatalf("Wrong object type received to store in cache. Expected: %s, Found: %s", c.ObjectType, reflect.TypeOf(newObj))
	}

	// lock the cache
Member


Something isn't quite adding up to me here - the underlying map is within the threadSafeCache instance, which IIUC should already handle this.

Do you know why we need this second mutex?
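One possible explanation, offered as an assumption rather than a diagnosis: a thread-safe container protects its own map, but not a map that lives inside a stored value and is still referenced by the caller. The sketch below uses a hypothetical `labelStore` (not the real threadSafeCache) to show the aliasing, and how storing a private copy avoids it.

```go
package main

import (
	"fmt"
	"sync"
)

// labelStore is a hypothetical thread-safe store, not the real threadSafeCache.
type labelStore struct {
	mu    sync.RWMutex
	items map[string]map[string]string
}

func newLabelStore() *labelStore {
	return &labelStore{items: make(map[string]map[string]string)}
}

// SetShared stores the caller's map as-is. The store's own map is
// guarded, but the stored value is still shared with the caller.
func (s *labelStore) SetShared(key string, labels map[string]string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.items[key] = labels
}

// SetCopy stores a private copy, so later mutations by the caller do
// not touch cached data. The caller must not mutate labels while the
// copy is being made.
func (s *labelStore) SetCopy(key string, labels map[string]string) {
	cp := make(map[string]string, len(labels))
	for k, v := range labels {
		cp[k] = v
	}
	s.mu.Lock()
	defer s.mu.Unlock()
	s.items[key] = cp
}

// Label reads one label of a stored entry under the read lock.
func (s *labelStore) Label(key, name string) string {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return s.items[key][name]
}

func main() {
	s := newLabelStore()
	labels := map[string]string{"app": "v1"}
	s.SetShared("shared", labels)
	s.SetCopy("copied", labels)
	labels["app"] = "v2" // mutation outside the store's lock
	fmt.Println(s.Label("shared", "app"), s.Label("copied", "app")) // prints "v2 v1"
}
```

In the shared case the write to `labels` happens without any lock the store knows about, so a concurrent read of the cached entry would race no matter how many mutexes wrap Set.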

Labels
docs-pr-required Change is not yet documented release-note-required Change has user-facing impact (no matter how small)

6 participants