xds: Fix WeakReference bug in SharedCallCounterMap #8466

Merged
merged 6 commits into from Sep 2, 2021

@dapengzhang0 dapengzhang0 commented Sep 1, 2021

Fixes #8397.
#8397 is caused by mistakenly removing a map entry right after that entry has been recreated following a GC. The regression test reproduces it.

(SharedCallCounterMap is hard to read and can easily be a source of bugs. It's impossible to test the class sufficiently with unit tests, because GC can happen at any time, concurrently with the method under test. I'm not 100% confident about the correctness of the fix. If possible, I would avoid using WeakReference in the first place.)
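
For readers skimming the description, here is a minimal sketch of the failure mode as I understand it, assuming the cleanup removed entries by key alone. The class and method names are hypothetical and this is not the SharedCallCounterMap code; it only shows why a late cleanup for a stale reference can delete a freshly recreated entry, and how the two-argument `Map.remove(key, value)` avoids that.

```java
import java.lang.ref.WeakReference;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical demo, not the grpc-java code. */
final class StaleRemovalDemo {
  /** Returns true if the recreated entry survives cleanup of the stale reference. */
  static boolean recreatedEntrySurvives(boolean conditionalRemove) {
    Map<String, WeakReference<Object>> map = new HashMap<>();
    Object oldValue = new Object();
    Object newValue = new Object();

    WeakReference<Object> stale = new WeakReference<>(oldValue);
    map.put("k", stale);
    stale.clear(); // stands in for the GC clearing the referent

    // A caller sees the cleared reference and recreates the entry.
    WeakReference<Object> fresh = new WeakReference<>(newValue);
    map.put("k", fresh);

    // Cleanup for the stale reference only runs now.
    if (conditionalRemove) {
      map.remove("k", stale); // removes only if "k" still maps to stale
    } else {
      map.remove("k"); // removes whatever "k" maps to, including fresh
    }
    return map.get("k") == fresh;
  }

  public static void main(String[] args) {
    System.out.println("remove by key alone:       " + recreatedEntrySurvives(false));
    System.out.println("remove by key and value:   " + recreatedEntrySurvives(true));
  }
}
```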

@@ -73,6 +73,9 @@ void cleanQueue() {
CounterReference ref;
while ((ref = (CounterReference) refQueue.poll()) != null) {
Map<String, CounterReference> clusterCounter = counters.get(ref.cluster);
Member Author

I'm guessing clusterCounter shouldn't be null, because refs should be enqueued in the same order in which the garbage collector clears their referents. But I did not see the Javadoc say that explicitly.

Is there any risk of an NPE in an extreme race like the following?
ref1.referent cleared by GC => ref2 created and put in the counters map => ref2.referent cleared by GC => ref2 enqueued => ref1 enqueued.

Member

With C1 it doesn't seem too far-fetched, especially if enqueuing is a separate stage of the process from clearing. I doubt it would actually happen, but it seems fair to consider.

A simple solution for that is to call ref.enqueue() if ref.get() == null, before replacing the reference.

Member Author

Done. TIL thanks.
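
A rough sketch of the suggestion in this thread, under simplifying assumptions: a single-level map instead of the cluster/EDS-service nesting, and hypothetical names (`WeakCounterCache`, `KeyedRef`) that do not appear in the PR. A reference found cleared is enqueued before it is replaced, so it cannot reach the queue after its replacement, and cleanup removes an entry only if the map still holds that exact reference.

```java
import java.lang.ref.ReferenceQueue;
import java.lang.ref.WeakReference;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical, simplified stand-in for a weakly referenced counter map. */
final class WeakCounterCache {
  /** Weak reference that remembers the key it was stored under. */
  static final class KeyedRef extends WeakReference<AtomicLong> {
    final String key;

    KeyedRef(AtomicLong referent, ReferenceQueue<AtomicLong> queue, String key) {
      super(referent, queue);
      this.key = key;
    }
  }

  final ReferenceQueue<AtomicLong> queue = new ReferenceQueue<>();
  final Map<String, KeyedRef> refs = new HashMap<>();

  synchronized AtomicLong getOrCreate(String key) {
    KeyedRef old = refs.get(key);
    AtomicLong counter = (old == null) ? null : old.get();
    if (counter == null) {
      if (old != null) {
        // Cleared but possibly not enqueued yet: enqueue it now, before the
        // replacement exists. Returns false and does nothing if it is
        // already on the queue.
        old.enqueue();
      }
      counter = new AtomicLong();
      refs.put(key, new KeyedRef(counter, queue, key));
    }
    cleanQueue();
    return counter;
  }

  synchronized void cleanQueue() {
    KeyedRef ref;
    while ((ref = (KeyedRef) queue.poll()) != null) {
      // Remove only if the map still holds this exact reference, so a
      // recreated entry under the same key is left alone.
      refs.remove(ref.key, ref);
    }
  }
}
```

With the nested map in the diff above, the same ordering argument is what keeps `counters.get(ref.cluster)` from returning null for a late-arriving ref1; whether that fully rules out the race is still the open question raised in the description.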

public void gcAndRecreate() {
@SuppressWarnings("UnusedVariable") // assign to null for GC only
AtomicLong counter = map.getOrCreate(CLUSTER, EDS_SERVICE_NAME);
final CounterReference ref = counters.get(CLUSTER).get(EDS_SERVICE_NAME);
Member

FYI, you can call ref.clear() and ref.enqueue() manually instead of relying on GC.
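
To illustrate that suggestion, a small standalone sketch (hypothetical class name, Java 9+ for `Reference.reachabilityFence`) of simulating a collection deterministically: `clear()` makes `get()` return null and `enqueue()` puts the reference on its queue, so a test does not have to wait for `System.gc()` to do either.

```java
import java.lang.ref.Reference;
import java.lang.ref.ReferenceQueue;
import java.lang.ref.WeakReference;

/** Hypothetical demo of driving a WeakReference by hand in a test. */
final class ManualEnqueueDemo {
  /** Returns true if the manually cleared reference comes back from the queue. */
  static boolean simulateCollection() {
    ReferenceQueue<Object> queue = new ReferenceQueue<>();
    Object referent = new Object();
    WeakReference<Object> ref = new WeakReference<>(referent, queue);

    ref.clear(); // get() now returns null, as it would after a real collection
    boolean enqueued = ref.enqueue(); // true: registered with a queue and not yet enqueued
    boolean cleared = ref.get() == null;
    boolean polled = queue.poll() == ref;

    // Keep the referent strongly reachable up to here so a real GC cannot
    // clear and enqueue the reference behind the test's back.
    Reference.reachabilityFence(referent);
    return enqueued && cleared && polled;
  }

  public static void main(String[] args) {
    System.out.println(simulateCollection());
  }
}
```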

@dapengzhang0 dapengzhang0 merged commit 07747c5 into grpc:master Sep 2, 2021
@dapengzhang0 dapengzhang0 deleted the fix-weakref-bug branch September 2, 2021 17:25
dapengzhang0 added a commit to dapengzhang0/grpc-java that referenced this pull request Sep 2, 2021
Fixes grpc#8397.
grpc#8397 is caused by mistakenly clearing up a map entry right after the entry is recreated after gc. Reproduced in regression test.
dapengzhang0 added a commit that referenced this pull request Sep 2, 2021
Fixes #8397.
#8397 is caused by mistakenly clearing up a map entry right after the entry is recreated after gc. Reproduced in regression test.
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 2, 2021
Successfully merging this pull request may close these issues.

io.grpc.xds.SharedCallCounterMap.cleanQueue() NullPointerException