
Decrease lock contention in SingleThreadQueueExtent #2594

Merged
merged 4 commits into main from perf/avoid-chunk-locking on Mar 4, 2024

Conversation

SirYwell
Member

Overview

Fixes #2590

Description

This change addresses multiple smaller issues that together cause a performance problem and race conditions:

  1. When, for example, a //gmask is set, the MaskingExtent will access blocks.
  2. Before this change, all those accesses went through one SingleThreadQueueExtent instance created here:
    super(handler.getQueue(world, new BatchProcessorHolder(), new BatchProcessorHolder()));
  3. This results in many threads competing for the lock in SingleThreadQueueExtent.

By introducing a ThreadLocalPassthroughExtent, we can provide the threads of the ParallelQueueExtent with enough context to use their own SingleThreadQueueExtent. Additionally, we have the Blocking Executor thread pool that also wants to access chunk data. As tasks running on those threads are called from ChunkHolder, we can statically provide the context for the running task. This is a bit dirty, but I think it is worth it!
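As a minimal sketch of the per-thread delegation idea (simplified, hypothetical interfaces rather than the actual FAWE classes, and using a ThreadLocal as one possible storage mechanism; see the review discussion below):

```java
import java.util.function.Supplier;

// Simplified stand-in for FAWE's Extent interface (hypothetical).
interface Extent {
    int getBlock(int x, int y, int z);
}

// Sketch: each thread lazily gets its own delegate (e.g. its own
// SingleThreadQueueExtent), so block reads no longer contend on the
// lock of one shared instance.
final class ThreadLocalPassthroughExtent implements Extent {
    private final ThreadLocal<Extent> delegate;

    ThreadLocalPassthroughExtent(Supplier<Extent> perThreadFactory) {
        this.delegate = ThreadLocal.withInitial(perThreadFactory);
    }

    @Override
    public int getBlock(int x, int y, int z) {
        return delegate.get().getBlock(x, y, z);
    }

    // For external threads (e.g. the blocking executor): run a task with an
    // explicitly provided context, mirroring the "statically provide the
    // context for the running task" idea from the description.
    void runWith(Extent context, Runnable task) {
        delegate.set(context);
        try {
            task.run();
        } finally {
            delegate.remove(); // don't leak the context to pooled threads
        }
    }
}
```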

As a side effect, the changes described above caused #2590 to happen regularly. As a fix, we no longer pool ChunkHolder objects. According to my measurements, only ~1/4 of the ChunkHolder objects were actually reused anyway (and the objects are really small, 56 bytes on typical JVM configurations). By not reusing them, we avoid them ending up in an old generation. This might also have positive effects on GC overhead, but that part is basically impossible to measure.
By not recycling these objects, we don't run into the race condition anymore.
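To illustrate the hazard with a contrived sketch (not FAWE code): if a pooled holder is released while some thread still uses it, the next acquire hands the same instance to a second thread, and the two race on its state. Freshly allocated holders cannot collide this way and typically die young in the GC nursery.

```java
import java.util.concurrent.ConcurrentLinkedQueue;

// Contrived sketch of the pooling hazard; not FAWE code.
final class HolderPool {
    private final ConcurrentLinkedQueue<int[]> pool = new ConcurrentLinkedQueue<>();

    int[] acquire() {
        int[] holder = pool.poll();
        return holder != null ? holder : new int[16]; // reuse or allocate fresh
    }

    void release(int[] holder) {
        // Race: if another thread still holds a reference to `holder`, the
        // next acquire() hands the very same instance to a second user.
        pool.offer(holder);
    }
}
```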

I also noticed that the mask in LocalSession might keep objects of the most recent EditSession in memory, so when remembering a change, we now also clear potential references to that edit session. This further helps keep ChunkHolder objects short-lived.
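For illustration, a hedged sketch of that reference-clearing idea; the names below are hypothetical, not FAWE's actual LocalSession API:

```java
import java.util.function.IntPredicate;

// Hypothetical sketch: a long-lived session whose mask was built against a
// specific edit session and therefore keeps it (and its ChunkHolder
// objects) reachable.
final class LocalSessionSketch {
    private IntPredicate mask; // e.g. a lambda capturing the edit session

    void setMask(IntPredicate mask) {
        this.mask = mask;
    }

    void remember(Object finishedEditSession) {
        // ... store the change set for undo/redo ...
        // Clear potential references to the finished edit session so its
        // ChunkHolder objects stay short-lived and can be collected.
        this.mask = null;
    }
}
```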

Performance

The previously high lock contention resulted in situations where most of the threads in the blocking executor were busy waiting. This cascaded into the Fork Join Pool Primary doing the work of the blocking executor, because the tasks were rejected. As this then caused those threads to compete for the lock too, we ended up waiting most of the time:
[Profiler screenshots: Fork Join Pool Primary (FJPP) and Blocking Executor (BE) threads spend most of their time waiting]

With this change, we can properly use the CPU:
[Profiler screenshots: FJPP and BE threads now spend their time doing actual work]

In my experiments, this brought a 100% speed improvement for medium and large edits.
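For context, here is a small self-contained sketch of the cascade described above, assuming the blocking executor behaves like a bounded pool with a caller-runs style rejection policy (an assumption for illustration, not FAWE's exact setup): once the pool saturates, rejected tasks execute on the submitting Fork Join Pool threads, which then contend for the same lock.

```java
import java.util.concurrent.*;

public class CallerRunsCascade {
    public static void main(String[] args) throws Exception {
        // Assumed setup: small bounded pool + CallerRunsPolicy, standing in
        // for a saturated "blocking executor".
        ExecutorService blocking = new ThreadPoolExecutor(
                2, 2, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(2),
                new ThreadPoolExecutor.CallerRunsPolicy());

        Runnable task = () -> {
            // Once the queue is full, this prints a ForkJoinPool thread name:
            System.out.println("running on " + Thread.currentThread().getName());
            try { Thread.sleep(50); } catch (InterruptedException ignored) { }
        };

        // Submit from the common pool, like parallel edit workers would.
        ForkJoinPool.commonPool().submit(() -> {
            for (int i = 0; i < 10; i++) {
                blocking.execute(task); // rejected tasks run on the caller
            }
        }).get();

        blocking.shutdown();
    }
}
```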

Submitter Checklist

  1. Make sure you are opening from a topic branch (a feature/, fix/, or docs/ branch, on the right side) and not your main branch.
  2. Ensure that the pull request title represents the desired changelog entry.
  3. New public fields and methods are annotated with @since TODO.
  4. I read and followed the contribution guidelines.

@SirYwell SirYwell requested a review from a team as a code owner February 29, 2024 15:24
@IronApollo
Member

Great work.

Any impact to small edits -- negative or positive?

@SirYwell
Member Author

Great work.

Any impact to small edits -- negative or positive?

That's really difficult to measure, but from a quick test it seems to perform mostly the same.

As a note, we currently still generate 2 ChunkHolder objects per chunk. This comes from the fact that when submitting a chunk, we remove it from the STQE before the submission happens:

getChunkLock.lock();
chunks.remove(index, chunk);       // drop the chunk from the queue while holding the lock
getChunkLock.unlock();
V future = submitUnchecked(chunk); // submit only after it has been removed

So when we try to access data from that chunk directly afterwards, we have to load it again. This was basically the case before too (although less deterministically), except that those chunks were loaded through a different STQE instance (the specific one I mentioned in the PR description). We can probably get rid of that somehow, but I haven't come up with a simple solution yet.

@NotMyFault NotMyFault requested a review from a team March 2, 2024 11:02
@NotMyFault NotMyFault added the Enhancement New feature or request label Mar 2, 2024
@dordsor21
Member

dordsor21 commented Mar 2, 2024

fwiw, chunks may not be reused often when it's just one person editing, but on a larger server with up to tens of people editing simultaneously, I wonder how often pooled chunks are used, and whether, in that case, there could be a negative impact on GC/performance due to more ChunkHolders now being generated and requiring GC. I would also assume that the low reuse of pooled chunks is actually caused by the issue with large amounts of lock contention: chunks are submitted but take some time to be processed, meaning the ChunkHolder is not released to the pool again until much of the edit is completed, due to the priority we give to the threads completing the main edit work. I wonder what it would look like if exposing the correct STQE to each thread were implemented in isolation.

One comment: why do we want a separate extent to handle the thread-specific STQE? Given it is intrinsic to correct use of the parallel extent, can we simply add the behaviour there? Also, is there a reason for not using a ThreadLocal to store the STQEs?

@SirYwell
Member Author

SirYwell commented Mar 2, 2024

fwiw, chunks may not be reused often when it's just one person editing, but on a larger server with up to tens of people editing simultaneously, I wonder how often pooled chunks are used, and whether, in that case, there could be a negative impact on GC/performance due to more ChunkHolders now being generated and requiring GC.

I thought about that, but in my measurements, FAWE generated >6 GB of garbage during my edit, while ChunkHolder instances only accounted for 1 MB of that. There is even more garbage generated indirectly by chunk loading, chunk saving, etc. Therefore, I don't think it is worth pooling ChunkHolder objects.

I would also assume that the low reuse of pooled chunks is actually caused by the issue with large amounts of lock contention: chunks are submitted but take some time to be processed, meaning the ChunkHolder is not released to the pool again until much of the edit is completed, due to the priority we give to the threads completing the main edit work.

That might be true, although from what it looks like, the objects were recycled too soon rather than too late.

One comment; why do we want a separate extent to handle the Thread-specific-STQE? Given it is intrinsic to correct use of the parallel extent we can simply add the behaviour there?

Good idea.

Also, is there a reason for not using a ThreadLocal to store the STQEs?

I experimented with something where this was necessary, but I can roll it back I guess.

@dordsor21
Member

I thought about that, but in my measurements, FAWE generated >6 GB of garbage during my edit, while ChunkHolder instances only accounted for 1 MB of that. There is even more garbage generated indirectly by chunk loading, chunk saving, etc. Therefore, I don't think it is worth pooling ChunkHolder objects.

I suppose for the small benefit vs the added complexity it's probably easiest not to pool then, yeah. Minecraft chunks are already so overloaded with objects (esp. during generation).

That might be true, although from what it looks like, the objects were recycled too soon rather than too late.

Doesn't really matter anymore if we're removing pooling anyway, though.

I experimented with something where this was necessary, but I can roll it back I guess.

ThreadLocal should be more efficient as there's no synchronisation/locking involved, and it's not like it removes any "security" for object retention/bleed, as we should always still be uncaching in either implementation.
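A minimal sketch of that suggestion (illustrative names, not the actual implementation): per-thread storage with no locking, plus an explicit uncache step so pooled threads don't retain stale queues.

```java
// Illustrative sketch of per-thread STQE storage; not the actual code.
final class PerThreadQueues {
    static final class Stqe { /* stand-in for SingleThreadQueueExtent */ }

    private static final ThreadLocal<Stqe> QUEUE = ThreadLocal.withInitial(Stqe::new);

    static Stqe current() {
        return QUEUE.get(); // no synchronisation: each thread sees its own value
    }

    static void uncache() {
        QUEUE.remove(); // still needed in either implementation, as noted above
    }
}
```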

@dordsor21 dordsor21 merged commit 1642713 into main Mar 4, 2024
11 checks passed
@dordsor21 dordsor21 deleted the perf/avoid-chunk-locking branch March 4, 2024 06:31