
Decrease lock contention in SingleThreadQueueExtent #2594

Merged
merged 4 commits into main from perf/avoid-chunk-locking on Mar 4, 2024

Conversation

SirYwell
Member

Overview

Fixes #2590

Description

This change addresses multiple smaller issues that together cause a performance problem and race conditions:

  1. When, for example, a //gmask is set, the MaskingExtent will access blocks.
  2. Before this change, all those accesses went through one SingleThreadQueueExtent instance created here:
    super(handler.getQueue(world, new BatchProcessorHolder(), new BatchProcessorHolder()));
  3. This results in many threads competing for the lock in SingleThreadQueueExtent.

By introducing a ThreadLocalPassthroughExtent, we can provide the threads of the ParallelQueueExtent with enough context to use their own SingleThreadQueueExtent. Additionally, we have the Blocking Executor thread pool that also wants to access chunk data. As tasks running on those threads are called from ChunkHolder, we can statically provide the context for the running task. This is a bit dirty, but I think it is worth it!
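As a minimal sketch of the per-thread delegation idea (simplified, hypothetical interfaces rather than the actual FAWE classes, and using a ThreadLocal as one possible storage mechanism; see the review discussion below):

```java
import java.util.function.Supplier;

// Simplified stand-in for FAWE's Extent interface (hypothetical).
interface Extent {
    int getBlock(int x, int y, int z);
}

// Sketch: each thread lazily gets its own delegate (e.g. its own
// SingleThreadQueueExtent), so block reads no longer contend on the
// lock of one shared instance.
final class ThreadLocalPassthroughExtent implements Extent {
    private final ThreadLocal<Extent> delegate;

    ThreadLocalPassthroughExtent(Supplier<Extent> perThreadFactory) {
        this.delegate = ThreadLocal.withInitial(perThreadFactory);
    }

    @Override
    public int getBlock(int x, int y, int z) {
        return delegate.get().getBlock(x, y, z);
    }

    // For external threads (e.g. the blocking executor): run a task with an
    // explicitly provided context, mirroring the "statically provide the
    // context for the running task" idea from the description.
    void runWith(Extent context, Runnable task) {
        delegate.set(context);
        try {
            task.run();
        } finally {
            delegate.remove(); // don't leak the context to pooled threads
        }
    }
}
```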

As a side effect, the changes described above caused #2590 to happen regularly. As a fix, we no longer pool ChunkHolder objects. According to my measurements, only ~1/4 of the ChunkHolder objects were actually reused anyway (and the objects are really small, 56 bytes on typical JVM configurations). By not reusing them, we avoid them ending up in an old generation. This might also have positive effects on GC overhead, but that part is basically impossible to measure.
By not recycling these objects, we don't run into the race condition anymore.
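To illustrate the hazard with a contrived sketch (not FAWE code): if a pooled holder is released while some thread still uses it, the next acquire hands the same instance to a second thread, and the two race on its state. Freshly allocated holders cannot collide this way and typically die young in the GC nursery.

```java
import java.util.concurrent.ConcurrentLinkedQueue;

// Contrived sketch of the pooling hazard; not FAWE code.
final class HolderPool {
    private final ConcurrentLinkedQueue<int[]> pool = new ConcurrentLinkedQueue<>();

    int[] acquire() {
        int[] holder = pool.poll();
        return holder != null ? holder : new int[16]; // reuse or allocate fresh
    }

    void release(int[] holder) {
        // Race: if another thread still holds a reference to `holder`, the
        // next acquire() hands the very same instance to a second user.
        pool.offer(holder);
    }
}
```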

I also noticed that the mask in LocalSession might keep objects of the most recent EditSession in memory, so when remembering a change, we now also clear potential references to that edit session. This further helps keep ChunkHolder objects short-lived.
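For illustration, a hedged sketch of that reference-clearing idea; the names below are hypothetical, not FAWE's actual LocalSession API:

```java
import java.util.function.IntPredicate;

// Hypothetical sketch: a long-lived session whose mask was built against a
// specific edit session and therefore keeps it (and its ChunkHolder
// objects) reachable.
final class LocalSessionSketch {
    private IntPredicate mask; // e.g. a lambda capturing the edit session

    void setMask(IntPredicate mask) {
        this.mask = mask;
    }

    void remember(Object finishedEditSession) {
        // ... store the change set for undo/redo ...
        // Clear potential references to the finished edit session so its
        // ChunkHolder objects stay short-lived and can be collected.
        this.mask = null;
    }
}
```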

Performance

The previously high lock contention resulted in situations where most of the threads in the blocking executor were busy waiting. This cascaded into the Fork Join Pool Primary doing the work of the blocking executor, because the tasks were rejected. As this then caused those threads to compete for the lock too, we ended up waiting most of the time:
[Profiler screenshots: Fork Join Pool Primary (FJPP) and Blocking Executor (BE) threads spend most of their time waiting]

With this change, we can properly use the CPU:
[Profiler screenshots: FJPP and BE threads now spend their time doing actual work]

In my experiments, this brought a 100% speed improvement for medium and large edits.
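For context, here is a small self-contained sketch of the cascade described above, assuming the blocking executor behaves like a bounded pool with a caller-runs style rejection policy (an assumption for illustration, not FAWE's exact setup): once the pool saturates, rejected tasks execute on the submitting Fork Join Pool threads, which then contend for the same lock.

```java
import java.util.concurrent.*;

public class CallerRunsCascade {
    public static void main(String[] args) throws Exception {
        // Assumed setup: small bounded pool + CallerRunsPolicy, standing in
        // for a saturated "blocking executor".
        ExecutorService blocking = new ThreadPoolExecutor(
                2, 2, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(2),
                new ThreadPoolExecutor.CallerRunsPolicy());

        Runnable task = () -> {
            // Once the queue is full, this prints a ForkJoinPool thread name:
            System.out.println("running on " + Thread.currentThread().getName());
            try { Thread.sleep(50); } catch (InterruptedException ignored) { }
        };

        // Submit from the common pool, like parallel edit workers would.
        ForkJoinPool.commonPool().submit(() -> {
            for (int i = 0; i < 10; i++) {
                blocking.execute(task); // rejected tasks run on the caller
            }
        }).get();

        blocking.shutdown();
    }
}
```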

Submitter Checklist

  1. Make sure you are opening from a topic branch (a feature/, fix/, or docs/ branch, on the right side) and not your main branch.
  2. Ensure that the pull request title represents the desired changelog entry.
  3. New public fields and methods are annotated with @since TODO.
  4. I read and followed the contribution guidelines.

@SirYwell SirYwell requested a review from a team as a code owner February 29, 2024 15:24
@IronApollo
Member

Great work.

Any impact to small edits -- negative or positive?

@SirYwell
Member Author

Great work.

Any impact to small edits -- negative or positive?

That's really difficult to measure, but from a quick test it seems to perform mostly the same.

As a note, we currently still generate 2 ChunkHolder objects per chunk. This comes from the fact that when submitting a chunk, we remove it from the STQE before the submission happens:

getChunkLock.lock();
chunks.remove(index, chunk);       // drop the chunk from the queue while holding the lock
getChunkLock.unlock();
V future = submitUnchecked(chunk); // submit only after it has been removed

So when we try to access data from that chunk directly afterwards, we have to load it again. This was basically the case before too (although less deterministically), except that those chunks were loaded through a different STQE instance (the specific one I mentioned in the PR description). We can probably get rid of that somehow, but I haven't come up with a simple solution yet.

@NotMyFault NotMyFault requested a review from a team March 2, 2024 11:02
@NotMyFault NotMyFault added the Enhancement New feature or request label Mar 2, 2024
@dordsor21
Member

dordsor21 commented Mar 2, 2024

fwiw, chunks may not be reused often when it's just one person editing, but on a larger server with up to tens of people editing simultaneously, I wonder how often pooled chunks are used, and whether, in that case, there could be a negative impact on GC/performance due to more ChunkHolders now being generated and requiring GC. I would also assume that the low reuse of pooled chunks is actually caused by the issue with large amounts of lock contention: chunks are submitted but take some time to be processed, meaning the ChunkHolder is not released to the pool again until much of the edit is completed, due to the priority we give to the threads completing the main edit work. I wonder what it would look like if exposing the correct STQE to each thread were implemented in isolation.

One comment: why do we want a separate extent to handle the thread-specific STQE? Given it is intrinsic to correct use of the parallel extent, can we simply add the behaviour there? Also, is there a reason for not using a ThreadLocal to store the STQEs?

@SirYwell
Member Author

SirYwell commented Mar 2, 2024

fwiw, chunks may not be reused often when it's just one person editing, but on a larger server with up to tens of people editing simultaneously, I wonder how often pooled chunks are used, and whether, in that case, there could be a negative impact on GC/performance due to more ChunkHolders now being generated and requiring GC.

I thought about that, but in my measurements, FAWE generated >6 GB of garbage during my edit, while ChunkHolder instances only accounted for 1 MB of that. There is even more garbage generated indirectly by chunk loading, chunk saving, etc. Therefore, I don't think it is worth pooling ChunkHolder objects.

I would also assume that the low reuse of pooled chunks is actually caused by the issue with large amounts of lock contention: chunks are submitted but take some time to be processed, meaning the ChunkHolder is not released to the pool again until much of the edit is completed, due to the priority we give to the threads completing the main edit work.

That might be true, although from what it looks like, the objects were recycled too soon rather than too late.

One comment; why do we want a separate extent to handle the Thread-specific-STQE? Given it is intrinsic to correct use of the parallel extent we can simply add the behaviour there?

Good idea.

Also, is there a reason for not using a ThreadLocal to store the STQEs?

I experimented with something where this was necessary, but I can roll it back I guess.

@dordsor21
Member

I thought about that, but in my measurements, FAWE generated >6 GB of garbage during my edit, while ChunkHolder instances only accounted for 1 MB of that. There is even more garbage generated indirectly by chunk loading, chunk saving, etc. Therefore, I don't think it is worth pooling ChunkHolder objects.

I suppose for the small benefit vs the added complexity it's probably easiest not to pool then, yeah. Minecraft chunks are already so overloaded with objects (esp. during generation).

That might be true, although from what it looks like, the objects were recycled too soon rather than too late.

Doesn't really matter anymore if we're removing pooling anyway, though.

I experimented with something where this was necessary, but I can roll it back I guess.

ThreadLocal should be more efficient as there's no synchronisation/locking involved, and it's not like it removes any "security" for object retention/bleed, as we should always still be uncaching in either implementation.
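A minimal sketch of that suggestion (illustrative names, not the actual implementation): per-thread storage with no locking, plus an explicit uncache step so pooled threads don't retain stale queues.

```java
// Illustrative sketch of per-thread STQE storage; not the actual code.
final class PerThreadQueues {
    static final class Stqe { /* stand-in for SingleThreadQueueExtent */ }

    private static final ThreadLocal<Stqe> QUEUE = ThreadLocal.withInitial(Stqe::new);

    static Stqe current() {
        return QUEUE.get(); // no synchronisation: each thread sees its own value
    }

    static void uncache() {
        QUEUE.remove(); // still needed in either implementation, as noted above
    }
}
```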

@dordsor21 dordsor21 merged commit 1642713 into main Mar 4, 2024
11 checks passed
@dordsor21 dordsor21 deleted the perf/avoid-chunk-locking branch March 4, 2024 06:31