Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Coalesce reads in Actor Cache flush #1814

Closed
wants to merge 12 commits into from

Conversation

MellowYarker
Copy link
Contributor

No description provided.

@MellowYarker MellowYarker changed the title Coalesce reads in Actor Cache flush [WIP] Coalesce reads in Actor Cache flush Mar 12, 2024
@a-robinson
Copy link
Member

Exciting! Let me know if/when you'd like me to take a look at this.

@MellowYarker
Copy link
Contributor Author

MellowYarker commented Mar 12, 2024

Exciting! Let me know if/when you'd like me to take a look at this.

There's still 1 failing test (ActorCache list stream cancellation) but definitely feel free to read through @a-robinson. Admittedly my commits could probably just be fixups on Ben's but I felt it was better to append than overwrite since I wasn't familiar with the code.

I just tossed in one more commit to try and (at least partially) fix that failing test, but it doesn't seem to be doing much. Other than that, I think I may try to add some more comments and do a bit more refactoring.

@MellowYarker
Copy link
Contributor Author

https://github.com/cloudflare/workerd/compare/035450664ec894cc6d26540e096d7c4f612652fa..a529139f3b54dacedd1b6b77b0b3bebe2babc63d

^ tries to make it a bit easier to track when Entrys are added to the lists and when we're transitioning EntrySyncStatus states.

@MellowYarker MellowYarker changed the title [WIP] Coalesce reads in Actor Cache flush Coalesce reads in Actor Cache flush Mar 14, 2024
@MellowYarker MellowYarker force-pushed the milan/actor-cache-batch-reads branch 2 times, most recently from 4feb0f1 to c25eca9 Compare March 18, 2024 15:49
@MellowYarker
Copy link
Contributor Author

https://github.com/cloudflare/workerd/compare/a529139f3b54dacedd1b6b77b0b3bebe2babc63d..4feb0f11e12a52065c9f38936ff165d4af9c622e was a rebase to fix conflicts. I accidentally pulled 1 extra commit so the second force push just gets us back to this feature branch on top of main.

Copy link
Member

@a-robinson a-robinson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a partial review of the first handful of commits for now, sorry. I'll continue reviewing the rest, although have a couple questions that could use a response as well.

src/workerd/io/actor-cache.c++ Outdated Show resolved Hide resolved
src/workerd/io/actor-cache.c++ Outdated Show resolved Hide resolved
src/workerd/io/actor-cache.c++ Outdated Show resolved Hide resolved
src/workerd/io/actor-cache.h Outdated Show resolved Hide resolved
src/workerd/io/actor-cache.h Show resolved Hide resolved
src/workerd/io/actor-cache.c++ Show resolved Hide resolved
src/workerd/io/actor-cache-test.c++ Show resolved Hide resolved
src/workerd/io/actor-cache-test.c++ Show resolved Hide resolved
src/workerd/io/actor-cache-test.c++ Show resolved Hide resolved
@MellowYarker
Copy link
Contributor Author

Just a partial review of the first handful of commits for now, sorry. I'll continue reviewing the rest, although have a couple questions that could use a response as well.

Thanks Alex, was hoping to get to some of the comments today but I noticed there was a new segfault. Thought something broke during the rebase, but it was just a bad commit. This took way longer to figure out than it should've 🙃

https://github.com/cloudflare/workerd/compare/d673ea074b7970362c48ea38034359757727c6c1..70a2d4637607cc2dc5028d723acd4dcaa975c20e

Copy link
Member

@a-robinson a-robinson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And it looks like you're aware of it, but there are a bunch of TODO(now) comments to work out :)

bcaimano and others added 12 commits March 26, 2024 11:25
This allows us to mutate from UNKNOWN to PRESENT or ABSENT but permit no other transition.
This commit does a few notable things:
- Counted deletes now use the lastFlush promise instead of their own fulfillers. This does mean that counted delete promises will resolve later in some rare cases like a counted delete and a delete all being requested within the same scheduled flush or a counted delete being issued as part of a transaction. The goal here is to get to a point where *all* operations just have to wait for the flush promise after they are scheduled.
- Counted deletes now maintain their own list of participating entries (somewhat like delete all). This means that we no longer have to "inherit" counted deletes when entries are overridden or do iterator tracking when building their flush batches.
This config doesn't need to chage, so let's convey that in the code.
This commit:
- Makes `syncStatus` private
- Makes removeEntry() the only way we remove from lists
- Introduces methods to add to the dirty or clean list, so that
  transitioning between `EntrySyncStatus` states is easier to follow.
- Adds/corrects some comments
The `dirtyList` is a kj::List with a thin wrapper around its methods.
When we add or remove from the list, we also update the size (total
bytes).

This was fine until we started modifying the size of the elements of the
list. In this case, we put an Entry in the `dirtyList` when we want to
read it from storage, and when the result from storage is ready we
overwrite whatever is in `Entry`. When we would later attempt to remove
the Entry from the `dirtyList`, it would decrement the size past 0 and
cause underflow, which would cause tests to hang.
It turns out we skip flushing in preview sessions entirely.

Rather than trying to find a way to only flush reads, we can just
do what we've always done prior to coalescing reads.
Promises were being dropped, but we mark them nodiscard.

We also explicit check for calls to mockStorage.
This commit attempts to provide a mechanism to cancel the promise
that waits for the read batches to complete in syncGets(). This is
needed for when the `ActorCache` is destroyed, since we need to destroy
all `Entry`s before we destroy the `SharedLru`.
Copy link
Member

@a-robinson a-robinson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Finished reviewing all test changes, will look at the code changes now.

Sorry for some of the comments being more questions and/or things I could poke at myself once my devspace is usable again, feel free to breeze past any such comments that aren't useful.

@@ -1427,70 +1413,122 @@ KJ_TEST("ActorCache read retry on flush containing only puts") {
KJ_ASSERT(KJ_ASSERT_NONNULL(promise.wait(ws)) == "123");
}

KJ_TEST("ActorCache read hard fail") {
// KJ_TEST("ActorCache read hard fail") {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this test commented out?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we've moved reads to flush along with writes, if the read fails, it looks like we're actually breaking the output gate 😬.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

😬

src/workerd/io/actor-cache-test.c++ Show resolved Hide resolved
src/workerd/io/actor-cache-test.c++ Show resolved Hide resolved
src/workerd/io/actor-cache-test.c++ Show resolved Hide resolved
.thenReturn(CAPNP(numDeleted = 2));
mockTxn->expectCall("delete", ws)
.withParams(CAPNP(keys = ["quux", "baz"]))
.thenReturn(CAPNP(numDeleted = 1234)); // count ignored
.withParams(CAPNP(keys = ["baz"]))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wouldn't worry about it now, but I wonder if this could be merged with the delete of quux at this point, given baz is an uncounted delete and quux is a counted delete whose count has already been previously filled in.

Might be worth adding a TODO for, but there's clearly already enough happening in this PR, and it's possible the code complexity wouldn't be worth it anyway.

@@ -1744,6 +1768,9 @@ KJ_TEST("ActorCache list() with limit") {
// Cached if we try it again though.
KJ_ASSERT(expectCached(test.list("bar", "qux", 4)) ==
kvs({{"bar", "456"}, {"baz", "789"}, {"foo", "123"}, {"garply", "54321"}}));

// Return our uncached get from earlier.
mockStorage->expectCall("get", ws).withParams(CAPNP(key = "fooa")).thenReturn(CAPNP());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Huh, so now we issue a get for this even though its promise was dropped? And somehow this comes after the list() that was issued later than it (line 1752 for the list vs line 1739 for the get)? I don't understand this :/

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved this expect closer to the actual get() b1b503c

Huh, so now we issue a get for this even though its promise was dropped?
I don't understand this :/

That makes two of us 😅, it's not clear to me why our GetWaiter isn't dtor'd here when it is above.

src/workerd/io/actor-cache-test.c++ Show resolved Hide resolved
// Now that we've confirmed `qux`, `xxx`, and `yyy` are in cache, we can let `bar` and `baz` be
// inserted into cache. If we did individual `get()`s, then `bar` and `baz` would actually
// evict `qux` and `xxx` before we got to verify if they were in cache!
mockStorage->expectCall("getMultiple", ws)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it at all worrisome that we now send these gets to the backend despite neither of their callers actually waiting around on the results? It seems preferable to avoid the storage reads entirely if the caller abandoned the operations before the event loop was yielded.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, especially because in some cases we drop it, ex. ActorCache read overwrite

  // Make some gets but overwrite them in the cache with puts.
  auto promise1 = expectUncached(test.get("foo"));
  auto promise2 = expectUncached(test.get("bar"));
  (void)expectUncached(test.get("baz"));

  test.put("foo", "456");
  test.put("bar", "789");
  test.put("baz", "123");

In this case, we don't do the get() for baz, but in other cases, ex. this comment we still do the get.

It's not yet clear to me what's causing this, but I think it's worth figuring out.

.withParams(CAPNP(keys = ["bar", "baz", "qux"]), "stream"_kj)
.useCallback("stream", [&](MockClient stream) {
// TODO(now): Using `kilobyte` for value results in the error "External constants not allowed."
// What can we do to fix that?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, yeah, this is messy because of how CAPNP is a preprocessor macro that's just stringifying its input into the capnproto text format.

I can help play with this once I get my dev environment back if you want (e.g. perhaps a #define KILOBYTE xxx... would work here), but TBH I don't think it's really a problem if we have to leave these in so long as there's a comment explaining why.

src/workerd/io/actor-cache-test.c++ Show resolved Hide resolved
src/workerd/io/actor-cache.h Show resolved Hide resolved
src/workerd/io/actor-cache.c++ Show resolved Hide resolved
src/workerd/io/actor-cache.c++ Outdated Show resolved Hide resolved
@@ -291,8 +301,21 @@ void ActorCache::touchEntry(Lock& lock, Entry& entry) {
}

void ActorCache::removeEntry(Lock& lock, Entry& entry) {
KJ_IASSERT(!entry.isDirty());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, I kind of appreciated that we had some protections against accidentally removing dirty entries (and ensuring their countedDelete is carried forward) rather than happily doing so silently here.

src/workerd/io/actor-cache.h Show resolved Hide resolved
src/workerd/io/actor-cache.h Show resolved Hide resolved
src/workerd/io/actor-cache.h Outdated Show resolved Hide resolved
src/workerd/io/actor-cache.h Outdated Show resolved Hide resolved
src/workerd/io/actor-cache.h Show resolved Hide resolved
src/workerd/io/actor-cache.h Show resolved Hide resolved
src/workerd/io/actor-cache.c++ Show resolved Hide resolved
src/workerd/io/actor-cache.c++ Show resolved Hide resolved
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants