Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Proposal for async out-of-band callback support #157

Open
mgravell opened this issue Mar 27, 2024 · 5 comments · May be fixed by #387
Open

Proposal for async out-of-band callback support #157

mgravell opened this issue Mar 27, 2024 · 5 comments · May be fixed by #387
Assignees
Labels
enhancement New feature or request

Comments

@mgravell
Copy link
Contributor

mgravell commented Mar 27, 2024

This is beyond features that Redis offers; I have discussed similar with Redis, but it hasn't gone anywhere; maybe Garnet can trailblaze here?

The FIFO nature of Redis request/response means that two scenarios cause severe head-of-line blocking problems:

  • explicitly blocking operations (BRPOP etc)
  • anything that stalls the data - checkpoints, slow disk read, etc

Wouldn't it be nice if async completion was possible, WITHIN the existing RESP protocol?

Proposal:

Add a feature that is consistent with the pattern used by the client side caching Redis feature, composed of four parts:

  1. ASYNC {ON|OFF} [REDIRECT conn] [OPTIN] - enables the async feature on a connection, optionally specifying a redirect pub/sub connection (required on RESP2, disallowed on RESP3), and the optional OPTIN that indicates whether this is on/off by default on a command-by-command basis; response: +OK
  2. ASYNC {YES|NO} - (requires ASYNC ON first) optional; explicitly enable/disable async for the next operation (typically used as ASYNC YES to turn async on for an individual call, when OPTIN is enabled) - response: +OK
  3. -ASYNC token new error response from the server when a specific operation has gone out-of-band, where token is the correlation identifier to match this result when available (only used when ASYNC ON is in operation, and the command is electing to use async)
  4. "push" message (on the redirect connection for RESP2, on the current connection for RESP3) that is an array with 3 parts: the category identifier "async", the correlation identifier, and the RESP fragment that is the response for that message; this includes all possible RESP responses including errors, etc

When is a command eligible for async?

  • ASYNC ON must have been issued, and no subsequent ASYNC OFF
  • if OPTIN was specified, there must be an ASYNC YES immediately† before the message
  • there must not be an ASYNC NO immediately† before the message

†=allowing for stacking with other "applies to the next command" commands, for example CLIENT CACHING

Notes:

  • async responses are optional; if the response is available synchronously, the result is returned instead of -ASYNC token
  • the correlation id is issued server-side and opaque; there is no planned API that allows a client to do anything with it
  • the server must guarantee uniqueness of the correlation id in the scope of a single connection (conflicting identifiers on unrelated connections is explicitly permitted)
  • any clients not prepared for async see this as a redis error that is clear - it won't work, but it also won't smash clients to pieces; this is directly comparable to -MOVED for cluster
  • clients that are async aware can silently ignore -ERR from ASYNC commands on servers that do not implement it; the next command will simply return synchronously which is exactly what we expect - no change to sync path
  • tokens do not need to be issued or stored/matched unless we go async: zero impact to sync path
  • tokens must be compatible with "simple error" strings; opaque integer token makes a lot of sense (concretely: "simple error" and "simple string" values must use printable ASCII excluding \r and \n)
  • choosing to use ASYNC is inherently also saying "I'm ok with this going out of FIFO"
  • async does not operate for commands inside MULTI (responses will -QUEUED, including for ASYNC messages) or for WATCH / DISCARD (response will always be synchronous); the server may (but is not required to) allow async completion for EXEC
  • async does not operate within Lua (responses will be synchronous)
  • async does not add timeout behaviour; commands that do already have a timeout (brpop etc) would be expected to work as such, though
  • the server is permitted to apply quotas or other reasons to not use an async response; for example, the server may wish to allow only N outstanding async commands per connection; it could choose to switch to sync only at that point, or could issue -ERR too many async operations
  • outstanding async operations are considered "doomed" if a connection ends for any reason, and should be discarded by client and server (obviously doing any necessary book-keeping/cleanup)
  • the ASYNC command will never itself give a -ASYNC response
  • connection state modifier commands like ASKING, READONLY, READWRITE, SELECT are always synchronous
  • there is no "order" in async; the ordered responses -ASYNC 123 and -ASYNC 456 may receive async responses in the order 456 then 123

Typical usage:

C: ASYNC ON REDIRECT 12344
S: +OK
C: GET foo
S: "adskjhsdfjkh"
C: GET bar
S: -ASYNC 3434534
C: GET blap
S: "asldkfslkfdgh"

.... (at some point, on subscription connection 12344)
S: ["async" "3434534" "askdjhasdkjsah"]

On RESP3 the REDIRECT 12344 would be omitted, and the server would respond (at some point) via "push" on the same connection.

Alternatively:

C: ASYNC ON REDIRECT 12344
S: +OK
C: ASYNC YES  # next command is allowed (not required) to give async response
S: +OK
C: GET foo
S: "adskjhsdfjkh"
C: ASYNC YES    # next command is allowed (not required) to give async response
S: +OK
C: GET bar
S: -ASYNC 3434534
C: GET blap     # this will never go ASYNC, because no ASYNC YES
S: "asldkfslkfdgh"
C: ASYNC NO
S: +OK
C: GET blap     # this will never go ASYNC, because ASYNC NO
S: "asldkfslkfdgh"

.... (at some point, on subscription connection 12344)
S: ["async" "3434534" "askdjhasdkjsah"]

I am happy to offer the time to get such an idea working with SE.Redis

@badrishc
Copy link
Contributor

Looks like a good idea, particularly for GET from slow disks or cloud storage backends. It can avoid the FIFO head of line blocking. Checkpoints do not block operations in Garnet so that's not a reason for us. We don't yet have BRPOP, but it will help when we do.

I prefer the RESP3 version. Making a second connection per connection is prohibitively inefficient for most scenarios I've seen that might use this. Is there no way to reuse the same connection in RESP2? If not then Garnet could offer the RESP3 semantics just for this.

@kevin-montrose
Copy link

Piping up to say, this would solve some issues my team is having as well.

When embedding Garnet it is desirable to conditionally go async, rather than always either blocking or forcing operations onto the threadpool. Today we just block, since most operations complete synchronously and the extra trip through the threadpool is murder on performance.

I prefer the RESP3 version. Making a second connection per connection is prohibitively inefficient for most scenarios I've seen that might use this. Is there no way to reuse the same connection in RESP2?

RESP3 introduces the Push type, the equivalent under RESP2 is to use a separate connection. This inefficiency in RESP2 is why the Push type was added 😊.

@mgravell
Copy link
Contributor Author

mgravell commented Mar 27, 2024

Technically (and I'll explain why I don't advocate this) we could break specification and issue a RESP3 push response out-of-band on the single connection in RESP2; a suitably educated client would know to expect and handle >3\r\n$5\r\nasync\r\n:12345\r\n{response}. But: my concern isn't about suitably educated clients; it is about "what would most clients expect?". And RESP2 clients will not expect such a message; they only expect out-of-band data on pub/sub connections, and will likely get desynchronized if you did this. RESP2+RESP3 clients might also be a little confused to see "push" data if we haven't completed HELLO 3

However: here's the thing - implementing RESP3 doesn't need to be hard! You do not necessarily need to change every available response to issue RESP3 counterparts; RESP3-aware clients need to handle both RESP2 and RESP3 (because they still want to talk to RESP2 servers), so: even if the only thing you did was implement the HELLO 3 handshake and allow "push" on that connection: that might be enough. Once the connection is RESP3, the server is free to issue "push" messages. You could simply not implement the REDIRECT feature, in that case.

Note that if SE.Redis is in RESP3 mode, it will want to use regular pub/sub on the RESP3 connection, so: any other pub/sub support would need to work on RESP3.

@darrenge darrenge added the enhancement New feature or request label Mar 27, 2024
@badrishc
Copy link
Contributor

I have one suggestion for the async API proposed above. It should have the notion of a barrier, that will complete (or await) all the earlier-issued asynchronous operations. This is useful for clients to maintain session consistency. For example, they may do a series of reads followed by a write, but might want the pending reads to complete first before issuing the write on that session. This is already available, called CompletePending(...) in FASTER and Tsavorite (Garnet's fork of FASTER).

Example API:

C: GET foo1
S: -ASYNC 3434534
C: GET foo2
S: "asldkfslkfdgh"
C: GET foo3
S: -ASYNC 3434535
C: BARRIER
S: ["async" "3434535" "askdjhasdkjsah_foo3"]
S: ["async" "3434534" "askdjhasdkjsah_foo1"]
S: OK # when we get this OK response to BARRIER, we know all async ops are over

@mgravell
Copy link
Contributor Author

mgravell commented Mar 29, 2024 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants