Sending outbox messages is fraught with issues #3881

Open
rk-for-zulip opened this issue Feb 7, 2020 · 17 comments
Labels: a-compose/send (Compose box, autocomplete, camera/upload, outbox, sending), P1 high-priority

Comments

@rk-for-zulip
Contributor

rk-for-zulip commented Feb 7, 2020

There is a small constellation of intertwined problems with message-sending.

  • All outbox messages are sent immediately, with no throttling.
  • Outbox messages may be sent in any order, regardless of their input order.
  • Server-rejected messages are not marked as such and will be repeatedly re-sent, wasting bandwidth and battery.
  • Outbox messages may be successfully sent multiple times.
  • If the app is interrupted and killed while trying to send messages, it may become permanently stuck, and will never send messages again unless it is uninstalled or its data is wiped.
  • (... etc.?)
@gnprice
Member

gnprice commented Feb 7, 2020

  • If the app is interrupted and killed while trying to send messages, it may become permanently stuck, and will never send messages again unless it is uninstalled or its data is wiped.

Huh fascinating! How does this one happen?

  • Outbox messages may be successfully sent multiple times.

Definitely possible in principle: I think the scenario would be that a send request succeeds on the server, but the response doesn't make it to us; and the new-message event (which we make into EVENT_NEW_MESSAGE) doesn't reach us either before we retry.

Is that the kind of scenario you have in mind, or is there another path?

All of these would be good to address. I imagine a single solution may address several of them at once.

@gnprice gnprice added the a-compose/send Compose box, autocomplete, camera/upload, outbox, sending label Feb 7, 2020
@gnprice
Member

gnprice commented Feb 7, 2020

See also #3829, #3731, #3584, #2374. And #3247 is an issue with a similar nature on a different codepath: double-sends of posting an emoji reaction to a message.

@gnprice
Member

gnprice commented Feb 7, 2020

OK, and in particular:

  • Outbox messages may be successfully sent multiple times.

Definitely possible in principle: I think the scenario would be that a send request succeeds on the server, but the response doesn't make it to us; and the new-message event (which we make into EVENT_NEW_MESSAGE) doesn't reach us either before we retry.

This one is #2374. There's some nice discussion on that thread, too.

  • Server-rejected messages are not marked as such and will be repeatedly re-sent, wasting bandwidth and battery.

I think this is the same issue as #3731, describing another aspect of the symptoms. (Which would be a nice point to add to that thread.)

@rk-for-zulip
Contributor Author

rk-for-zulip commented Feb 7, 2020

  • If the app is interrupted and killed while trying to send messages, it may become permanently stuck, and will never send messages again unless it is uninstalled or its data is wiped.

Huh fascinating! How does this one happen?

outboxSending is stored in Redux. If the app is dehydrated with that set to true, then killed, there is no mechanism that will clear it.

On reread, I think that's actually not possible at the moment... but only thanks to other bugs. tryUntilSuccessful doesn't await (it's not even async), and since it always returns true, sendOutbox will never await either. I don't think it's currently possible for the dehydration code to run while outboxSending is true.

At least, not unless there's an exception-generating bug in tryUntilSuccessful somewhere. That could do it.

  • Outbox messages may be successfully sent multiple times.

Definitely possible in principle: I think the scenario would be that a send request succeeds on the server, but the response doesn't make it to us; and the new-message event (which we make into EVENT_NEW_MESSAGE) doesn't reach us either before we retry.

Is that the kind of scenario you have in mind, or is there another path?

There is, sadly. The message-sending tasks are completely loose – their conclusion is unordered with the rest of the program, so there's no sequencing between them and outboxSending being unset. In particular, if we get two calls to sendOutbox in synchronous succession, the message-sending tasks will all be fired twice before any send-attempts are made.

This can all be fixed on our end, at least in theory. The scenario you described is another matter, and may require API changes to prevent. (Though EVENT_NEW_MESSAGE should provide some mitigation.)

All of these would be good to address. I imagine a single solution may address several of them at once.

I would be very wary of any solution that did not.

@rk-for-zulip
Contributor Author

rk-for-zulip commented Feb 7, 2020

I think this is the same issue as #3731, describing another aspect of the symptoms. (Which would be a nice point to add to that thread.)

They're certainly closely related, but only because fixing them will require storing essentially the same data. (See below.) It's otherwise technically possible to fix either without the other – and in particular I wouldn't want the fix for the network behavior to be blocked on having an appropriate display form.

It's probably appropriate to copy my comment there over here, though:

From offline brainstorming: perhaps there should be an optional httpResponse field in Outbox. [...]

We could also give httpResponse a special non-numeric "too old to send" value [...] and retain messages that would have been dropped due to age [...].
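
To make the shape of that suggestion concrete, here's a rough illustrative sketch only -- not a settled design, and the field name and values here are just placeholders:

// Hypothetical sketch: an Outbox item that remembers the outcome of its
// last failed send attempt, so we can stop retrying and display it.
type OutboxWithResponse = {|
  // ...the existing Outbox fields...

  // HTTP status of the last failed send attempt, or a special marker
  // meaning "too old to send"; absent while no attempt has failed.
  httpResponse?: number | 'too-old-to-send',
|};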

@gnprice
Member

gnprice commented Feb 29, 2020

outboxSending is stored in Redux. If the app is dehydrated with that set to true, then killed, there is no mechanism that will clear it.

Aha. There's actually a good reason this can't happen. That flag specifically appears at state.session.outboxSending... and state.session is among the parts of our Redux state we specifically don't have redux-persist put into persistent storage.

Quoting from src/boot/store.js:

/**
 * Properties on the global store which we explicitly choose not to persist.
 *
 * All properties on the global store should appear either here or in the
 * lists of properties we do persist, below.
 */
// prettier-ignore
export const discardKeys: Array<$Keys<GlobalState>> = [
  'alertWords', 'caughtUp', 'fetching',
  'nav', 'presence', 'session', 'topics', 'typing', 'userStatus',
];

That makes state.session an appropriate spot for information, like this, that's about what the current live app process is actively doing.

@gnprice
Member

gnprice commented Feb 29, 2020

The message-sending tasks are completely loose – their conclusion is unordered with the rest of the program, so there's no sequencing between them and outboxSending being unset. In particular, if we get two calls to sendOutbox in synchronous succession, the message-sending tasks will all be fired twice before any send-attempts are made.

I see. I think you're referring to the fact that this await:

export const trySendMessages = (dispatch: Dispatch, getState: GetState): boolean => {
  // ...
  try {
    outboxToSend.forEach(async item => {
      // ...
      await api.sendMessage(auth, {

is in a forEach callback, and so nothing actually awaits the promise that that async function returns. And as a result here:

export const sendOutbox = () => async (dispatch: Dispatch, getState: GetState) => {
  // ...
  dispatch(toggleOutboxSending(true));
  while (!trySendMessages(dispatch, getState)) {
    await progressiveTimeout(); // eslint-disable-line no-await-in-loop
  }
  dispatch(toggleOutboxSending(false));
};

we set outboxSending back to false without actually waiting for any of those api.sendMessage promises to complete.

One nuance:

the message-sending tasks will all be fired twice before any send-attempts are made.

I don't think this is quite right. They'll all be fired twice before any responses to the send-attempts are received. But remember that await foo() is await (foo()), i.e. the foo() function call happens before any awaiting is needed... and that when a JS async function is called, the function body starts executing immediately, just like a non-async function, and yields only at await or return.

So it should be the case that when you call api.sendMessage(..), it synchronously gets all the way to a fetch call. (If not, that's probably a bug in our code, and kind of a surprising one.) And that, similarly, will fire off an actual network request -- then return a promise which will get resolved (or rejected) when the request concludes, and ultimately api.sendMessage(..) returns some promise chained off of that one.
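
A tiny self-contained illustration of that ordering (not from the app's code):

async function demo() {
  console.log('A: runs synchronously, up to the first await');
  await Promise.resolve();
  console.log('C: runs later, after the caller has moved on');
}

demo();
console.log('B: the caller resumes once demo() hits its first await');
// Logs in the order A, B, C.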

Still, firing twice before seeing any responses is clearly not the desired behavior here either.

@gnprice
Member

gnprice commented Feb 29, 2020

Looking again at this code, I think the basic problem is that that outboxToSend.forEach loop is a forEach loop, and the code (as quoted in my previous comment) was clearly written with the expectation it would behave like a for loop.

In particular the boolean return from trySendMessages is almost completely useless given the way the code actually works -- the only way to make sense of the intention here is that it wires up a failed request in api.sendMessage to cause us to do an await progressiveTimeout(). An exception in the top half of that forEach callback, before reaching the await, (a) is a lot less likely, (b) would be a bug in our code, not an external condition like a bad network connection, (c) would therefore make no sense to retry and less sense to back off before retrying.

But also, if that loop is changed to be a real JS loop so that the awaits actually happen in sequence, then I believe that fixes three of the points mentioned in the OP:

  • All outbox messages are sent immediately, with no throttling.
  • Outbox messages may be sent in any order, regardless of their input order.
  • Outbox messages may be successfully sent multiple times.

(For the third, it fixes the cause of it you had in mind, though not the other cause I then mentioned above.)

So, seems like we should do that!
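
As a very rough sketch of the shape of that rewrite (simplified: paramsForItem is a hypothetical helper standing in for building the request from the outbox item, and the failure handling discussed later in this thread is left out):

// Sketch only: send outbox items strictly in order, awaiting each request
// before starting the next one. Error handling and backoff are omitted.
for (const item of outboxToSend) {
  // eslint-disable-next-line no-await-in-loop
  await api.sendMessage(auth, paramsForItem(item)); // paramsForItem: hypothetical helper
}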

I also went and took a look at the history of this file, following my speculations above about what the authors were thinking. 😉 The logic became the way it is in #3272 -- which in particular was supposed to fix #3259, sending messages out of order. Before that, #1079 introduced the forEach(async ..) confusion.

@rk-for-zulip
Contributor Author

So, seems like we should do that!

Except that if we do – or at least, if we just do that – then that causes another bug to surface: the presence of an unsendable message in the Outbox will prevent any other messages from being sent for a week, after which all sendable messages (up to the next unsendable message, anyway) will suddenly be flushed to the server at once.

(This would have been worse before the recent 'discard unsendable messages after a week' change. Back then it would just have silently prevented any further messages from ever being sent at all.)

@gnprice
Member

gnprice commented Feb 29, 2020

Aha. There's actually a good reason this can't happen. [...] state.session is among the parts of our Redux state we specifically don't have redux-persist put into persistent storage.

This is a pretty subtle aspect of how the app works. I've just pushed 664ee09, which adds a bit more discussion of this in a place which hopefully helps make it somewhat easier to discover.

@gnprice
Member

gnprice commented Feb 29, 2020

the presence of an unsendable message in the Outbox will prevent any other messages from being sent for a week

Yeah, we should behave differently on

  • a 4xx status, indicating there's a problem with the specific request; and
  • a 5xx status or lack of response, either way indicating there's a problem with the server and/or network.

In the latter case, we should retry with backoff, and keep other outbox messages in the queue behind this one, along the lines of what this code was intended to do.

In the former, we should treat the message as permanently unsendable. One more-immediate thing we could do, to unblock fixing the rest of the logic without regressing anything, would be: on a 4xx status, move on to the next outbox message in the queue, but leave this one there, basically like we (accidentally) do in the current code.
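
Sketched as pseudologic (illustrative only; how exactly the status code surfaces from the failed request is glossed over here):

// Illustrative sketch of the retry decision, not actual app code.
function classifySendFailure(httpStatus: number | void): 'give-up' | 'retry-with-backoff' {
  if (httpStatus !== undefined && httpStatus >= 400 && httpStatus < 500) {
    // Problem with this particular request: don't keep retrying it.
    // (Or, as the more-immediate option above: skip past it and leave
    // it in the outbox.)
    return 'give-up';
  }
  // 5xx or no response at all: server/network trouble. Retry with
  // backoff, keeping later outbox messages queued behind this one.
  return 'retry-with-backoff';
}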

rk-for-zulip added a commit to rk-for-zulip/zulip-mobile that referenced this issue May 20, 2020
Write a complete replacement for `trySendMessages`.

This is not yet hooked in.

Will fix: zulip#3881, (others...)
gnprice added a commit to gnprice/zulip-mobile that referenced this issue Nov 11, 2020
This loop doesn't do what it looks like it does.  It looks like a
loop that tries to send one message, awaits that, then tries the
next, and so on.  In fact, because Array#forEach does nothing with
the return values of its callback -- in particular it doesn't notice
if they're promises or anything else, and doesn't await them -- the
actual behavior is to fire off a bunch of requests in parallel, wait
for nothing, and ignore errors.

That behavior isn't good, and we should fix it; that's zulip#3881.

For a start, just make the code more explicit about what it actually
does: desugar the async/await keywords to the equivalent use of
Promise#then, and add a comment for good measure.

This also smooths the way to rewriting Array#forEach here as part of
an automated sweep across our code.
gnprice added a commit to gnprice/zulip-mobile that referenced this issue Nov 12, 2020
gnprice added a commit to gnprice/zulip-mobile that referenced this issue Nov 14, 2020
chrisbobbe added a commit to agrawal-d/zulip-mobile that referenced this issue Dec 7, 2020
It'll be easier to write a test for it, when we eventually attempt
that, probably pending resolution of zulip#3881. In particular, without
this change, Flow would complain if we tried to pass
`store.dispatch` -- where `store` is mocked using redux-mock-store
-- as the first argument, for which our custom `Dispatch` type is
expected. I suspect this could be fixed by tightening up the
redux-mock-store libdef's `Dispatch` type. But, for consistency's
sake, `trySendMessages` should probably be a thunk action creator in
any case.

In e5268bb (and confirmed again with logging just now), we
observed that the return value of our function (in this case a
boolean) will be passed through and returned by the `dispatch` call.
@chrisbobbe
Contributor

There's been quite a bit of productive discussion around here. The main piece of prep work we'd like to get done for our solution is #4193, which introduces a handy promiseTimeout function. We can use that function to time out a send-message request.

Marking as blocked by #4193, then, as noted in chat.
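
For anyone reading along, the idea of such a helper (sketched generically here; the actual promiseTimeout from #4193 may differ in name, signature, and behavior) is roughly:

// Generic sketch of a promise-timeout helper, for illustration only.
function withTimeout<T>(promise: Promise<T>, ms: number): Promise<T> {
  return Promise.race([
    promise,
    new Promise((_, reject) => setTimeout(() => reject(new Error('timed out')), ms)),
  ]);
}

// Usage sketch: stop waiting on a send request after 60 seconds so the
// outbox logic can move on to its retry/backoff handling.
//   await withTimeout(api.sendMessage(auth, params), 60 * 1000);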

@gnprice
Member

gnprice commented Mar 17, 2022

Marking as blocked by #4193, then

That PR was merged (after being split in two: #4753, #4754), so unblocking.

@gnprice gnprice removed the blocked on other work To come back to after another related PR, or some other task. label Mar 17, 2022
@gnprice
Member

gnprice commented Mar 17, 2022

I think the next step, when we pick this thread of work up again, will be to take that chat thread from 2020-12 and make a fresh draft of the proposed design. (@chrisbobbe sent a design up at the top of the thread, and then there was further discussion but nobody has compiled a revised draft all in one place.)

That will include a state diagram, akin to this one, and a type definition, like this one crossed with the full draft from the start of the thread.

@chrisbobbe
Contributor

chrisbobbe commented Jul 13, 2022

I think the next step, when we pick this thread of work up again, will be to take that chat thread from 2020-12 and make a fresh draft of the proposed design.

I've picked this back up again, here. I've written a new state diagram but I'd like to go through some concerns with it before doing the other parts.

@chrisbobbe
Contributor

chrisbobbe commented Jul 21, 2022

Here's our latest state diagram, from discussion:

            User cancels the scheduled send.
           ┌────────────────────────────────────────────────────────────────────────┐
           │                                                                        │
           │                                                                        │
           │                                   User cancels during send (#4170).    │
           │                                  ┌───────────────────────────────────┐ │
           │                                  │                                   │ │
           │                                  │                                   │ │
           │                                  │                Event received,    │ │
(create)   │      Time for the scheduled      │                or we abandoned    │ │
  │        │      send (a try/retry).         │    200.        the queue.         ▼ ▼
  └► should-send ───────────────────────► sending ─────► sent ────────────────► (delete)
       │ ▲ ▲                                │ │                                     ▲
       │ │ │                                │ │                                     │
       │ │ │ App quit: schedule auto-retry. │ │                                     │
       │ │ │                                │ │                                     │
       │ │ │ 5xx, network error, or (with   │ │                                     │
       │ │ │ #4170) 60s network timeout:    │ │                                     │
       │ │ │ schedule auto-retry with       │ │                                     │
       │ │ │ backoff.                       │ │ 4xx.                                │
       │ │ └────────────────────────────────┘ └───────────────────────┐             │
       │ │                                                            │             │
       │ │                                                            │    User     │
       │ │ User requested a retry; schedule to run immediately.       ▼    cancels. │
       │ └──────────────────────────────────────────────────────── failed ──────────┘
       │                                                              ▲
       │                                                              │
       │ Too old: message is "better never than late". Too much time  │
       │ from creation or last user-requested retry.                  │
       └──────────────────────────────────────────────────────────────┘
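
An illustrative encoding of those states as a type (a sketch only, not the agreed-on type definition mentioned earlier; field names are hypothetical):

// Sketch: one possible encoding of the send-state machine above.
// '(create)' and '(delete)' are the item entering and leaving the outbox,
// so they don't need variants of their own.
type OutboxSendState =
  | {| type: 'should-send', nextTryAt: number |}
  | {| type: 'sending' |}
  | {| type: 'sent' |}
  | {| type: 'failed', reason: '4xx' | 'too-old' |};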

@sevmonster

I just had a situation where a user who sent 4 messages ended up sending 20 of them, due to a connectivity issue with the server... I assume that #5525 is also related to this.
