Time out initial fetch and go to account-picker screen. #4754

chrisbobbe · 2021-05-20T21:40:34Z

~~This is meant to follow #4753~~ Done! I've split #4193 into two PRs, that one and this one, to make it easier to review.

This is the "minimal version" of a fix Greg describes in the issue; as he mentions there, it would be good to make follow-up improvements soon after merging this, or open a new issue for those.

Fixes: #4165

chrisbobbe · 2021-05-20T21:43:35Z

(edit: this is done)

Also, note to self: I should make it so we do retry the initial fetch on 5xx's if we have stale server data and the UI is visible; otherwise, there's no point in retrying, if we're looking at the full-screen loading indicator where the user can't do anything useful.

chrisbobbe · 2021-05-21T16:56:57Z

Also, note to self: I should make it so we do retry the initial fetch on 5xx's if we have stale server data and the UI is visible; otherwise, there's no point in retrying, if we're looking at the full-screen loading indicator where the user can't do anything useful.

Done; revision pushed!

chrisbobbe · 2021-05-22T00:58:24Z

Note to self: Probably a good idea to alert the user (and tell Sentry) which bad outcome occurred: a 5xx error or a timeout.

gnprice

Thanks @chrisbobbe ! Generally this looks good -- small comments below, and one more-significant question about the set of errors we're handling.

As of the end of the branch I believe we're handling 5xx, 4xx, and timeouts appropriately. But there could be other ways a network failure shows up: if the device is unable to make a TCP connection (perhaps because it's in airplane mode), or a TLS connection atop that (perhaps because there's something wrong operationally with the server), then we'll treat that as an unexpected error that represents a bug. Can we work out what those would look like here, and handle them appropriately so that there's no operational issue that could cause an uncaught exception?

Separately: there's a string of 8 commits at the end that convert over from Lolex to Jest's fake timers. Do those need to be at the end, or can they come earlier -- perhaps even as a separate PR?

gnprice · 2021-05-21T21:57:39Z

src/message/fetchActions.js

-      // TODO: This should be narrowed to `!isServerError(e)`; we
-      // should fail early if we encounter unrecognized / unexpected
-      // errors.
-      if (isClientError(e)) {
+      if (!isServerError(e)) {


Hmm. I know this is what the TODO comment said, but it doesn't feel quite right to me.

The issue is: what happens if there's a failure in the network connection? I think we basically want to treat those exactly the same as a 5xx error -- either one can be caused by operational issues, is usually transient, but may not be transient if the server is permanently offline.

I like this reasoning:

If we get some strange, totally unexpected error caused by a bug somewhere, we shouldn't continue the retry loop. Not only would that waste the user's time, but it means the error basically gets swallowed, and it makes it much harder for us to detect it and debug effectively. So, fail early on exceptions like that.

So I guess the point is that there's another category of specific, expected errors here, which is a network failure. I'm not sure exactly what those look like at this point in the code.

gnprice · 2021-05-21T22:18:23Z

src/message/fetchActions.js

+ * Makes a request that retries until success or a non-5xx error;
+ *   times out after `config.requestLongTimeoutMs`.


Can shorten the last clause to ", with timeout", to help keep this summary line concise. The detail of what the timeout's value is is good information to put in the prose below.

gnprice · 2021-05-22T05:00:25Z

jest/jestSetup.js

+/**
+ * Default should be set with `timers: 'modern'` in Jest config.
+ */
+assertUsingModernFakeTimers();


nit: this isn't a declaration of anything, so there's nothing the comment is describing the interface of -- so it should be a plain non-jsdoc implementation-comment

gnprice · 2021-05-22T05:04:10Z

src/chat/FetchError.js

+        {(() => {
+          if (this.props.error instanceof TimeoutError) {
+            return <Label style={styles.text} text="Request timed out." />;
+          } else {
+            return <Label style={styles.text} text="Oops! Something went wrong." />;
+          }
+        })()}


Probably simpler as a … ? … : … expression, no?

Probably simpler as a … ? … : … expression, no?

Sure. I used if / else because I thought there might be other cases to consider in the future, and this form would make it easier to expand on.

I see. Yeah, that seems like a fine reason.

chrisbobbe · 2021-06-02T20:48:36Z

Thanks for the review! Revision pushed.

Two (possibly related?) things have occurred to me:

Is INITIAL_FETCH_ABORT quite the right name, or should it be more like INITIAL_FETCH_ERROR? (Compare MESSAGE_FETCH_ERROR.)
One place in api.registerForEvents where things could conceivably fail is in the step where we transform the data we receive from the server (with a function named transform). Do we need to change our error handling for that at all?

chrisbobbe · 2021-06-25T00:16:41Z

(Just rebased.)

gnprice

Thanks for the revision! Sorry I've let this sit for a bit.

The app-level (i.e., outside src/api/) code looks good, with one comment below.

I think the apiErrors code doesn't quite work together properly with the existing code in that file. That existing code should change, I think, to handle this properly. I'll sketch out a version of how that could look, and either push some draft commits to this branch or send a PR.

gnprice · 2021-07-13T21:56:10Z

src/api/apiErrors.js

+ *   https://fetch.spec.whatwg.org/#ref-for-concept-network-error%E2%91%A5%E2%93%AA
+ */
+export const isNetworkRequestFailedError = (e: Error): boolean =>
+  e instanceof TypeError && e.message === 'Network request failed';


Huh, wacky that this is a TypeError.

I worry a bit about conditioning on the message. I don't see that text mentioned in the spec, so I wonder if it might vary between implementations -- or if it might sometimes be translated into the user's chosen language on the device.

I think that probably means:

Here inside src/api/, scoped tightly around the fetch, we should catch this error (as just instanceof TypeError, odd as that is), and throw a more specific error we make up.

Then app code can look for that.

gnprice · 2021-07-13T22:01:18Z

src/api/apiErrors.js

+export const isServerError = (e: Error): boolean =>
+  e instanceof ApiError && e.httpStatus >= 500 && e.httpStatus <= 599;


Hmm. Does this actually happen in practice?

Looking at makeErrorFromApi above, which is where we construct ApiError objects, it looks like on a typical 5xx response we'll currently throw just an Error object, with some text in the message. (We'd throw an ApiError on a 5xx response if it comes with a body that's valid JSON and decodes to a nice Zulip API error object { result: 'error', msg: string, code?: string | void, ... }. But a 5xx is a server failure, and may not carry nice well-formed JSON like that.)

So I think we want the code in src/api/ to be throwing a more distinctive error in that case, and then to have isServerError detect that.

gnprice · 2021-07-13T22:15:54Z

src/message/__tests__/fetchActions-test.js

+    );
+
+    describe('if asked to retry', () => {
+      test('retries a call if there is a recoverable error', async () => {


fetchActions: Fail early on all non–known-retryable errors in `tryFetch`.

I appreciate your careful use of both an en-dash and a hyphen. Wouldn't want a reader to think we're talking about errors that are retryable in a non-known fashion :-P

I wish I still had access to the Chicago Manual of Style; it looks like this would be covered in Chapter 6: https://www.chicagomanualofstyle.org/16/ch06/ch06_toc.html 😛 (but I no longer have my college-sponsored subscription, alas).

gnprice · 2021-07-13T22:31:58Z

src/message/fetchActions.js

+    } else {
+      logging.warn(e, {
+        message: 'Unexpected error during initial fetch and serverSettings fetch.',
+      });


We probably want to be aborting in this case too, right?

Basically it's an internal error in the app. But the user is probably better off getting kicked back to the account picker than having the loading screen continue indefinitely -- it means at least they can try again and see if the issue recurs, or try another account and see if they can at least use that.

gnprice · 2021-07-14T05:33:37Z

I think the apiErrors code doesn't quite work together properly with the existing code in that file. That existing code should change, I think, to handle this properly. I'll sketch out a version of how that could look, and either push some draft commits to this branch or send a PR.

OK, and sent #4896 for that.

As Greg points out [1]: > Basically it's an internal error in the app. But the user is > probably better off getting kicked back to the account picker than > having the loading screen continue indefinitely -- it means at > least they can try again and see if the issue recurs, or try > another account and see if they can at least use that. [1] zulip#4754 (comment)

chrisbobbe · 2021-07-15T19:29:49Z

Thanks for the review, and for #4896! Revision pushed.

We'll use this in `tryFetch`, in an upcoming commit, so that we can interrupt in-progress attempts to contact a server when they take too long. See discussion [1]. [1]: https://chat.zulip.org/#narrow/stream/243-mobile-team/topic/Stuck.20on.20loading.20screen/near/907693

To be dispatched when we want to give up on the initial fetch. It navigates to the 'accounts' screen so a user can try a different account and server. Logging out wouldn't be good; the credentials may be perfectly fine, and we'd like to keep them around to try again later. It sets `needsInitialFetch` to `false` [1], just like `INITIAL_FETCH_COMPLETE`, while retaining a different meaning than that action (i.e., that the fetch was aborted instead of completed). Setting `needsInitialFetch` to false is necessary to ensure that a subsequent initial fetch can be triggered when we want it to be. As also noted in 7caa4d0, `needsInitialFetch` is "edge-triggered". (That edge-triggering logic seems complex and fragile, and it would be nice to fix that.) See also discussion at https://chat.zulip.org/#narrow/stream/243-mobile-team/topic/Stuck.20on.20loading.20screen/near/907591.

…ch`. The TODO doesn't specifically mention that we want to retry on network-failure errors, but we do, so add that too. If we get some strange, totally unexpected error caused by a bug somewhere, we shouldn't continue the retry loop. Not only would that waste the user's time, but it means the error basically gets swallowed, and it makes it much harder for us to detect it and debug effectively. So, fail early on exceptions like that. See our doc on exception handling: https://github.com/zulip/zulip-mobile/blob/master/docs/style.md#catch-at-ui.

So far, `tryFetch`'s only callers are in the initial fetch; so, add handling for the `TimeoutError` there. The choice of value for `requestLongTimeoutMs` comes from a suggestion in zulip#4165's description: > As a minimal version, if the initial fetch takes a completely > unreasonable amount of time (maybe one minute), we should give up > and take you to the account-picker screen so you can try a > different account and server. Fixes: zulip#4165

As Greg points out [1]: > Basically it's an internal error in the app. But the user is > probably better off getting kicked back to the account picker than > having the loading screen continue indefinitely -- it means at > least they can try again and see if the issue recurs, or try > another account and see if they can at least use that. [1] zulip#4754 (comment)

As Greg points out [1], this makes the most sense conceptually; it's happening at the bottom of the loop, just before a new iteration starts. The `return` in the `try` block is enough to ensure that this wait won't interfere with a successful fetch. [1]: zulip#4166 (comment)

Greg points out that we don't always want to retry the initial fetch on server errors [1]: > Ah, I think in #M4165 the point is that if the server isn't > responding, we want to give you the option to go choose some other > account. The context there is that we're in the initial fetch, so > showing the loading screen, and as long as we're doing that > there's no other UI. > So yeah, I think basically we don't want to do any retrying here. > Instead we can kick you to the account-picker screen, with a toast > or something to indicate an error, and then you might manually > retry a time or two or you might bail and switch to some other > account. > And in particular if you didn't even want to be using that account > anymore -- maybe you even know that it's a server which is > permanently shut down, but it just happened to be the last one > you'd been using in the app and so it's the one we tried loading > data from on startup -- then you can go use whatever other account > you were actually opening the app to use. If we're in an initial fetch from an already active, logged-in account that we have cached data for, though, we might as well take the time to retry (still with the 60 second timeout) because the user will be seeing a useful, interactive UI. [1] https://chat.zulip.org/#narrow/stream/243-mobile-team/topic/Stuck.20on.20loading.20screen/near/1178689

A `TimeoutError` will be handled the same way other errors in `fetchMessages` are handled; if it's a timeout in the fetch `ChatScreen` does on mount, `ChatScreen` will show the `FetchError` component we set up in zulip#4205. There's also been a passing mention on CZO of doing a timeout like this [1]: > After a long time, probably like a minute, we'll want that [...] > fetch to time out and fail in any case. [1] https://chat.zulip.org/#narrow/stream/243-mobile-team/topic/.23M4156.20Message.20List.20placeholders/near/950853

And add that message and the existing message to messages_en.json; looks like we forgot to add the existing one.

gnprice · 2021-07-16T00:24:00Z

Looks good, thanks! Merging.

chrisbobbe · 2021-07-16T00:26:45Z

Thanks for the review! 🙂

I got the following error when I ran `tools/tx-sync`: ``` Step 3: Upload strings to translate... + tx --quiet push -s tx ERROR: Error received from server: Duplicate string key ('Oops! Something went wrong\.') in line 247 tx ERROR: Could not upload source file. You can use --skip to ignore this error and continue the execution. tx ERROR: Error received from server: Duplicate string key ('Oops! Something went wrong\.') in line 247 ``` Looks like maybe a rebase error when I was working on zulip#4754. The string was already added, in zulip#4829.

This is our first use of `react-test-renderer`. It piggy-backs on our incorporation of Jest's "modern" fake-timer implementation in PRs zulip#4754 and zulip#4931. That was handy! I haven't yet found any test cases that fail with our implementation. (And I'd been hoping to, to debug an unexpected error!) But I did try pasting in an earlier iteration of the hook's implementation, from zulip#4940, that Greg had found bugs in by reading the code. Many of these tests failed on that buggy implementation, which is a good sign. Might as well keep these new tests, then, if they're not an unreasonable maintenance burden.

chrisbobbe requested review from gnprice and WesleyAC May 20, 2021 21:40

chrisbobbe mentioned this pull request May 20, 2021

Initial fetch timeout (with working tests) #4193

Closed

chrisbobbe added the P1 high-priority label May 20, 2021

chrisbobbe mentioned this pull request May 20, 2021

Prep for timing out the initial fetch. #4753

Merged

chrisbobbe added the blocked on other work To come back to after another related PR, or some other task. label May 20, 2021

chrisbobbe force-pushed the pr-initial-fetch-timeout-again branch from 95a4fc6 to 6c8b787 Compare May 21, 2021 16:55

chrisbobbe removed the blocked on other work To come back to after another related PR, or some other task. label May 21, 2021

gnprice reviewed May 22, 2021

View reviewed changes

chrisbobbe force-pushed the pr-initial-fetch-timeout-again branch from 6c8b787 to 118782f Compare June 2, 2021 20:42

chrisbobbe force-pushed the pr-initial-fetch-timeout-again branch from 118782f to 1683743 Compare June 25, 2021 00:16

gnprice reviewed Jul 13, 2021

View reviewed changes

gnprice mentioned this pull request Jul 14, 2021

Clean up API error-distinguishing logic #4896

Merged

chrisbobbe force-pushed the pr-initial-fetch-timeout-again branch from 1683743 to b47f617 Compare July 15, 2021 19:29

chrisbobbe added 8 commits July 15, 2021 17:23

fetchActions: Improve jsdoc.

2436ac2

fetchActions: Solidify an assumption and add logging if it's violated.

bf28c95

chrisbobbe added 5 commits July 15, 2021 17:23

fetchActions [nfc]: Let tryFetch's callers opt out of retrying.

7664460

FetchError: Show a specific message if the error is a TimeoutError.

06f07c0

And add that message and the existing message to messages_en.json; looks like we forgot to add the existing one.

fetchActions: Add Sentry logging for message-fetch errors.

ab62f73

gnprice force-pushed the pr-initial-fetch-timeout-again branch from b47f617 to ab62f73 Compare July 16, 2021 00:24

gnprice merged commit ab62f73 into zulip:master Jul 16, 2021

chrisbobbe mentioned this pull request Jul 16, 2021

i18n: Remove a duplicate message key. #4902

Merged

chrisbobbe mentioned this pull request Jul 30, 2021

Stop using Lolex, since we're using Jest's "modern" fake timers. #4931

Merged

chrisbobbe mentioned this pull request Sep 9, 2021

reactUtils tests: Start testing useHasStayedTrueForMs. #4997

Merged

chrisbobbe deleted the pr-initial-fetch-timeout-again branch November 4, 2021 21:52

gnprice mentioned this pull request Mar 17, 2022

Sending outbox messages is fraught with issues #3881

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Time out initial fetch and go to account-picker screen. #4754

Time out initial fetch and go to account-picker screen. #4754

chrisbobbe commented May 20, 2021 •

edited

chrisbobbe commented May 20, 2021 •

edited

chrisbobbe commented May 21, 2021 •

edited

chrisbobbe commented May 22, 2021 •

edited

gnprice left a comment

gnprice May 21, 2021

gnprice May 21, 2021

gnprice May 21, 2021

gnprice May 22, 2021

gnprice May 22, 2021

chrisbobbe May 24, 2021

gnprice May 25, 2021

chrisbobbe commented Jun 2, 2021

chrisbobbe commented Jun 25, 2021

gnprice left a comment

gnprice Jul 13, 2021

gnprice Jul 13, 2021

gnprice Jul 13, 2021

chrisbobbe Jul 14, 2021

gnprice Jul 13, 2021

gnprice commented Jul 14, 2021

chrisbobbe commented Jul 15, 2021

gnprice commented Jul 16, 2021

chrisbobbe commented Jul 16, 2021

		* Makes a request that retries until success or a non-5xx error;
		* times out after `config.requestLongTimeoutMs`.

		export const isServerError = (e: Error): boolean =>
		e instanceof ApiError && e.httpStatus >= 500 && e.httpStatus <= 599;

Time out initial fetch and go to account-picker screen. #4754

Time out initial fetch and go to account-picker screen. #4754

Conversation

chrisbobbe commented May 20, 2021 • edited

chrisbobbe commented May 20, 2021 • edited

chrisbobbe commented May 21, 2021 • edited

chrisbobbe commented May 22, 2021 • edited

gnprice left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

chrisbobbe commented Jun 2, 2021

chrisbobbe commented Jun 25, 2021

gnprice left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

gnprice commented Jul 14, 2021

chrisbobbe commented Jul 15, 2021

gnprice commented Jul 16, 2021

chrisbobbe commented Jul 16, 2021

chrisbobbe commented May 20, 2021 •

edited

chrisbobbe commented May 20, 2021 •

edited

chrisbobbe commented May 21, 2021 •

edited

chrisbobbe commented May 22, 2021 •

edited