Initial fetch timeout (with working tests) #4193

Closed

chrisbobbe
Contributor

@chrisbobbe chrisbobbe commented Jul 16, 2020

This is the "minimal version" of a fix Greg describes in the issue; as he mentions there, it would be good to make follow-up improvements.

Fixes: #4165

EDIT: The below hack is no longer necessary, and this branch doesn't use it (Greg and I worked it out on the phone).

EDIT, again: Deleted the description of that hack in case it was deterring a review. 🙂

@chrisbobbe chrisbobbe marked this pull request as draft July 16, 2020 23:16
@chrisbobbe chrisbobbe force-pushed the pr-initial-fetch-timeout-with-tests branch 2 times, most recently from 61e2579 to c83268b Compare August 25, 2020 22:22
@chrisbobbe
Contributor Author

chrisbobbe commented Sep 23, 2020

(Deleted a long comment about being stalled on a RN thing, which we've moved past; see below.)

@chrisbobbe
Contributor Author

The current status of this is that I'm not sure I want to ram through a change in React Native (in our fork) that I know breaks React Native's tests, even if the code seems to work for us:

And, as I just noted at zulip/react-native#5 (comment), we've found an easier, less invasive way! I'll push a new revision with that soon and finally un-mark this as a draft.

@chrisbobbe chrisbobbe force-pushed the pr-initial-fetch-timeout-with-tests branch from 48b3a71 to c4137b8 Compare October 27, 2020 01:53
@chrisbobbe chrisbobbe marked this pull request as ready for review October 27, 2020 01:53
@chrisbobbe
Contributor Author

> And, as I just noted at zulip/react-native#5 (comment), we've found an easier, less invasive way! I'll push a new revision with that soon and finally un-mark this as a draft.

OK!

@chrisbobbe
Contributor Author

I just fixed some conflicts and made a few commit-message tweaks. 🙂

@chrisbobbe
Contributor Author

I just fixed some conflicts. 🙂

@chrisbobbe chrisbobbe force-pushed the pr-initial-fetch-timeout-with-tests branch 2 times, most recently from 1fe3ed6 to f6a2685 Compare April 13, 2021 01:05
@chrisbobbe
Contributor Author

Just fixed a small conflict; small bump for a review (if still a P1 issue, at least). 🙂

Contributor

@WesleyAC WesleyAC left a comment

Reviewed up until the commits removing Lolex.

Thanks for working on this!

* Notify Redux that we've given up on the initial fetch.
*
* Either because our timeout implementation says we've tried for long
* enough, or because the server has responded with a 5xx error.
Contributor

This commit talks about 5xx errors, the previous commit about 4xx errors. Is that correct? Does this action get sent if there's a 4xx error, or not?

Contributor Author

> This commit talks about 5xx errors, the previous commit about 4xx errors. Is that correct?

I believe so, although I think I could be clearer. 🙂 Specifically, the role I have in mind for INITIAL_FETCH_ABORT is not for cases where breaking from the retry loop is the only rational choice—rather, it's for when we just have a pretty good idea that retrying has stopped being useful, even if that isn't proven. Possibly renaming INITIAL_FETCH_ABORT could help here; at least, I should expand the jsdoc.

On a 4xx error, breaking from the retry loop is the only rational choice, and I don't have this action getting dispatched on those. It's a client error; what we've been sending the server isn't any good, and we should bail on this retry loop. We should send a Sentry error report, if the server tells us we've given it garbage; or log out, if the auth is invalid; and so on. I suppose, if we were nimble enough as a client, there might be cases where we have some other reasonable input to try giving the server as a fallback—but I think even in those cases it's cleanest to throw out the current retry loop and start a new one, with fresh backoff state.

If we've just waited long enough (that we don't think we'll get an answer soon, or that we think the user wants to stop waiting, etc.), that's a suggestion that we should break out of the retry loop, but it's not a proof that waiting longer would be useless. Same with 5xx errors, I think—the server hasn't told us we've done anything wrong, and the issue might clear up in a second or two. But, realistically, what are the chances that it'll be cleared up soon anyway...might as well give up, even though we might have gotten lucky if we kept trying. 🤷

(I do see that I haven't actually started dispatching INITIAL_FETCH_ABORT on 5xx errors in this revision, which is quite old…huh, perhaps I should do that! 🤔)
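A minimal sketch of the distinction drawn above, not the code in this branch: a 4xx error leaves the retry loop immediately, while a 5xx error is retried until a time limit suggests giving up. The names `HttpError`, `httpStatus`, and `sketchTryFetch` are assumptions made for illustration, and a fixed delay stands in for the backoff state. The time check here only runs between attempts, so it would not interrupt an attempt that hangs; interrupting in-progress attempts is the job the PR gives to `promiseTimeout`.

```javascript
// Hypothetical error type: `httpStatus` stands in for however the real
// API errors expose the response status.
class HttpError extends Error {
  constructor(httpStatus) {
    super(`HTTP ${httpStatus}`);
    this.httpStatus = httpStatus;
  }
}

class TimeoutError extends Error {}

// Assumed predicate: true only for 4xx responses.
const isClientError = e =>
  e instanceof HttpError && e.httpStatus >= 400 && e.httpStatus <= 499;

const sleep = ms => new Promise(resolve => setTimeout(resolve, ms));

/**
 * Call `func` until it succeeds. Rethrows immediately on a 4xx error;
 * throws TimeoutError once `maxTimeMs` has elapsed between attempts.
 */
async function sketchTryFetch(func, { maxTimeMs, delayMs }) {
  const start = Date.now();
  while (true) {
    try {
      return await func();
    } catch (e) {
      if (isClientError(e)) {
        // What we sent was bad; retrying the same request can't help.
        throw e;
      }
      // Anything else: fall through and retry.
    }
    if (Date.now() - start >= maxTimeMs) {
      // Not proof that retrying is useless, just a decision to stop.
      throw new TimeoutError('Gave up retrying');
    }
    await sleep(delayMs);
  }
}
```

A caller would treat the two failure modes differently: a rethrown 4xx error goes to error reporting or logout, while a `TimeoutError` is the cue for something like `INITIAL_FETCH_ABORT`.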

Contributor Author

Though, seeing the ratio of the amount I've said just now, above, to what actually made it into comments/jsdoc, I think it could be helpful to have my assumptions examined, possibly on CZO; what do you think? 🙂

Comment on lines +181 to +187
NavigationService.dispatch(resetToAccountPicker());
dispatch(initialFetchAbortPlain());
Contributor

Is this a common pattern? It seems strange to me; I'd expect the navigation to be driven by the receiver of the action, not the sender. But I don't know much about this.

Contributor Author

> I'd expect the navigation to be driven by the receiver of the action, not the sender.

Yeah; we used to have something like that.

Before #3804 was resolved, we would store React Navigation's state in our Redux store, and we had a navReducer that we could make recognize an action like initialFetchAbortPlain() and update the nav state accordingly.

(We actually mixed that approach with dispatching actions from the action creators in navActions, with the same dispatch we use for all the other actions. NavigationService is a stepping stone toward using React Navigation's navigation prop/object.)

With React Navigation's state not in our Redux store anymore, that old approach in navReducer isn't available.

We also can't add a go-navigate-somewhere side effect to our existing reducers because reducers explicitly aren't the place for side effects.

The choice to put NavigationService.dispatch(resetToAccountPicker()); (or I guess, one day, navigation.reset({ index: 0, routes: [{ name: 'account-pick' }] }); or similar) in the thunk action creator initialFetchAbort is a bit subtle: it follows our strategy to eliminate a class of bugs where we try to do stuff before the store has rehydrated; Greg has a good explanation of that here.

I've also thought it might be nice to not count on remembering to call go-navigate-somewhere code alongside dispatching things like logoutPlain(). The fact that we do is a reminder that dispatching logoutPlain() doesn't…finish the job of logging out: navigating away from the logged-in screens is an important part of that. 🙂

store.subscribe() comes to mind, but that's a low-level API and it doesn't tell a passed handler about actions that have been dispatched; only the fact that an action has been dispatched.

I think that leaves Redux middleware—I could imagine writing some middleware that does things like see an INITIAL_FETCH_ABORT-type action go by, and navigate based on that.
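The middleware idea in the last paragraph could look roughly like the sketch below. It is not part of this PR: `createNavigationMiddleware` and the `navigate` callback are hypothetical names, with `navigate` standing in for a call like `NavigationService.dispatch(resetToAccountPicker())`. The only thing assumed from Redux is the standard middleware shape, `store => next => action`.

```javascript
const INITIAL_FETCH_ABORT = 'INITIAL_FETCH_ABORT';

// `navigate` is a hypothetical callback; in the app it might wrap
// NavigationService.dispatch(resetToAccountPicker()).
const createNavigationMiddleware = navigate => store => next => action => {
  // Let the reducers handle the action first, then do the side effect.
  const result = next(action);
  if (action.type === INITIAL_FETCH_ABORT) {
    // Route name taken from the comment above; treat it as illustrative.
    navigate('account-pick');
  }
  return result;
};
```

Unlike `store.subscribe()`, middleware sees each action as it goes by, so the navigation can be keyed on the action type instead of relying on every dispatch site to remember it.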

Contributor Author

> I've also thought it might be nice to not count on remembering to call go-navigate-somewhere code alongside dispatching things like logoutPlain(). The fact that we do is a reminder that dispatching logoutPlain() doesn't…finish the job of logging out: navigating away from the logged-in screens is an important part of that. 🙂

Ah, I just edited logout() here to logoutPlain(), after I saw that that's what I meant. I think the situation there isn't quite as bad as I'd thought; logout() (the thunk action) does do the essential job of navigating, which makes it a bit more understandable that logoutPlain() does not. 🙂

Contributor Author

Similarly, in this PR, the thunk action initialFetchAbort() does do the essential job of navigating, while the plain action initialFetchAbortPlain() doesn't.

Contributor Author

Let me know if I can unpack any of that, maybe on CZO. 🙂

/**
* Time-out a Promise after `timeLimitMs` has passed.
*
* Returns a new Promise with the same outcome as `promise`, if
Contributor

The word "outcome" here (and in the ¶ below) is a bit confusing to me — is this common Promise terminology? I'd find something like:

Returns a new Promise that resolves to whatever `promise` resolved to, if
`promise` completes in time.

If `promise` does not complete before `timeLimitMs` has passed, this function
calls `onTimeout` and returns a promise that resolves to its return value.

clearer.

Contributor Author

Yeah; I looked around for a more common term to use; the MDN doc on Promise just uses "outcome" once, and I liked it when it did. Hmm.

I meant to write the interface without assuming promise will resolve; instead, it might reject. See Greg's comment at #4166 (comment).

Maybe something like this:

 * Returns a new Promise with the same outcome (resolved/rejected) as
 * `promise`, if `promise` completes in time.
 *
 * If `promise` does not complete before `timeLimitMs` has passed,
 * `onTimeout` is called, and its outcome is used as the outcome of
 * the returned Promise.
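For concreteness, here is one way the interface described in that jsdoc could be implemented. This is a sketch under the stated contract only, not necessarily the implementation in this branch; the parameter order is an assumption.

```javascript
/**
 * Returns a new Promise with the same outcome (resolved/rejected) as
 * `promise`, if `promise` completes in time.
 *
 * If `promise` does not complete before `timeLimitMs` has passed,
 * `onTimeout` is called, and its outcome is used as the outcome of
 * the returned Promise.
 */
function promiseTimeout(promise, timeLimitMs, onTimeout) {
  return new Promise((resolve, reject) => {
    const timer = setTimeout(() => {
      try {
        // `resolve` adopts the outcome of a returned Promise, so an
        // async `onTimeout` that rejects also rejects the result.
        resolve(onTimeout());
      } catch (e) {
        // A synchronous throw from `onTimeout` becomes a rejection.
        reject(e);
      }
    }, timeLimitMs);

    promise.then(
      value => {
        clearTimeout(timer);
        resolve(value);
      },
      error => {
        clearTimeout(timer);
        reject(error);
      },
    );
  });
}
```

Because a Promise settles only once, whichever of the timer and `promise` finishes first decides the outcome; the later call to `resolve` or `reject` is ignored.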

@@ -271,26 +270,46 @@ const fetchPrivateMessages = () => async (dispatch: Dispatch, getState: GetState
* If the function is an API call and the response has HTTP status code 4xx
* the error is considered unrecoverable and the exception is rethrown, to be
* handled further up in the call stack.
*
* After a certain duration, times out with a TimeoutError.
Contributor

s/a certain duration/MAX_TIME_MS/

Contributor Author

(EDIT: my point below is moot in the current revision)


I think the following point will be moot if we move MAX_TIME_MS (with a rename) somewhere else, like in config.js. 🙂 But it's a chance to bring up one of our habits so far.

In the code you're looking at, I intentionally left MAX_TIME_MS out of the jsdoc. Our pattern with functions has been to use jsdoc for interface, and //-style comments for implementation. While I haven't always gotten this right, I think I might have in this case (though with the big asterisk in my previous paragraph 😉): someone reading the jsdoc can't resolve MAX_TIME_MS to a meaningful value without peeking under the hood and looking at its definition in the function's implementation. For that reason, I figured it didn't belong in the jsdoc.

I realized one thing quite a while after learning about that pattern: VSCode, at least with our config, parses a function's jsdoc and shows it to would-be callers on a hover interaction, etc. (I imagine other editors can be made to do that too.) When I remember this, it can help me decide what's interface vs. implementation: a good jsdoc will help callers do their business without even having to open the file that the function is defined in.

I think the right way forward, here, might be to define the constant in config.js, in which case we can freely refer to it in the jsdoc because config.js is a scope that callers will naturally be aware of. Does that sound right? And @gnprice, how's my interpretation of the jsdoc / // pattern here; is it too rigid? 🙂

await backoffMachine.wait();
}
}
// Without this, Flow 0.92.1 does not know this code is unreachable,
Contributor

Aren't we using Flow 0.128.0 now? Does that have this same bug?

Contributor Author

Yeah, looks like it; I'll update the comment.

Contributor Author

Er, and this has disappeared with the change to promiseTimeout's interface described at #4193 (comment).

🎉

*/
export async function tryFetch<T>(func: () => Promise<T>): Promise<T> {
const MAX_TIME_MS: number = 60000;
Contributor

A minute seems quite long for this timeout. How'd you decide on that?

(It could be reasonable; it's certainly quite bad if we get into a failure loop because we picked too optimistic a value. I'm just curious how you arrived here.)

Contributor Author

That came from the description of #4165; it's not super precise:

> As a minimal version, if the initial fetch takes a completely unreasonable amount of time (maybe one minute), we should give up and take you to the account-picker screen so you can try a different account and server.

Still, good idea to say something in a code comment—possibly this actually belongs in config.js?

@chrisbobbe
Contributor Author

chrisbobbe commented Apr 21, 2021

Thanks for the review, @WesleyAC! I've left some responses on each of your comments.

Looking again, I wonder if we really want to model promiseTimeout so closely on that Dart API. I don't really know anything about Dart, but it seems like we could reasonably simplify our promiseTimeout by omitting onTimeout and just having the returned Promise reject with a TimeoutError if the time is up and promise hasn't resolved/rejected yet.
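A sketch of what that simplified interface might look like, assuming a `TimeoutError` class defined as below (the real class may differ) and using `Promise.race`, which is one possible implementation and not necessarily the one that landed:

```javascript
// Assumed definition; the codebase's TimeoutError may carry more detail.
class TimeoutError extends Error {
  constructor() {
    super('Timed out');
    this.name = 'TimeoutError';
  }
}

/**
 * Returns a new Promise with the same outcome as `promise`, if
 * `promise` completes in time; otherwise rejects with a TimeoutError.
 */
function promiseTimeout(promise, timeLimitMs) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new TimeoutError()), timeLimitMs);
  });
  // Whichever settles first wins; always clean up the timer afterward.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}
```

With this shape, callers that want fallback behavior do it themselves with `try`/`catch` and an `instanceof TimeoutError` check, so `onTimeout` is no longer needed.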

@chrisbobbe
Contributor Author

chrisbobbe commented May 12, 2021

(Awaiting discussion here before continuing with my revision.)

I've got an answer there and am ready to continue working on this. 🙂

So it's easier to compare the inputs we're testing. 10ms is a
realistic amount of time for an API request to take. It's also less
than 100ms, which had been used for one of these. So, our tests will
run a bit faster (and they'll run a lot faster when we switch over
to fake timers soon).

And make the tests more rigorous while we're at it.

When we add a timeout to `tryFetch`, we'll want to use fake timers
so that the tests don't try to do inconvenient things like waiting a
whole minute for something to happen.

The fact that this test passes isn't good -- it means a basically
arbitrary, unexpected kind of error will let the retry loop continue
without propagating to `tryFetch`'s caller. We'll fix that logic
soon, and add a test case with an error like that. But for now, test
that we get the right behavior with representative inputs.

We'll use this in `tryFetch`, in an upcoming commit, so that we can
interrupt in-progress attempts to contact a server when they take
too long. See discussion [1].

[1]: https://chat.zulip.org/#narrow/stream/243-mobile-team/topic/Stuck.20on.20loading.20screen/near/907693

To be dispatched when it doesn't seem like we'll get a response from
the server, or when the server responds with a 5xx error.

It navigates to the 'accounts' screen so a user can try a different
account and server. Logging out wouldn't be good; the credentials
may be perfectly fine, and we'd like to keep them around to try
again later.

It sets `needsInitialFetch` to `false` [1], just like
`INITIAL_FETCH_COMPLETE`, while retaining a different meaning than
that action (i.e., that the fetch was aborted instead of completed).

Setting `needsInitialFetch` to false is necessary to ensure that a
subsequent initial fetch can be triggered when we want it to be. As
also noted in 7caa4d0, `needsInitialFetch` is "edge-triggered".
(That edge-triggering logic seems complex and fragile, and it would
be nice to fix that.)

See also discussion [1].

[1]: https://chat.zulip.org/#narrow/stream/243-mobile-team/topic/Stuck.20on.20loading.20screen/near/907591
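
The "edge-triggered" point in the commit message above can be illustrated with a small sketch. This is not the app's reducer: `sketchReducer`, `shouldStartFetch`, and the `REQUEST_INITIAL_FETCH` action type are hypothetical, the last one standing in for whatever sets the flag (such as logging in or switching accounts).

```javascript
const INITIAL_FETCH_COMPLETE = 'INITIAL_FETCH_COMPLETE';
const INITIAL_FETCH_ABORT = 'INITIAL_FETCH_ABORT';
// Hypothetical action type, standing in for whatever sets the flag.
const REQUEST_INITIAL_FETCH = 'REQUEST_INITIAL_FETCH';

function sketchReducer(state = { needsInitialFetch: false }, action) {
  switch (action.type) {
    case REQUEST_INITIAL_FETCH:
      return { ...state, needsInitialFetch: true };
    case INITIAL_FETCH_COMPLETE:
    case INITIAL_FETCH_ABORT:
      // Both clear the flag, though they mean different things.
      return { ...state, needsInitialFetch: false };
    default:
      return state;
  }
}

// Edge-triggered: a fetch starts only on a false -> true transition.
const shouldStartFetch = (prev, next) =>
  !prev.needsInitialFetch && next.needsInitialFetch;
```

If an aborted fetch left the flag at `true`, a later request would produce no false-to-true edge and no new fetch would start, which is why the abort action has to reset it.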
Change the condition for exiting the retry loop from
`isClientError(e)` to `!isServerError(e)`.
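
To see why that change matters, compare the two exit conditions on an error that is neither 4xx nor 5xx. The predicates below are hypothetical stand-ins (the real `isClientError` and `isServerError` may inspect errors differently), and `httpStatus` is an assumed field name.

```javascript
// Hypothetical stand-ins for the real predicates.
const statusOf = e =>
  e != null && typeof e.httpStatus === 'number' ? e.httpStatus : null;

const isClientError = e => {
  const status = statusOf(e);
  return status !== null && status >= 400 && status <= 499;
};

const isServerError = e => {
  const status = statusOf(e);
  return status !== null && status >= 500 && status <= 599;
};

// Old exit condition: only 4xx errors leave the retry loop.
const shouldExitOld = e => isClientError(e);

// New exit condition: anything that isn't a 5xx error leaves the loop,
// so an unexpected error (say, a TypeError) propagates to the caller.
const shouldExitNew = e => !isServerError(e);
```

Under the old condition an arbitrary error such as a `TypeError` would keep the loop retrying; under the new one it exits, and only 5xx responses are retried.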
So far, `tryFetch`'s only callers are in the initial fetch; so, add
handling for the `TimeoutError` there.

The choice of value for `requestLongTimeoutMs` comes from a
suggestion in zulip#4165's description:

> As a minimal version, if the initial fetch takes a completely
> unreasonable amount of time (maybe one minute), we should give up
> and take you to the account-picker screen so you can try a
> different account and server.

Fixes: zulip#4165

As Greg points out [1], this makes the most sense conceptually; it's
happening at the bottom of the loop, just before a new iteration
starts. The `return` in the `try` block is enough to ensure that
this wait won't interfere with a successful fetch.

[1]: zulip#4166 (comment)

Greg points out that the initial fetch isn't actually a place where
we want to retry on 5xx errors [1]:

> Ah, I think in #M4165 the point is that if the server isn't
> responding, we want to give you the option to go choose some other
> account. The context there is that we're in the initial fetch, so
> showing the loading screen, and as long as we're doing that
> there's no other UI.

> So yeah, I think basically we don't want to do any retrying here.
> Instead we can kick you to the account-picker screen, with a toast
> or something to indicate an error, and then you might manually
> retry a time or two or you might bail and switch to some other
> account.

> And in particular if you didn't even want to be using that account
> anymore -- maybe you even know that it's a server which is
> permanently shut down, but it just happened to be the last one
> you'd been using in the app and so it's the one we tried loading
> data from on startup -- then you can go use whatever other account
> you were actually opening the app to use.

This does mean that `tryFetch` now has no callsites. It's a useful
function, so we'll add a useful callsite soon so the function still
gets exercised and maintained as appropriate.

[1] https://chat.zulip.org/#narrow/stream/243-mobile-team/topic/Stuck.20on.20loading.20screen/near/1178689

A `TimeoutError` will be handled the same way other errors in
`fetchMessages` are handled; if it's a timeout in the fetch
`ChatScreen` does on mount, `ChatScreen` will show the `FetchError`
component we set up in zulip#4205.

There's also been a passing mention on CZO of doing a timeout like
this [1]:

> After a long time, probably like a minute, we'll want that [...]
> fetch to time out and fail in any case.

[1] https://chat.zulip.org/#narrow/stream/243-mobile-team/topic/.23M4156.20Message.20List.20placeholders/near/950853

And add that message and the existing message to messages_en.json;
looks like we forgot to add the existing one.

Use Jest's "modern" fake timers instead of our Lolex wrapper.

Also, remove one `describe` block for tests that examine an
edge-case safety feature that we built into our Lolex wrapper, but
that doesn't seem to exist in Jest. Ah, well.

Use Jest's "modern" fake timers instead of our Lolex wrapper.

Use Jest's "modern" fake timers instead of our Lolex wrapper.

Use Jest's "modern" fake timers instead of our Lolex wrapper.

Use Jest's "modern" fake timers instead of our Lolex wrapper.

We've entirely switched over to Jest's "modern" fake timers, which
landed in jestjs/jest#7776.

Also, remove several now-unnecessary calls of
`jest.useFakeTimers('modern')`, but keep a few assertions that the
"modern" timers are actually being used.

In particular, our `jestSetup` is a central place where we make the
assertion. Not only is it good to check that we still intentionally
set the "modern" implementation, but we want to make sure that the
setting is correctly applied. See the note in fb23341 about it
being silently not applied until we added @jest/source-map as a
direct dependency.

We have an ESLint rule, from 2faad06, preventing imports from
'**/__tests__/**'; the rule is active in all files not matching that
same pattern. Add an additional override so that we can make the
"modern"-timers assertion from within `jest/jestSetup.js`.

Follow and delete a code comment at the top of
`backoffMachine-test`, suggesting that we move these tests.
@chrisbobbe chrisbobbe force-pushed the pr-initial-fetch-timeout-with-tests branch from f6a2685 to 8c20c8b Compare May 19, 2021 19:59
@chrisbobbe
Contributor Author

chrisbobbe commented May 19, 2021

Revision pushed!

This one keeps having tryFetch retry on 5xx errors, but then the initial fetch stops using tryFetch and just uses promiseTimeout directly instead, since it's clear that we don't want to retry the initial fetch on 5xx errors (which is to say, we don't want to retry the initial fetch at all).

And this revision starts using tryFetch with GET /messages; we've discussed having that time out after a minute, as mentioned in that commit.

There are a lot of commits in this revision, so if you spot an opportunity to merge the first n commits, or something, feel free to do that, to help focus the review. I'm also happy to split this into multiple PRs. 🙂

@chrisbobbe
Contributor Author

chrisbobbe commented May 20, 2021

Closing as superseded by #4753 and #4754; we've decided it's best to split this into two PRs.

@chrisbobbe chrisbobbe closed this May 20, 2021
@chrisbobbe chrisbobbe deleted the pr-initial-fetch-timeout-with-tests branch November 5, 2021 00:35
Development

Successfully merging this pull request may close these issues.

Time out initial fetch, and go to account-picker screen
2 participants