Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Queries hang and are not issued on network requests #9690

Closed
Tracked by #9718
andreialecu opened this issue May 6, 2022 · 45 comments · Fixed by #9793
Closed
Tracked by #9718

Queries hang and are not issued on network requests #9690

andreialecu opened this issue May 6, 2022 · 45 comments · Fixed by #9793

Comments

@andreialecu
Copy link
Contributor

andreialecu commented May 6, 2022

Intended outcome:

Queries should be dispatched to the network.

Actual outcome:

Queries are not dispatched to the network, they just hang.

While using apollo-angular@3.0.1 with @apollo/client@3.6.x it appears that some queries start hanging after a while.

I'm also using graphql-code-generator to generate an apollo-angular data service, so usage looks like this:

this.someQuery
      .fetch({
        myVariable: 1
      })
      .pipe(map((res) => res.data.someQuery))
      .subscribe(foo => console.log(foo));

After a while (within 1-2 minutes), the subscribe block will not be called any more. This can also be used with rxjs's await firstValueFrom(...) and the promise will never resolve.

Downgrading to 3.6.2 3.5.10 seems to fix the problem.

How to reproduce the issue:

Unfortunately no repro, but hopefully it might be easy to pinpoint what would affect these by comparing 3.6.2 to 3.6.3. 3.6.0 and 3.5.10.

Versions

System:
    OS: macOS 12.3.1
  Binaries:
    Node: 16.14.2 - ~/.nvm/versions/node/v16.14.2/bin/node
    Yarn: 3.2.0-git.20220329.hash-0764215 - ~/.nvm/versions/node/v16.14.2/bin/yarn
  Browsers:
    Chrome: 101.0.4951.54
    Safari: 15.4
  npmPackages:
    @apollo/client: 3.6.3 => 3.6.3
    apollo-angular: ^3.0.1 => 3.0.1 
    apollo-link-scalars: ^4.0.1 => 4.0.1
@andreialecu
Copy link
Contributor Author

FWIW, we're also using HttpBatchLink from apollo-angular and the default options are:

  const defaultOptions = {
    watchQuery: {
      fetchPolicy: 'no-cache',
      errorPolicy: 'ignore',
    },
    query: {
      fetchPolicy: 'no-cache',
      errorPolicy: 'all',
    },
  };

no-cache queries are definitely affected, we only have a few cached ones in this project and I cannot confirm if they're also affected or not at this point.

@benjamn
Copy link
Member

benjamn commented May 6, 2022

@andreialecu Do you think you could put together a small runnable reproduction reproducing the regression? Or open a PR with a failing test (no need to fix it; we will investigate)?

@benjamn benjamn added this to the v3.6.x patch releases milestone May 6, 2022
@andreialecu
Copy link
Contributor Author

andreialecu commented May 6, 2022

@benjamn the only change I see in 3.6.3 that could impact apollo-angular in any way is the one from this commit: 77119b0

The other changes seem to be react only.

I'm not sure if I'll be able to reproduce it in isolation, since it's random when it starts happening. It could be a race condition of some sort with the batching link. I will try to investigate further.

@benjamn
Copy link
Member

benjamn commented May 6, 2022

You can test your theory about that commit by running these commands locally:

git clone git@github.com:apollographql/apollo-client.git
cd apollo-client
npm install
git revert 77119b06b165dbcb02d8ea7aeef17a12a08d059c
npm run build
cd dist
npm pack
cd ..

That will give you a .tgz bundle that you can install locally in your application:

cd ~/path/to/your/app
npm i ~/path/to/apollo-client/dist/apollo-client-3.6.3.tgz

Hope that helps with your investigation (or at least is easier than creating a reproduction)!

@benjamn benjamn self-assigned this May 6, 2022
@andreialecu
Copy link
Contributor Author

@benjamn somehow I'm not able to reproduce this early today, so I'm very confused.

Yesterday it was reproducible every single time in production, staging and development - but it was occurring during peak times.

We use graphql subscriptions and there's a lot of chatter over them, that's the only thing that could be relevant. One of the subscriptions then triggers a query on a certain condition.

I'll try to reproduce it during the next peak.

The problems cleared up for all of our customers after deploying a downgrade to 3.4.17 initially as per my comment in #9456 (comment)

I've then tried upgrading up until 3.6.2 and I couldn't reproduce it. On 3.6.3 it was reproducible again.

Peak times were fading by that point, so it might be possible the problem isn't exactly between 3.6.2 and 3.6.3 but could be earlier. I'll continue monitoring this over the next few days.

@andreialecu
Copy link
Contributor Author

I have an update.

It appears this also happens on 3.6.2 (deployed it in production earlier) so is not a recent regression in 3.6.3.

It seems to be related to peak times somehow (a lot of subscriptions chatter at least).

I have reverted to 3.4.17 for now where everything seems fine. I haven't yet checked 3.5.x.

Since this happens in apollo angular I assume it's an issue in the core and not the react part.

@andreialecu andreialecu changed the title Regression in 3.6.3: Queries hang and are not issued on network requests Queries hang and are not issued on network requests May 7, 2022
@andreialecu
Copy link
Contributor Author

@benjamn another update. I've been troubleshooting this today and setting breakpoints once it started happening.

Seems when it hangs it starts hitting this line of code:

Screenshot 2022-05-07 at 21 18 22

But the error is swallowed somewhere and is never thrown. I'm not sure if it's apollo-angular hiding it.

This is with 3.6.2

I found #7608 which seems related.

@andreialecu
Copy link
Contributor Author

The reason it's probably swallowed is that sub ends up being null here:

Screenshot 2022-05-07 at 21 32 30

@andreialecu
Copy link
Contributor Author

I tracked this down further and it seems to be caused by various rxjs observables interacting with each other and restarting some subscriptions as a side-effect. Seems to be similar to what has been reported in this comment: #7608 (comment)

Notice the screenshot for a stack trace:
Screenshot 2022-05-07 at 21 41 37

setCurrentClub looks like this, essentially pushing a new value into an rxjs ReplaySubject:

Screenshot 2022-05-07 at 21 42 38

This ends up cascading through various observables and tearing down and restarting a watchQuery somewhere.

@andreialecu
Copy link
Contributor Author

Actually the plot thickens.

I've changed some things so that those observables don't resubscribe. This prevented the "Observable cancelled prematurely" from happening, but the queries still hang. 🤔

@andreialecu
Copy link
Contributor Author

andreialecu commented May 7, 2022

Update: in QueryManager, getObservableFromLink seems to have the stuck query in inFlightLinkObservables_1 and because of deduplication it is not issued any more.

I think that's what's causing the issue.

Screenshot 2022-05-07 at 22 40 45

Now as to why it gets stuck there, will need to investigate further.

@andreialecu
Copy link
Contributor Author

So it appears that there's a callback being added to concast.cleanup() that is supposed to remove the observable from that inFlightLinkObservables map.

I think it's not being called most likely due to a race condition/or teardown situation similar to the one causing the "Observable cancelled prematurely" error.

The problem seems to be related to Concast not cleaning up properly:

Possibly relevant:

concast.cleanup(() => {
if (byVariables.delete(varJson) &&
byVariables.size < 1) {
inFlightLinkObservables.delete(serverQuery);
}
});

concast.cleanup(() => {
this.fetchCancelFns.delete(queryId);
if (queryInfo.observableQuery) {
queryInfo.observableQuery["applyNextFetchPolicy"]("after-fetch", options);
}
});

I can reproduce it consistently, if it would help to set up a screen sharing session at some point feel free to DM me on Twitter.

@andreialecu
Copy link
Contributor Author

One other relevant thing. When this starts happening, in QueryManager, there's a memory leak and all of these stuck queries remain in this queries map:

Screenshot 2022-05-08 at 12 36 23

@andreialecu
Copy link
Contributor Author

andreialecu commented May 8, 2022

I tried various versions, I wasn't able to reproduce this on 3.5.10 but it reproduces starting on 3.6.0. Also tried 3.7.0-alpha.2 and it happens there too.

@andreialecu
Copy link
Contributor Author

@benjamn if you have any suggestions which commits I could revert from 3.6.0 to try to isolate this, let me know, and I'll do it ASAP.

@benjamn
Copy link
Member

benjamn commented May 9, 2022

Thanks for all the details @andreialecu! I think you're on the right track, and the Observable canceled prematurely error is a long-standing hard-to-pin-down issue, so it would be great to fix that finally as well. 🤞

I can do some digging/testing today with this information. I'll report back when I have news.

@benjamn
Copy link
Member

benjamn commented May 9, 2022

Using your findings (thanks!), I was able to eliminate the Observable cancelled prematurely error, in favor of a quieter form cancellation/completion: #9701

@andreialecu
Copy link
Contributor Author

Awesome. Is there any quick way to test it to see if it makes any difference to the queries hanging issue?

I fear they're unrelated, since this started in 3.6.0, but the Observable cancelled prematurely error was a long standing issue.

benjamn added a commit that referenced this issue May 9, 2022
…` rejection (#9701)

Should fix/improve issues #7608 and #9690, and possibly others.
@benjamn
Copy link
Member

benjamn commented May 9, 2022

@andreialecu Quick way to test:

npm i @apollo/client@beta

That should install 3.7.0-alpha.3, which includes #9701. I believe these changes will be safe to release in 3.6.4, rather than waiting for v3.7, but it's a little easier to test things on release-3.7 before publishing/burning the 3.6.4 version.

@andreialecu
Copy link
Contributor Author

andreialecu commented May 9, 2022

@benjamn unfortunately the bug in the OP still exists in 3.7.0-alpha.3

Screenshot 2022-05-10 at 02 21 46

Notice how observable contains something, is deduplicated and not issued. Also notice how the Concast got to 41 (stuck) observers.

As mentioned previously, the issue started between 3.5.10 and 3.6.0 but there are too many commits for me to be able to pinpoint one easily. So if you have any clue which one might've touched anything in this area, please let me know so I can try reverting it.

Alternatively, if you have any suggestions where to set any breakpoints and what to inspect, that would also be great.

@patrickxrivera
Copy link

I noticed the same issue in 3.6.0 and downgraded to 3.5.10 which fixed it for me

@benjamn
Copy link
Member

benjamn commented May 10, 2022

@andreialecu A few more questions:

  1. Are those 41 stuck queries all the queries you've sent through getObservableFromLink, or only some?
  2. Is there (still) a similar problem with inFlightLinkObservables and/or byVariables? Do they get emptied by the concast.cleanup function correctly, or do they also grow over time?

@benjamn benjamn added the 🏓 awaiting-contributor-response requires input from a contributor label May 16, 2022
@bentron2000
Copy link

bentron2000 commented May 17, 2022

I am also having this issue.

Some queries hang are never dispatched with 3.6.* - but only when using BatchHttpLink - the issue does not occur when using HttpLink
Just tried 3.7.0-alpha.5 - no joy.

Last version I can get to work with my BatchHttpLink setup is 3.5.10

Don't have time to put together a repro - also it seems intermittent - but can answer questions if it helps...

@benjamn
Copy link
Member

benjamn commented May 17, 2022

@andreialecu Are you also using BatchHttpLink by any chance?

@bentron2000 Even if BatchHttpLink is not involved in the original issue, I think that still merits investigation, possibly in a new issue.

@andreialecu
Copy link
Contributor Author

Yep, I'm using batching, see my comment above. I'm using HttpBatchLink from apollo-angular, here's its implementation: https://github.com/kamilkisiela/apollo-angular/blob/master/packages/apollo-angular/http/src/http-batch-link.ts

I'm not sure if I'll be able to reproduce it in isolation, since it's random when it starts happening. It could be a race condition of some sort with the batching link. I will try to investigate further.

@bentron2000
Copy link

@andreialecu Are you also using BatchHttpLink by any chance?

@bentron2000 Even if BatchHttpLink is not involved in the original issue, I think that still merits investigation, possibly in a new issue.

I'm using React fwiw.

I'll watch and see what happens rather than clogging things up with a new issue for now. Will keep watching and commenting on solutions tho.

@doflo-dfa
Copy link

wow i'm glad I found this I was starting to think I was going insane 😄 useLazyQuery is hanging for us as well while using the most recent alpha-6

@doflo-dfa
Copy link

wow i'm glad I found this I was starting to think I was going insane 😄 useLazyQuery is hanging for us as well while using the most recent alpha-6

I don't think what we were experiencing is a bug but a side effect of the change ... I appears I was not using useLazyQuery correctly in relation to my centralized contextProvider which oddly worked before the switch to 3.6.2+ and the alphas ... we switched to alpha-6 to solve the Observable cancelled prematurely (in production only) against react 18.1.0 ... once we did all lazy queries hung ... never executing whilst using useLazyQuery (our queries that used useQuery worked just fine)

After re-ordering some things on my end to a HLC context provider ... requests started happening again but alpha-2 ( i had downgraded until useLazyQuery worked again ) was returning partial results on the first handful of queries ... eventually completing after subsequent runs, almost like it was returning early ditching out of data recieve ( cache issues ? ) ... after re-upgrading to alpha-6 everything started working correctly, which is good because now I can sleep again.

@nikhilgupta16
Copy link

Is there any update on the issue? I have not been able to upgrade from 3.5.6 citing various issues with every release

@jdmoody
Copy link

jdmoody commented May 27, 2022

I also have hanging queries as described in this issue. Some details:

  • I'm upgrading from 3.2.2 to 3.6.5 (other issues prevent me from upgrading to versions between 3.2 and 3.6)
  • I had the same issue with 3.7.0-alpha.4
  • I'm currently using React 18 in legacy mode with v3.2.2 without any issues
  • I'm using BatchHttpLink and the issue goes away when I replace BatchHttpLink with apollo client's HttpLink
  • It seems to always be the last useQuery call made that hangs (I can go into more detail about how I measured this if helpful).

This has been especially hard to debug because I haven't been able to reproduce it in a dev environment. I'm only able to reproduce it when deploying my app to a production-like environment. Then, when I try to attach a debugger to the box, I'm no longer able to reproduce it 😵

Afaict, this is the only issue blocking me from upgrading to v3.6, which is the only thing blocking me from using certain React 18 features.

It's also unclear to me whether all the behaviors described by folks in this issue are indeed the same issue. If it would be helpful for me to create a separate issue, or if there's any other info I can provide, please let me know 🙏🏻

@benjamn
Copy link
Member

benjamn commented May 31, 2022

@jdmoody I appreciate all the details, and while I share your uncertainty about how many different issues we're talking about here, I think there must be something wrong with BatchHttpLink, probably a bug that's been there for a while but was only revealed by other changes recently, so that's what I'm currently looking into.

@vieira
Copy link

vieira commented Jun 7, 2022

Hello, I have tried 3.6.6 which potentially resolves a possible cause for this issue (#9718) but after some testing the issue is still present.

As everyone else, we are also using BatchHttpLink and the last version that is working without this issue for us is 3.5.10.

@andrew-hu368
Copy link

andrew-hu368 commented Jun 7, 2022

I am not sure if it is related. We've recently upgraded from apollo v2 to apollo v3 (latest version). If I import the old BatchHttpLink from apollo-link-batch-http the queries are successfully run, while the new version doesn't send requests.

We downgraded to 3.5.10 and it seems to work as expected.

@benjamn
Copy link
Member

benjamn commented Jun 7, 2022

Alright, I believe this regression stems originally from PR #9248, which made the batch link capable of cancelling in-flight batched operations when the underlying observable is terminated.

This theory about the source of the regression is consistent with @andrew-hu368's comment about the old apollo-link-batch-http version of BatchHttpLink still working (😂), since that version does not contain PR #9248, which was released more recently (first in v3.6.0-beta.4, and then officially in v3.6.0).

While I am open to making the changes from PR #9248 more purely opt-in, I believe my PR #9793 fixes the problem without completely abandoning #9248.

@andreialecu @jdmoody @bentron2000 @doflo-dfa @nikhilgupta16 @vieira @andrew-hu368 (and anyone else I missed) Please try running npm i @apollo/client@beta to get version 3.7.0-beta.3, which includes #9793.

If that doesn't work, please try npm i @apollo/client@3.7.0-beta.2 (note: 2 not 3) to see if the full reversion of PR #9248 (described in #9793) makes any difference for you. If you see any differences in behavior between these two versions, please describe the differences here in detail. I'm hoping the simpler changes in @apollo/client@3.7.0-beta.3 are enough to fix the regression, without completely undoing #9248.

@andrew-hu368
Copy link

Run some e2e tests on 3.7.0-beta.3 and they are all green. Thanks for the update @benjamn

@jdmoody
Copy link

jdmoody commented Jun 8, 2022

I did some testing on @apollo/client@3.7.0-beta.3 and I can no longer reproduce the hanging queries 🙌🏻

Thanks Ben!

@andreialecu
Copy link
Contributor Author

Can also confirm that beta.3 seems to work.

I just changed to beta.0 to verify that I can still repro the error and I could very quickly.

Then upgraded to beta.3 and I wasn't able to reproduce it, at least not within a few minutes of testing - by that point it would definitely happen with previous versions.

Seems fixed. Feel free to close the issue after releasing it to stable, or at your convenience. Thanks @benjamn!

@bentron2000
Copy link

I also did the comparison between @apollo/client@beta (3) and @apollo/client@3.7.0-beta.0 and can confirm that beta.3 works where the beta.0 immediately has the original issue.
Seems fixed to me.
Thank you!

benjamn added a commit that referenced this issue Jun 10, 2022
PR #9793 was first released in v3.7.0-beta.3 for testing, and now (in
this PR) will be backported to the `main` branch, to be released in the
next v3.6.x patch version, fixing a `BatchHttpLink` regression
introduced in v3.6.0 that silently discarded some pending queries.

Evidence this worked:
- #9773 (comment)
- #9690 (comment)
- #9690 (comment)
- #9690 (comment)
- #9690 (comment)
@benjamn
Copy link
Member

benjamn commented Jun 10, 2022

Thanks for the confirmations everyone!

This should now be fixed if you run npm i @apollo/client@latest to get version 3.6.7 (or later).

@benjamn benjamn closed this as completed Jun 10, 2022
@damiangreen
Copy link

Hurrah , 3.6.9 fixes it

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.