Severe performance degradation with long running processes #20

Closed
vongruenigen opened this issue Dec 14, 2023 · 3 comments

@vongruenigen

Hi there

I'm working at a company that uses a Nest backend together with neo4j as the data store for business applications. Lately we decided to migrate away from drivine, because we had been facing race-condition / deadlock problems with it for a long time, and to use the neo4j driver directly to mitigate those.

The transition was rather smooth: since all of our database operations were abstracted away behind a base repository class, it was mainly a matter of replacing all usages of drivine with the Neo4jService provided by this package. Our first tests all seemed very promising, and the switch solved all the problems we previously had with the other library.

But now comes the catch: locally everything ran fine, and initially in production as well. However, after some time (less than one day under production load), the responses we got from neo4j became slower and slower, up to a 10x increase, which made everything unbearable for our end users (and effectively broke the frontend because of the 30s timeout we enforce). We've been investigating for quite some time now but cannot find the root cause. One way to temporarily fix the issue is to restart the backend process, which is why we're now looking into our integration of neo4j.

We have several "suspects", and I hope that one of you can point us in the right direction or spot something suspicious (sorted from most to least likely to be the cause, in my opinion):

  • We use the node cluster module in production, with at least 4 child processes being forked on startup. I'm pretty sure the driver is instantiated once at startup of the application and then either copied or made available to those child processes in some other fashion (see the first sketch after this list). I haven't found anything obvious while googling and looking through the issues/code here and in the neo4j-js driver repository.

  • Since switching we had to implement our own transaction handling. We decided to go the easy route and wrap each request in a transaction (since we write to the database on each request anyway). Because we use GraphQL on the API layer, a single HTTP request can result in several GraphQL endpoints being called. We start a transaction for each endpoint, so there is a 1-n relation between HTTP requests and transactions. We also use continuation-local storage (via nest-cls) to keep track of which endpoint is wrapped in which transaction and to pass it on to Neo4jService#write calls, even when multiple requests are processed simultaneously (a minimal sketch of this pattern also follows the list).

  • We make use of as much concurrency as we can when implementing endpoints. This means we often use Promise.all(inputs.map(async () => { ... })) constructs to fetch data from the database (this is what caused a lot of the problems with drivine previously). Again, I think this is most likely not the issue, since the neo4j driver itself should be able to handle it, and I haven't seen anything to the contrary, be it here or in the neo4j-js driver.
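
For reference on the first point, here's a minimal sketch of the kind of setup I mean (URI and credentials are placeholders). If I understand the cluster module correctly, each forked worker re-executes the entry module from the top, so every worker ends up with its own driver instance and its own connection pool rather than a copied or shared one:

```ts
// Sketch: how the cluster module interacts with driver creation.
// URI and credentials are placeholders.
import cluster from 'cluster';
import neo4j from 'neo4j-driver';

if (cluster.isMaster) { // cluster.isPrimary on Node >= 16
  // The primary process only forks; it shares no driver state,
  // sockets, or connection pools with the workers.
  for (let i = 0; i < 4; i++) cluster.fork();
} else {
  // Each worker re-executes this module from the top, so every
  // worker builds its own driver and its own connection pool.
  const driver = neo4j.driver(
    'bolt://localhost:7687',
    neo4j.auth.basic('neo4j', 'password'),
  );
  // ...bootstrap the Nest app here with this worker-local driver...
}
```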

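And roughly what our transaction handling from the second point looks like, as a minimal sketch using Node's own AsyncLocalStorage (the primitive that nest-cls builds on). The helper names and the exact Neo4jService wiring are simplified and illustrative, not this library's API:

```ts
// Sketch: one transaction per GraphQL endpoint call, carried through
// all awaits of that call via AsyncLocalStorage.
import { AsyncLocalStorage } from 'async_hooks';
import neo4j, { Transaction } from 'neo4j-driver';

const txStorage = new AsyncLocalStorage<Transaction>();
const driver = neo4j.driver(
  'bolt://localhost:7687',
  neo4j.auth.basic('neo4j', 'password'),
);

// Wrap a single resolver call in its own transaction.
async function withTransaction<T>(work: () => Promise<T>): Promise<T> {
  const session = driver.session();
  const tx = session.beginTransaction();
  try {
    // Every await inside `work` sees the same transaction via getStore().
    const result = await txStorage.run(tx, work);
    await tx.commit();
    return result;
  } catch (err) {
    await tx.rollback();
    throw err;
  } finally {
    await session.close();
  }
}

// Repository code picks up the ambient transaction:
async function runInCurrentTx(query: string, params: Record<string, unknown>) {
  const tx = txStorage.getStore();
  if (!tx) throw new Error('no transaction bound to this call');
  return tx.run(query, params);
}
```
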
Does any of the points mentioned above raise a red flag in your opinion? If not, I'll probably try to create a reproducible example repo and post it here.

Thanks in advance! :-)

Some additional info:

  • We use the following versions of related libraries (I know these are old versions; an upgrade is on the way):
    • node: 14.x
    • nest: 8.x
    • neo4j-js: 4.x
    • neo4j: 4.4.14
@adam-cowley
Owner

Hey @vongruenigen,

Queries getting slower and slower indicates a problem with an inefficient query or possibly missing indexes. How are you running Neo4j? You could check query.log on the neo4j instance to look for slow-running queries. You can take those, prepend them with EXPLAIN, and look for areas that may cause problems.
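
Something like the following sketch does this from the application side (query text and connection details are placeholders). Prepending EXPLAIN makes the server return the plan without actually executing the query; PROFILE would execute it and include actual row counts:

```ts
// Sketch: fetch the plan for a suspect query via the driver.
import neo4j from 'neo4j-driver';

async function explain(query: string) {
  const driver = neo4j.driver(
    'bolt://localhost:7687',
    neo4j.auth.basic('neo4j', 'password'),
  );
  const session = driver.session();
  try {
    // With EXPLAIN, the plan comes back on the result summary.
    const result = await session.run(`EXPLAIN ${query}`);
    console.dir(result.summary.plan, { depth: null });
  } finally {
    await session.close();
    await driver.close();
  }
}

explain('MATCH (n:Person) RETURN n').catch(console.error);
```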

It's hard to say without seeing the graph, though. I'm happy to have a call in the new year to investigate, if you can wait that long - adam at neo4j dot com. I'd also suggest contacting your sales rep if you are running the Enterprise edition.

@vongruenigen
Author

Thanks for your reply @adam-cowley. The strange thing is that those particular queries haven't changed from one release (where everything was fine) to the next one (where we ran into the performance issues).

We did quite a lot of investigation over the weekend and realised that the async local storage (which we now use for transaction handling) might be to blame. We're in the process of deploying a fix to all environments today.

I'll update (and hopefully close) this issue then.

@vongruenigen
Author

Totally forgot to update this issue: after some investigation, we found out that the issue was not linked to this library, or to any external library at all, but to the specific node version we were using, because prior to 14.15.2 there was a performance degradation over time in the tracking of promises (see the pull request that fixed it).
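
For anyone who wants to check their own setup, something like the following rough sketch should make the effect visible, assuming the slowdown shows up once an AsyncLocalStorage is active (that is what switches on promise tracking via async_hooks). Batch sizes and iteration counts are arbitrary; on an affected Node version the per-batch cost creeps upward over iterations, on a patched one it stays flat:

```ts
// Rough repro sketch: time identical batches of awaited promises
// while an AsyncLocalStorage context is active.
import { AsyncLocalStorage } from 'async_hooks';
import { performance } from 'perf_hooks';

const als = new AsyncLocalStorage<number>();

// Await a large batch of already-resolved promises; with promise
// tracking active, each await goes through the async_hooks machinery.
async function batch(size: number): Promise<void> {
  for (let i = 0; i < size; i++) {
    await Promise.resolve(i);
  }
}

async function main(): Promise<void> {
  for (let iter = 0; iter < 20; iter++) {
    const start = performance.now();
    await batch(100_000);
    console.log(`iteration ${iter}: ${(performance.now() - start).toFixed(1)} ms`);
  }
}

// Running inside an ALS context is what enables promise tracking.
als.run(0, () => main().catch(console.error));
```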
