[1.6.1] 100% CPU on Ubuntu server, caused by client? #9796

Open
KoenLav opened this issue Apr 4, 2018 · 120 comments
Labels
confirmed We want to fix or implement it Project:Server Performance

Comments

@KoenLav
Contributor

KoenLav commented Apr 4, 2018

We are experiencing consistent 100% CPU usage on one of our Meteor deployments. This issue seems to have appeared out of nowhere (not after a deploy of a new version).

What we already tried: creating a new server with the same specs and deploying to that server. When we switch our DNS over to the new server, at first all is well; but (we think) once a particular client connects, the Node process starts using 100% CPU.

We're deploying Meteor to an Ubuntu host using MUP (which instantiates a Docker container consisting of the meteord base image and our bundle). The image has NodeJS 8.9.4 and NPM 5.6.0.

Any pointers as to how to pinpoint this issue would be greatly appreciated!

We believe this is the most interesting portion of the V8 profiler we ran on the logs (https://paste2.org/zPsHbDya):

   ticks parent  name
  2420054   92.5%  /opt/nodejs/bin/node
  219101    9.1%    /opt/nodejs/bin/node
  165262   75.4%      Builtin: ArrayMap
  57108   34.6%        LazyCompile: *changeField /bundle/bundle/programs/server/packages/ddp-server.js:287:25
  57067   99.9%          LazyCompile: *added /bundle/bundle/programs/server/packages/mongo.js:3650:23
  57042  100.0%            LazyCompile: *<anonymous> packages/meteor.js:1231:19
  48280   29.2%        LazyCompile: *EJSON.clone.v /bundle/bundle/programs/server/packages/ejson.js:646:15
  47783   99.0%          Function: ~Object.keys.forEach.key /bundle/bundle/programs/server/packages/ejson.js:697:26
  47783  100.0%            Builtin: ArrayForEach
  23827   14.4%        LazyCompile: *Object.keys.forEach.key /bundle/bundle/programs/server/packages/ejson.js:320:28
  23616   99.1%          LazyCompile: *added /bundle/bundle/programs/server/packages/mongo.js:3650:23
  23616  100.0%            LazyCompile: *<anonymous> packages/meteor.js:1231:19
  22206   13.4%        LazyCompile: *Object.keys.forEach.key /bundle/bundle/programs/server/packages/ejson.js:697:26
  22205  100.0%          Builtin: ArrayForEach
  15641   70.4%            LazyCompile: *_sendAdds /bundle/bundle/programs/server/packages/mongo.js:1913:23
   3405   15.3%            LazyCompile: *EJSON.clone.v /bundle/bundle/programs/server/packages/ejson.js:646:15
   2442   11.0%            LazyCompile: *<anonymous> /bundle/bundle/programs/server/packages/mongo.js:1782:34
   6566    4.0%        Builtin: ArrayForEach
   4420   67.3%          LazyCompile: *_sendAdds /bundle/bundle/programs/server/packages/mongo.js:1913:23
   4420  100.0%            Function: ~<anonymous> /bundle/bundle/programs/server/packages/mongo.js:1782:34
   2082   31.7%          LazyCompile: *<anonymous> /bundle/bundle/programs/server/packages/mongo.js:1782:34
   2082  100.0%            LazyCompile: *<anonymous> packages/meteor.js:1231:19
   3827    2.3%        RegExp: ([!#\\$%&'\\*\\+\\-\\.\\^_`\\|~0-9a-z]+)(?:=(?:([!#\\$%&'\\*\\+\\-\\.\\^_`\\|~0-9a-z]+)|"((?:\\\\[\\x00-\\x7f]|[^\\x00-\\x08\\x0a-\\x1f\\x7f"])*)"))? {9}
   3812   99.6%          Function: ~Object.keys.forEach.key /bundle/bundle/programs/server/packages/ejson.js:697:26
   3812  100.0%            Builtin: ArrayForEach
  25601   11.7%      LazyCompile: *ObserveHandle.stop /bundle/bundle/programs/server/packages/mongo.js:1955:41
  21756   85.0%        Function: ~<anonymous> /bundle/bundle/programs/server/packages/mongo.js:3663:25
  21756  100.0%          LazyCompile: *<anonymous> packages/meteor.js:1231:19
  19776   90.9%            Function: ~<anonymous> /bundle/bundle/programs/server/packages/ddp-server.js:1298:32
   1978    9.1%            LazyCompile: *<anonymous> /bundle/bundle/programs/server/packages/ddp-server.js:1298:32
   2805   11.0%        LazyCompile: *baseUniq /bundle/bundle/programs/server/npm/node_modules/lodash.union/index.js:742:18
   2805  100.0%          LazyCompile: *<anonymous> packages/meteor.js:1231:19
   2805  100.0%            Function: ~<anonymous> /bundle/bundle/programs/server/packages/ddp-server.js:1298:32
   1040    4.1%        LazyCompile: *<anonymous> /bundle/bundle/programs/server/packages/mongo.js:3663:25
   1040  100.0%          LazyCompile: *<anonymous> packages/meteor.js:1231:19
    535   51.4%            Function: ~<anonymous> /bundle/bundle/programs/server/packages/ddp-server.js:1298:32
    505   48.6%            LazyCompile: *<anonymous> /bundle/bundle/programs/server/packages/ddp-server.js:1298:32
   2944    1.3%      LazyCompile: *v.map.value /bundle/bundle/programs/server/packages/ejson.js:678:18
   2607   88.6%        Function: ~Socket._writeGeneric net.js:708:42
   2607  100.0%          LazyCompile: *Socket._write net.js:785:35
   1686   64.7%            LazyCompile: *ondata internal/streams/legacy.js:14:18
    495   19.0%            LazyCompile: *Connection.write /bundle/bundle/programs/server/npm/node_modules/meteor/npm-mongo/node_modules/mongodb-core/lib/connection/connection.js:527:38
    419   16.1%            LazyCompile: *clearBuffer _stream_writable.js:469:21
    179    6.1%        Builtin: ArrayMap
     57   31.8%          LazyCompile: *EJSON.clone.v /bundle/bundle/programs/server/packages/ejson.js:646:15
     52   91.2%            Function: ~Object.keys.forEach.key /bundle/bundle/programs/server/packages/ejson.js:697:26
      4    7.0%            Function: ~<anonymous> /bundle/bundle/programs/server/packages/mongo.js:1879:36
      1    1.8%            Function: ~changeField /bundle/bundle/programs/server/packages/ddp-server.js:287:25
     55   30.7%          LazyCompile: *changeField /bundle/bundle/programs/server/packages/ddp-server.js:287:25
     55  100.0%            LazyCompile: *added /bundle/bundle/programs/server/packages/mongo.js:3650:23
     31   17.3%          LazyCompile: *Object.keys.forEach.key /bundle/bundle/programs/server/packages/ejson.js:697:26
     31  100.0%            Builtin: ArrayForEach
     20   11.2%          LazyCompile: *Object.keys.forEach.key /bundle/bundle/programs/server/packages/ejson.js:320:28
     20  100.0%            LazyCompile: *added /bundle/bundle/programs/server/packages/mongo.js:3650:23
      5    2.8%          Builtin: ArrayForEach
      5  100.0%            LazyCompile: *_sendAdds /bundle/bundle/programs/server/packages/mongo.js:1913:23
      4    2.2%          RegExp: ([!#\\$%&'\\*\\+\\-\\.\\^_`\\|~0-9a-z]+)(?:=(?:([!#\\$%&'\\*\\+\\-\\.\\^_`\\|~0-9a-z]+)|"((?:\\\\[\\x00-\\x7f]|[^\\x00-\\x08\\x0a-\\x1f\\x7f"])*)"))? {9}
      4  100.0%            Function: ~Object.keys.forEach.key /bundle/bundle/programs/server/packages/ejson.js:697:26
      4    2.2%          LazyCompile: *<anonymous> /bundle/bundle/programs/server/packages/mongo.js:1879:36
      4  100.0%            Function: ~SQp._run packages/meteor.js:851:21
      3    1.7%          LazyCompile: *callback zlib.js:447:20
      3  100.0%            Builtin: ArrayForEach
     60    2.0%        Function: ~<anonymous> /bundle/bundle/programs/server/packages/ddp-server.js:740:22
@alanning
Contributor

alanning commented Apr 6, 2018

Not sure about the cause but perhaps try testing with a Galaxy deploy and see if you can recreate the problem there. Would add some extra data to help triangulate down to the root cause.

From the logs it looks like mongo communication is involved somehow so perhaps check the metrics on your mongo cluster and see if anything looks suspicious.

When you say "when a particular client connects", did you mean one particular customer, or "one client" more in the sense of "our desktop client"? If one particular client, have you been able to reproduce it in a test environment using their credentials/configuration?

@KoenLav
Contributor Author

KoenLav commented Apr 8, 2018

@alanning our server setup consists of multiple microservices (backoffice, operational, scheduled tasks, api, customer, etc.) and is hosted on DigitalOcean. Our database instance is hosted with MongoDB Atlas.

We managed to resolve the issue by either (1) shutting down ALL our production servers which were connected to MongoDB (at the same time) or (2) logging out all connected users (by updating all users). At this point we were spray painting with possible solutions, hoping something would stick.

Before doing so we tried several other options like shutting down our services one by one, but once the container which was experiencing the issue spun up again the issue reappeared (even when we deployed an instance on a completely new server). In addition to that we assumed some kind of infinite loop in our code (resulting from poorly written Collection Hooks) could be the issue, but none of these changes did any good.

I did not mean a specific client (like Cordova/Browser/Desktop) but rather one specific session on one device. This is what we assumed because the issue reappeared x seconds after restarting the instance, where x was longer if we left the microservice down longer (which results in a longer timeout between reconnect attempts).

Judging from the logs we concluded there would have to be a MongoDB-related issue as well; we checked the logs and found some (limited, non-recurring) duplicate key errors which appeared right before the issue started, but we have not been able to reproduce this behavior.

As the issue is solved for us now I'm going to leave it at this, but it might be worthwhile to investigate which issues could result from a client continuously writing objects with duplicate keys to MongoDB (as this is the most probable scenario judging from all the details).

@KoenLav KoenLav closed this as completed Apr 8, 2018
@KoenLav
Contributor Author

KoenLav commented Apr 10, 2018

Today the issue re-appeared; we would really like to find the source of this issue.

Any thoughts on finding the cause would be appreciated!

@KoenLav KoenLav reopened this Apr 10, 2018
@alanning
Contributor

What version of Meteor are you running?

I would work towards reproducing the issue by replicating that problem client's data/session in your test environment. Once you can reproduce it then you can start isolating different parts to find the culprit.

Does MongoDB Atlas provide any relevant metrics for the timeframe in question?

@KoenLav
Contributor Author

KoenLav commented Apr 11, 2018

Meteor 1.6.1.

We too would like to replicate the issue, but other than suspecting it has something to do with MongoDB, at present we don't even know how to reproduce it in production. We have a specific timestamp for when the issue (at least, the high CPU load resulting from it) started, so we might be able to get some more information from the logs.

To that effect I searched all our Collections looking for objects which were created/updated within 3 minutes of the issue appearing (the CPU usage increases from 5% to 100% in a matter of seconds), but to no avail.

In addition to that we looked through the MongoDB Atlas metrics, but they didn't show any significant increase in CPU, bandwidth, connections, or anything else. This makes us think that, while the problem might be caused by some specific connection/data, the result is only measurable within the code. Could it be that Meteor thinks a document is updated all the time and attempts to publish changes to all clients, while it actually is not?

Are there any other profilers we could run or logging we could enable to further narrow down the issue?
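For reference, one way to capture such a profile from a running server is sketched below (assuming Node 8+, where the built-in inspector module is available; this is not code from the thread). The resulting .cpuprofile file can be opened in Chrome DevTools.

import inspector from 'inspector';
import fs from 'fs';

// Capture a CPU profile of the running server for `durationMs` and write it
// to disk; the output path and default duration are only examples.
export function captureCpuProfile(durationMs = 30 * 1000) {
  const session = new inspector.Session();
  session.connect();

  session.post('Profiler.enable', () => {
    session.post('Profiler.start', () => {
      setTimeout(() => {
        session.post('Profiler.stop', (err, result) => {
          if (!err) {
            fs.writeFileSync(`/tmp/${Date.now()}.cpuprofile`, JSON.stringify(result.profile));
          }
          session.disconnect();
        });
      }, durationMs);
    });
  });
}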

@smeijer

smeijer commented Apr 11, 2018

I would like to join in this investigation. I too am experiencing a similar issue. Production servers jump to 100% at random moments, but always during working hours.

First I thought I had a memory leak, or indeed an infinite loop somewhere. But after testing & debugging for a week, I still haven't found a thing.

Even putting the server under load with artillery doesn't bring it to its knees.

Are there any other profilers we could run or logging we could enable to further narrow down the issue?

Any leads you can give us would be highly appreciated. We're running Meteor 1.6.1 on Ubuntu via PM2; pm2 restart all "fixes" the issue for some time.

@hwillson
Contributor

hwillson commented Apr 11, 2018

I have a sneaking suspicion that the issues listed here are caused by Fiber spikes. See #7747 for more details. This is a known issue and still happens with current day Meteor (the repro in that issue still shows the Fiber spike after updating to 1.6.1.1). That issue was closed due to inactivity, but if anyone is interested in digging into this further, the details in that issue should help.

@KoenLav
Contributor Author

KoenLav commented Apr 11, 2018

@hwillson interesting!

In our particular app clients are subscribed to +/- 15 collections, where an insert into one collection causes cascading inserts and updates on other collections.

Is there some way to easily try out the patch which sets the fiber pool size?
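If it is the poolSize patch being referred to, trying it out is roughly a one-liner at server startup (a sketch; the value is only an example):

import Fiber from 'fibers';

// node-fibers keeps a pool of reusable fiber stacks (default poolSize: 120);
// raising it avoids the slow allocation path when many fibers run at once.
Fiber.poolSize = 10000; // example value, not a recommendation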

Also, could you sneak a peek at the logs (or is your suspicion already based on those?) and share what you make of them?

To me it looks like Meteor thinks that there are constantly changes in the data, while this is in actuality not the case.
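One way to check this would be counting observer callbacks on a suspect cursor server-side; a rough sketch (the collection name and query are assumptions):

import { Meteor } from 'meteor/meteor';
import { Mongo } from 'meteor/mongo';

const Orders = new Mongo.Collection('orders'); // hypothetical collection

Meteor.startup(() => {
  let changedCount = 0;

  // Count how often the server-side observer fires for this cursor.
  Orders.find({ open: true }).observeChanges({
    changed() { changedCount += 1; },
  });

  setInterval(() => {
    console.log(`observeChanges: ${changedCount} 'changed' events in the last minute`);
    changedCount = 0;
  }, 60 * 1000);
});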

@KoenLav
Contributor Author

KoenLav commented Apr 11, 2018

@smeijer just to make sure we are experiencing the same issue; the CPU load increases to 100% within a matter of seconds and remains there.

The issue disappeared upon restarting Meteor (but in our initial scenario it resurfaced right away, whereas now it has been gone for a day).

Do you also make use of quite a few subscriptions, or update a significant number of documents in a collection to which some clients are subscribed at some point?

@smeijer

smeijer commented Apr 11, 2018

@KoenLav,

Yes. The CPU increases to 100% within a matter of seconds. Below you'll find a CPU log of the past two weeks. I've tried letting it run to see if it would come down by itself, but that didn't happen.

(50% is 100% of 1 of the 2 cores)
[CPU usage chart over the past two weeks]

There is no other process running besides the meteor bundle and pm2, and you can see on the chart that there is enough headroom for normal operations.

Yes, the issue disappeared upon restarting Meteor with pm2 restart x (where x is 0 or 1, depending on which instance is stuck). I'm running a total of 6 instances divided over 3 servers. I've never had two servers "crash" at the same time, but after a restart of the stuck one there is a bigger chance that another instance gets stuck in the next couple of hours. This isn't a given, though.

Regarding the subscriptions: I'm in the process of migrating away from publish / subscribe / methods towards full GraphQL. So there aren't a lot of subscriptions in my application, but the ones that are still there use quite a bit of data (GeoJSON transport).

I'm not sure what you would consider significant, but I wouldn't say so. I have a write-heavy application, and yes, these are the "busy hours", but you can see on my CPU chart that there should be enough headroom.

@smeijer

smeijer commented Apr 11, 2018

Is there a way I can count / log the number of active fibers, to see if / when I reach the 120 mentioned in the Fiber thread?

@hwillson
Contributor

@KoenLav It's just a hunch - I'll take a closer look at the logs shortly.

@smeijer Try logging Fiber.fibersCreated:

import Fiber from 'fibers';
console.log(`Current fiber count: ${Fiber.fibersCreated}`);

@KoenLav
Contributor Author

KoenLav commented Apr 11, 2018

@hwillson thanks! One question: does this log the number of active fibers or the number of fibers created (all time)?

I believe it to be the latter, according to this: laverdet/node-fibers#85

But I'm not entirely sure how Meteor handles Fibers locally, so these numbers might in fact be the same thing.

@smeijer this might also be interesting: laverdet/node-fibers#91

@KoenLav
Contributor Author

KoenLav commented Apr 11, 2018

Added:

import Fiber from 'fibers';

setInterval(() => {
  console.log(`Current fiber count: ${Fiber.fibersCreated}`);
}, 60 * 1000);

To our /server/main.js file in production. Let's see what happens.

@smeijer

smeijer commented Apr 11, 2018

I'll read those two issues tomorrow, but to share some results already. After a cold start, and just clicking away in my local dev environment:

fibersCreated 331 
poolsize 120

heapUsed 108.173 MiB 
heapTotal 118.438 MiB 
rss 148.547 MiB 

I don't see fibersCreated go down after closing my browser session, so at first sight it seems to be an all-time count. But it got there so fast that it can still be a problem; I have no idea. A restart, followed by opening a specific page directly, resulted in 271 fibers under the fibersCreated property (still not saying anything about whether they are being disposed / reused correctly).

I do use async / await heavily in the graphql part, and trigger 'fire and forget' actions there (mongo updates, not waiting for results).

import numeral from 'numeral';
import Fiber from 'fibers';

import log from '../logger';

setInterval(() => {
  const { rss, heapTotal, heapUsed } = process.memoryUsage();
  const { fibersCreated, poolSize } = Fiber;

  log.trace(
    { rss, heapTotal, heapUsed, fibersCreated, poolSize },
    'heapUsed %s heapTotal %s rss %s fibers %d poolSize %d',
    numeral(heapUsed).format('0.000 ib'),
    numeral(heapTotal).format('0.000 ib'),
    numeral(rss).format('0.000 ib'),
    fibersCreated,
    poolSize,
  );
}, 5000).unref();

@xet7
Contributor

xet7 commented Apr 11, 2018

I'm using Meteor 1.6.0.1 and having the same problems; adding this comment to follow this issue.

@KoenLav
Contributor Author

KoenLav commented Apr 11, 2018

@xet7 does your problem also occur randomly and do you suspect it to be related to subscriptions/observing changes?

Do you have any more information that is relevant from the other thread(s)?

@xet7
Contributor

xet7 commented Apr 11, 2018

@KoenLav

Setup is as mentioned at https://github.com/wekan/wekan-mongodb/issues/2#issuecomment-380606330

Also, when tested with 4 EC2 instances without Docker, the result is the same: node has 100% CPU usage.

Oplog should be set up correctly, but I'm not sure.

The problem shows up with Wekan (https://wekan.github.io) and other Meteor.js projects, both on powerful servers and on small Raspberry Pi setups, so new projects are being started with something other than Meteor.js because of these CPU usage problems.

@aboire

aboire commented Apr 12, 2018

@KoenLav
Contributor Author

KoenLav commented Apr 12, 2018

@hwillson could you provide some more information on what sort of scenarios could cause such a Fiber spike?

This could help us find the offending code portions in our application code.

@xet7
Contributor

xet7 commented Apr 12, 2018

Is this still the same fiber issue that can be reproduced with this?
#7747 (comment)

@smeijer

smeijer commented Apr 12, 2018

So I guess this is a way to track the currently active fibers:

import Fiber from 'fibers';

const runFiber = Fiber.prototype.run;

Fiber.fibersActive = 0;
Fiber.prototype.run = function runWithCount(...args) {
  const wasStarted = this.started;

  try {
    runFiber.apply(this, args);
  } finally {
    Fiber.fibersActive +=
      wasStarted && !this.started ? -1 : !wasStarted && this.started ? +1 : 0;
  }
};

setInterval(() => {
  const { fibersActive, fibersCreated, poolSize } = Fiber;

  console.log(`fibersActive ${fibersActive}/${poolSize} fibersCreated ${fibersCreated}`);
}, 5000).unref();

Logging statements like:

fibersActive 1/120 fibersCreated 3
fibersActive 5/120 fibersCreated 8
fibersActive 10/120 fibersCreated 17
fibersActive 13/120 fibersCreated 17
fibersActive 21/120 fibersCreated 264
fibersActive 21/120 fibersCreated 386

I had a case where fibersActive was -1, but I guess there were already active fibers before the wrap was applied. Not a big issue for now, as it's still giving useful insights, just with an offset of 1 or 2.

@xet7
Contributor

xet7 commented Apr 12, 2018

I tried the repro code here:
https://github.com/jparyani/meteor-fiber-repro

  1. Upgraded to Meteor 1.6.0.1:
meteor update --release 1.6.0.1
  2. Added babel:
meteor npm install --save babel-runtime
  3. Started it:
meteor run
  4. Web browser at http://localhost:3000 shows this web interface:

[screenshot: repro-app]

  5. Clicked the left button, and then clicked the right button. Those are visible as the left and right spikes in the orange CPU usage area:

[screenshot: repro-cpu]

  6. On the console there is this log:
$ meteor run
[[[[[ ~/repos/meteor-fiber-repro ]]]]]        

=> Started proxy.                             
=> A patch (Meteor 1.6.1.1) for your current release is available!
   Update this project now with 'meteor update --patch'.
=> Started MongoDB.                           
I20180412-11:05:12.832(3)? 4                  
=> Started your app.

=> App running at: http://localhost:3000/
I20180412-11:05:17.832(3)? 5
I20180412-11:05:22.833(3)? 5
I20180412-11:05:27.834(3)? 5
I20180412-11:05:32.835(3)? 5
I20180412-11:05:37.835(3)? 5
I20180412-11:05:42.835(3)? 5
I20180412-11:05:47.835(3)? 5
I20180412-11:05:52.835(3)? 5
I20180412-11:05:57.836(3)? 5
I20180412-11:06:02.837(3)? 5
I20180412-11:06:07.837(3)? 5
I20180412-11:06:12.837(3)? 5
I20180412-11:06:17.837(3)? 5
I20180412-11:06:22.837(3)? 5
I20180412-11:06:27.838(3)? 5
I20180412-11:06:32.838(3)? 5
I20180412-11:06:37.838(3)? 5
I20180412-11:06:42.841(3)? 5
I20180412-11:06:47.842(3)? 5
I20180412-11:06:52.843(3)? 5
I20180412-11:06:57.844(3)? 5
I20180412-11:07:02.845(3)? 5
I20180412-11:07:07.845(3)? 5
I20180412-11:07:12.846(3)? 5
I20180412-11:07:17.847(3)? 5
I20180412-11:07:22.848(3)? 5
I20180412-11:07:27.850(3)? 5
I20180412-11:07:32.850(3)? 5
I20180412-11:07:38.680(3)? 5
I20180412-11:07:43.681(3)? 5
I20180412-11:07:48.683(3)? 5
I20180412-11:07:53.684(3)? 5
I20180412-11:07:58.685(3)? 5
I20180412-11:08:03.815(3)? 540
I20180412-11:08:12.340(3)? 4420
I20180412-11:08:15.584(3)? set complete
I20180412-11:08:17.384(3)? 9155
I20180412-11:08:22.397(3)? 9155
I20180412-11:08:27.399(3)? 9155
I20180412-11:08:32.398(3)? 9155
I20180412-11:08:34.261(3)? unset complete
I20180412-11:08:37.398(3)? 9155
I20180412-11:08:42.399(3)? 9155
I20180412-11:08:47.399(3)? 9155
I20180412-11:08:52.399(3)? 9155
I20180412-11:08:57.400(3)? 9155
I20180412-11:08:58.500(3)? set complete
I20180412-11:09:03.110(3)? 12354
I20180412-11:09:08.499(3)? 17928
I20180412-11:09:13.500(3)? 17959
I20180412-11:09:18.502(3)? 17959
I20180412-11:09:23.501(3)? 17959
I20180412-11:09:28.502(3)? 17959
I20180412-11:09:33.503(3)? 17959
I20180412-11:09:38.504(3)? 17959
I20180412-11:09:43.505(3)? 17959

@smeijer

smeijer commented Apr 12, 2018

Running artillery against my local environment. I'm simulating the HTTP GET request for an initial page, which uses onPageLoad to preload a redux state (only fetching the current user by login token), and then fetching data through the matching graphql endpoint. Something like:

- http.get /blog/$id
- graphql.query blog(id: $id)
- http.get /blog/$id/comments
- graphql.query comments(blogId: $id)

Turns out the number of fibers is going through the roof. Although it does return below the pool size, the used memory does not.

[chart: fibersCreated and memory usage during the artillery run]

I've scaled fibersCreated down by a factor of 1000 to keep the chart readable.

Worth mentioning: I killed artillery before it finished, because it started to become terribly slow. I'll try tuning the numbers down, but I think it will be useful to run this code in production to get some real data.

@smeijer

smeijer commented Apr 13, 2018

So I've been logging production for a day now, and the number of active fibers has settled around ~25 (based on the code posted above).

In the meantime, I've also scaled one server up. I had been running the same number of instances as each server has cores (pm2 instances: max), and I've now scaled one server up to leave 1 spare CPU (pm2 instances: -1).

I've been playing a bit with the --max_old_space_size flag on my dev machine, and when it's too low I get the same symptoms. So it looks like garbage collection becomes just too much for the CPU to handle.

That also matches another scenario. To debug this issue, I've implemented a pmx trigger to create a heap snapshot and a heap profile on request. But when requesting a heap snapshot there is a big chance of the CPU getting stuck; not all the time, but 1 out of 4 for sure.

As I've been restarting the services quite often these days to implement debugging options/logging statements, the services haven't gone stale in the last 24 hours. I guess I have to wait now.
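For anyone who wants to try the same flag under pm2, here is a minimal ecosystem-file sketch (the script path, instance count and heap size are illustrative, not my actual config):

// ecosystem.config.js
module.exports = {
  apps: [
    {
      name: 'meteor-app',
      script: './bundle/main.js',              // path to the built Meteor bundle
      instances: -1,                           // all cores minus one, as mentioned above
      node_args: '--max-old-space-size=2048',  // raise V8's old-space limit (example value)
      env: {
        PORT: 3000,
        ROOT_URL: 'http://localhost:3000',
        MONGO_URL: 'mongodb://localhost:27017/app',
      },
    },
  ],
};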

@KoenLav
Contributor Author

KoenLav commented Apr 13, 2018

Stable at 37 or 38 fibers here.

@smeijer

smeijer commented Apr 13, 2018

Creating a heap dump in dev mode (on a Windows PC) throws an exception here.

Exited with code: 3221225477

I've searched for this error, and found a good explanation at stackoverflow

The error code 3221225477 is 0xC0000005 in hex, which on Windows is:

#define STATUS_ACCESS_VIOLATION ((NTSTATUS)0xC0000005L)

Access violation is Windows' version of "segmentation fault", which simply said means that the program tried to access a memory which is not allocated. This can happen for a lot of different reasons, but mostly (if not always) is a bug in the program.

So can this issue be related to the other segmentation fault bugs reported here? There have been a lot lately due to a bug in node. I have no idea why I don't see this error code in production. But the fact is, on my dev machine a heap dump kills meteor with this error and restarts the instance.

In production, sometimes the snapshot works and sometimes it causes the CPU to hang at 100%. But even when the snapshot is made successfully, the process gets restarted somehow. Can a segmentation fault cause a CPU to get stuck at 100%?


For reference, my heap-dumping-code:

import heapDump from 'heapdump';
import path from 'path';

export function writeSnapshot(savePath) {
  try {
    const filename = `${Date.now()}.heapsnapshot`;

    heapDump.writeSnapshot(path.join(savePath, filename), () => {
      console.log('Saved heap snapshot to disk.');
    });
  } catch (ex) {
    console.error('Err saving heap snapshot', ex);
  }
}

setTimeout(() => {
  setInterval(() => {
    writeSnapshot(process.env.PROFILE_DATA_PATH);
  }, ~~process.env.HEAP_DUMP_INTERVAL * 1000).unref();
}, ~~process.env.HEAP_DUMP_INTERVAL * 1000);

@KoenLav
Contributor Author

KoenLav commented Dec 7, 2018

@benjamn the logic being that the issue in this thread also triggers a lot of Fibers which are never returned to the pool.

Maybe a nice start to verify this would be running the reproduction listed in the older Fiber spike issue mentioned by hwillson?

@vinaykevadia

@KoenLav thanks for your effort on this.
I am running Meteor on EC2 (deployed with the MUP Docker tool) and MongoDB Atlas.
I am facing the same problem: an unexpected CPU hang on EC2, and the only option is to reboot to get it back to normal.
Did you guys find any solution? I was going through this discussion and was not able to find any particular solution for this issue.

@KoenLav
Contributor Author

KoenLav commented Jan 4, 2019

@vinaykevadia we managed to mitigate the issue in one of our services by limiting the amount of 'hard calculation' we are performing there.

We are however currently experiencing this issue again, in another service where reports are generated based on a decent amount of data. I'm working on profiling this specific issue now.

The only thing that has helped for us, so far, is spray painting with solutions and hoping something sticks.

@smeijer

smeijer commented Jan 4, 2019

Do you guys find any solution?

I solved it by removing almost all publications and replacing them with polling. Try to find your most expensive publication and get rid of that one first.

#9796 (comment)
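Roughly what that replacement can look like (a sketch, not my actual code; the method name, collection and interval are placeholders):

import { Meteor } from 'meteor/meteor';
import { Mongo } from 'meteor/mongo';
import { check } from 'meteor/check';

const Items = new Mongo.Collection('items'); // hypothetical collection

if (Meteor.isServer) {
  // A plain method instead of a publication.
  Meteor.methods({
    'items.fetch'(query) {
      check(query, Object);
      return Items.find(query, { limit: 500 }).fetch();
    },
  });
}

if (Meteor.isClient) {
  // Poll the method on an interval instead of subscribing.
  setInterval(() => {
    Meteor.call('items.fetch', { active: true }, (err, docs) => {
      if (!err) {
        // feed docs into local state (minimongo, redux, component state, ...)
      }
    });
  }, 10 * 1000);
}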

@smeijer

smeijer commented Jan 4, 2019

@KoenLav, I am a bit curious about your situation though.

Previously we subscribed to this collection at startup, and clients could remain running that same subscription for weeks if not months. Now we refresh the subscription every 6 hours (by stopping it and resubscribing).
...
Luckily we managed to isolate the client which performed this action relatively quickly, and the issue disappeared the second I closed the browser window on their machine.

Do I understand correctly that your users (/clients) keep their browser open for several months? No reboot, or even a simple F5 browser refresh?

I think that if something like this can indeed be a problem, Meteor should add (optional) refreshing of subscriptions on a given interval to the core; if I understand it correctly, that's what fixed your problem.
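For reference, the periodic refresh being described looks roughly like this on the client (a sketch; the publication name and interval are placeholders):

import { Meteor } from 'meteor/meteor';

const REFRESH_INTERVAL_MS = 6 * 60 * 60 * 1000; // every 6 hours
let handle = Meteor.subscribe('ordersForRegister'); // hypothetical publication

setInterval(() => {
  handle.stop();                                   // tear down the server-side observer
  handle = Meteor.subscribe('ordersForRegister');  // start a fresh subscription
}, REFRESH_INTERVAL_MS);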

@KoenLav
Contributor Author

KoenLav commented Jan 6, 2019

@smeijer we provide a cash register system for hospitality establishments based on a combination of Meteor, React, Electron and Cordova (slowly migrating to React Native). Some of these cash registers (behind a bar, for instance) will just be "always on". Hence subscriptions are never refreshed (unless we do so manually).

We used to have a monolithic application, so reports and the like were also generated in the same application (which was just poor design), and this may have been the cause of the issue there (rather than the long-standing publications).

Now we have migrated these reports to the backoffice service, which is accessed via the browser (rather than Electron/Cordova), and the problem reappeared. Hence it's unlikely to have anything to do with (long-standing) publications (in this specific case). We have isolated the cause of the issue, which is generating a big (but not huge) report based on sales data for the past year, and have instrumented the application to allow for easy profiling based on the qualialabs:profile package by @veered.

In my opinion the issue isn't that the CPU spikes to 100% (I mean, that can just happen, especially on computationally heavy tasks), but rather the fact that it remains at 100% CPU even after completing the task (which seems to be happening right now).

I intend to provide a profile of the application while it is performing the heavy computations (which I don't think will render much, if any, interesting results) and one after it has finished the heavy computations (which I expect will show some kind of endless loop in Meteor/Fiber/Node/V8 internals).

@evolross
Contributor

evolross commented Jan 7, 2019

This isn't meant to be a solution to the problem, but as a work-around and perhaps a more optimized approach to generating these reports, have you considered using Meteor Methods to gather and return the data? Why use pub-sub and "real-time" data for a static report that is pulling a year's worth of data?

@KoenLav
Contributor Author

KoenLav commented Jan 7, 2019

@evolross this may not have been clear from my earlier comment, but generating the report is the issue (not subscribing to it afterwards).

The issue also appears when we only generate and don't even subscribe at a later point in time.

Your point is valid nonetheless, from an architectural perspective, but not relevant for this specific issue.

Besides, the end result of aggregated data is still less than 128KB, so subscribing shouldn't be an issue (although I agree it's also not necessary).

@ephemer

ephemer commented Apr 26, 2019

This probably isn't helpful because you mentioned nothing has changed, but I've seen issues like this time and time again in our apps when certain publications get too big and/or when there are too many subscribers to large publications.

Just yesterday I refactored one of our apps that was subscribing to a (static) collection that has grown over the years to hundreds of thousands of documents for admin users (because they could see everything). Generally performance was ok, but one admin signing in brought the entire server to a halt for minutes on end, if not for good.

By getting the data via a Meteor.method and using a mongo aggregation on the server side for good measure, these CPU spikes are (again) a thing of the past. That said, I didn't refactor another publication that only contains around 300 documents, and it alone causes a spike of 50% CPU usage on our small server instance upon subscribing (again, admins only - everyone else only gets a handful of docs). They are not big documents either - in total the data that gets to the server is only about 120kb, like in your instance.

So, based on many experiences with this over the past 5 years I'd strongly recommend for any production app to use Meteor Methods (or these days just GraphQL) for anything that doesn't absolutely need updates in real-time.

@evolross
Contributor

@ephemer What are you using to do a "mongo aggregation on the server side"? Is this a certain package?

@xet7
Contributor

xet7 commented Apr 26, 2019

In Wekan https://wekan.github.io , these fixes were implemented to improve performance:

  1. Add more indexes to MongoDB (Performance Enhancements wekan/wekan#2214). This lowered MongoDB CPU usage a lot and made Wekan a lot faster, so Wekan is usable for a large number of users. (A minimal sketch of adding an index in Meteor is shown after this comment.)

  2. Load only those cards from MongoDB that are visible, using infinite scroll (Use infinite-scrolling on lists wekan/wekan#2144 and Fix infinite scroll wekan/wekan#2287). These made it possible to load large kanban boards that were originally imported from Trello into Wekan.

There are still a lot more possibilities to improve, for example using new MongoDB 4.x features like aggregate queries. Wekan has Meteor 1.8.1 and MongoDB 4.0.7 in testing on the meteor-1.8 branch, so it works in Docker; once I get Meteor 1.8.1-based Wekan also working in Snap and Sandstorm it will be possible to start looking at aggregate queries (or even earlier), but currently Wekan is fast enough so it's not urgent yet. Currently Snap and Sandstorm Wekan still use Meteor 1.6.x.
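A minimal sketch of what adding such an index looks like in Meteor server code (the collection and field names are illustrative, not taken from the Wekan PRs):

import { Meteor } from 'meteor/meteor';
import { Mongo } from 'meteor/mongo';

const Cards = new Mongo.Collection('cards'); // hypothetical collection

Meteor.startup(async () => {
  // createIndex goes straight to the MongoDB driver and is a no-op if an
  // identical index already exists.
  await Cards.rawCollection().createIndex({ boardId: 1, listId: 1 });
});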

@KoenLav
Contributor Author

KoenLav commented Apr 28, 2019

@ephemer we still have a subscription to a collection which receives a lot of inserts, updates and deletes. However we cancel the subscription and resubscribe every 6 hours, limiting the 'exposure' to this potential issue.

Additionally we used to provide a reporting service (which was poorly written to begin with) within our operational service; this reporting service could clog up the entire server leaving no room to process the oplog, which caused another issue (fallback to polling).

All in all we noticed that subscriptions perform poorly when they receive either a large number of events or a limited number of events with large payloads.

@stale

stale bot commented Dec 10, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale-bot label Dec 10, 2019
@filipenevola filipenevola added the confirmed We want to fix or implement it label Dec 12, 2019
@stale stale bot removed the stale-bot label Dec 12, 2019
@KoenLav
Contributor Author

KoenLav commented Dec 29, 2019

So I noticed @vlasky mentioning a performance regression in Meteor 1.9 (Node 12) and I thought let's check how the Fiber bomb repro performs on 1.9 vs 1.8:

Press set and unset 5 times each (set, unset, set, unset, etc.), 10 presses total; this will create +/- 100,000 fibers (depending on system speed).

On Meteor 1.9 the CPU will be 100% for 180 sec
On Meteor 1.8 the CPU will be 100% for 60 sec

I think, again, there is definitely an issue in Node.js which needs resolving, but I would like to repeat that I think Meteor should also make improvements, like:

#10478

You can see the performance improvements from that PR in the Fiber repro repository clearly.

@elie222
Contributor

elie222 commented Jan 23, 2020

So what's the story with this? Definitely a blocker for us to upgrade to Meteor 1.9.

@vlasky
Contributor

vlasky commented Jan 23, 2020

@elie222 the problem has been solved. The release version of Meteor 1.9 runs well.

@ephemer

ephemer commented Feb 11, 2020

@evolross sorry I didn't get notified about your question here. In case you're still wondering, a mongo aggregation on the server side performed in a method looks like this (no libraries etc. required):

const MyCollection = new Mongo.Collection("my-collection");

Meteor.methods({
  myMethod(params) {
    return MyCollection.rawCollection().aggregate([
      { $match: { _id: { $in: params.ids } } },
      { $limit: 10000 },
      { $group: { _id: { ...blah... } } },
      ...etc
    ])
    .toArray(); // returns a Promise, which Meteor can handle natively
  }
});

@stale

stale bot commented Oct 31, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale-bot label Oct 31, 2020
@smeijer

smeijer commented Oct 31, 2020

Has this issue been fixed? I doubt that closing this one due to inactivity alone is the right thing to do.

@stale stale bot removed the stale-bot label Oct 31, 2020
@KoenLav
Contributor Author

KoenLav commented Oct 31, 2020

Nope.

The underlying issue in Node.js which caused fiber bombs to bring the server to a crawl has been resolved, but fiber bombs still exist and the reproduction repo still shows a huge slowdown when a large number of documents are updated while subscribed to.

My PR, outstanding for about two years now, addresses this issue.

@filipenevola
Collaborator

We don't want to hide/close unsolved important issues; this was just a rename of labels that is already finished, and stale-bot is working as expected again.

@StorytellerCZ StorytellerCZ linked a pull request Aug 19, 2021 that will close this issue