
big collections redux #14

Open
javaknight opened this issue Sep 8, 2014 · 21 comments

Comments

@javaknight

I need this to watch a big collection, but I don't want to actually publish the big collection to the client. I just need the count of the collection to be published, and if that count changes, then I need the change to be reactive and sent to the client.

Currently when I fire this up on a large collection, it just crashes my server after showing me the initial count.

@dburles
Contributor

dburles commented Sep 9, 2014

Hey @javaknight, do you have any more info on the crash? And how many records are in the collection?

@tmeasday
Member

If the collection is truly large, it might be better to do something a little different:

Meteor.publish('count', function() {
  var self = this, first = true;

  // Re-run the count on the server and publish a single tiny document,
  // rather than publishing the collection itself.
  var count = function() {
    var thisCount = Collection.find().count();
    if (first) {
      self.added('counts', 'X', {count: thisCount});
    } else {
      self.changed('counts', 'X', {count: thisCount});
    }
    first = false;
  };

  var interval = Meteor.setInterval(count, 1000); // poll every 1s
  count();
  self.ready();

  self.onStop(function() {
    Meteor.clearInterval(interval);
  });
});

Of course, ideally you'd share the interval between multiple users subscribing to the same publication. Sounds like a whole package of its own :)
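
A minimal sketch of what that sharing might look like, with one interval serving every subscriber (the subscribers array and publishCount helper are made-up names, not from this thread):

var subscribers = [];
var interval = null;

// One shared poll pushes the current count to every active subscriber.
function publishCount() {
  var thisCount = Collection.find().count();
  subscribers.forEach(function (sub) {
    sub.changed('counts', 'X', { count: thisCount });
  });
}

Meteor.publish('count', function () {
  var self = this;
  self.added('counts', 'X', { count: Collection.find().count() });
  subscribers.push(self);
  if (!interval) {
    interval = Meteor.setInterval(publishCount, 1000);
  }
  self.ready();

  self.onStop(function () {
    subscribers.splice(subscribers.indexOf(self), 1);
    if (subscribers.length === 0) {
      Meteor.clearInterval(interval); // last subscriber gone, stop polling
      interval = null;
    }
  });
});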

@tmeasday
Member

(This package pulls down and caches the _id from every record. If there are a lot of them, this is a terrible idea. But it allows it to be truly realtime.)

@chhib

chhib commented Sep 17, 2014

@tmeasday: I think you mean Meteor.setInterval instead of Meteor.setTimeout.

@tmeasday
Member

Ahh, thanks @chhib - I updated the code so people aren't confused.

@colllin

colllin commented Nov 24, 2014

@tmeasday Why is it necessary to cache the _id from every record? Compared to just incrementing on added and decrementing on removed?

@tmeasday
Member

@colllin if you are talking about livedata's added and removed, well, it'll need to cache the _id to work properly anyway. The underlying reason is basically timing issues on the oplog: if the server sees an oplog insert message, it needs to check that it hasn't already counted that document (hence the cached _id), otherwise there are edge cases in which double counting could happen.

That's my understanding of it anyway. Possibly someone could figure out a way to use the low-level oplog driver and deal with these issues, not sure.

@colllin

colllin commented Nov 25, 2014

@tmeasday Yes, that's what I was talking about. I didn't realize observe()ing added and removed documents was imperfect (could send duplicate events)... interesting. Thank you for the explanation.

@tmeasday
Member

To be clear it's the oplog that is imperfect (I think there are a bunch of issues around the exact timing of doing your initial query vs where you start observing the oplog from).

.added() and .removed() in livedata are "perfect", but have the aforementioned performance caveat (you don't want to do them on a huge cursor).

@jchristman

@javaknight, I had this same problem because my collection has 100,000+ rows. I am implementing a "scrollbox" that loads a sliding window over a collection to emulate the browser loading the entire collection. I implemented the solution @tmeasday posted above at https://github.com/jchristman/meteor-collection-scroller/blob/master/lib/collections.js if you wanna check it out (also at http://scroller.meteor.com). Atmosphere Link

@faceyspacey

@tmeasday regarding your setInterval example, couldn't you instead base the observer on a cursor limited to one row, sorted newest to oldest, and increment the count only when needed rather than on an interval? And of course call Collection.find().count() just once at the beginning, and set up the removed observer as usual. You'd just need to accept the collection as an argument instead of a cursor, perhaps a collection plus a selector plus a dateColumn.

Counts.publish = function(self, name, collection, selector, dateColumn, options) {
  var sort = {};
  sort[dateColumn] = -1;

  // Take the full count once up front; the limit-1 cursor is only there to
  // observe newly added documents.
  var count = collection.find(selector).count();
  var initializing = true;

  var observers = {
    added: function(id, fields) {
      if (initializing) return; // initial cursor contents are already counted
      count += 1;
      self.changed('counts', name, { count: count });
    },
    removed: function(id, fields) {
      count -= 1;
      self.changed('counts', name, { count: count });
    }
  };

  // etc.
};
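
One plausible completion of the // etc. part (this wiring is assumed, not from the original comment): observe the limit-1 cursor, flip the flag, then publish the starting count.

var handle = collection.find(selector, { sort: sort, limit: 1 })
    .observeChanges(observers);
initializing = false;

self.added('counts', name, { count: count });
self.ready();
self.onStop(function () {
  handle.stop();
});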

@tmeasday
Member

tmeasday commented Mar 3, 2015

@faceyspacey - Seems like a good idea for collections where you do have a date field to work with.

I'm not sure the removed will work, however. What if I remove a document that isn't the latest?

@faceyspacey

Is there really no way for Meteor's observers to skip calling all the added handlers on the first run? Like a way internal to how Meteor's observeChanges works. It seems everyone is doing the !initializing thing. ...I guess another cursor without a limit could be created just for the removed observer. Collection.remove could be overridden to somehow notify this code (obviously that won't address direct changes to the Mongo collection outside Meteor code). The first solution seems fine to me. Whatchu think?

@tmeasday
Member

tmeasday commented Mar 3, 2015

  1. Well you'll have the problem of a huge cursor again. Which means an unacceptably large data set cached on the server.
  2. Nope, doing anything off the oplog isn't going to work if you horizontally scale.

@faceyspacey

Then I guess overriding collection.remove is the only answer, coupled with a REST API endpoint to ping if you remove rows outside of Meteor. You just have to ping that API every time you directly remove rows. For me, and I'm willing to bet for the vast majority of Meteor developers, we wouldn't even need that. Maybe just a simple reset() method to call from time to time.

@faceyspacey

So I guess collection.remove would store the name of the collection in another collection (only if a count was published for that collection). No more than the collection name would need to be stored. Then in Counts.publish we just observe this auxiliary collection for newly added documents (selecting only documents with the appropriate collection name) and decrement the count when one appears. We would also remove the row from the auxiliary collection after decrementing, so it too never grows large (never more than one row lol). A rough sketch follows.
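
A rough sketch of that idea, with hypothetical names (the Removals collection and wrapRemove helper), assuming every removal goes through Meteor:

var Removals = new Mongo.Collection('removals');

// Log one row per removed document, tagged with the source collection's name.
function wrapRemove(collection, name) {
  var originalRemove = collection.remove.bind(collection);
  collection.remove = function (selector, callback) {
    var removedCount = collection.find(selector).count();
    for (var i = 0; i < removedCount; i++) {
      Removals.insert({ collection: name });
    }
    return originalRemove(selector, callback);
  };
}

// Inside the publication: watch the auxiliary collection and decrement.
Removals.find({ collection: name }).observeChanges({
  added: function (id) {
    count -= 1;
    self.changed('counts', name, { count: count });
    Removals.remove(id); // keep the auxiliary collection near-empty
  }
});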

@tmeasday
Member

tmeasday commented Mar 3, 2015

@faceyspacey if you are going to think about wacky solutions like this, I'd suggest just denormalizing the count somewhere.

@faceyspacey

Well then just resetting the count on removes would be the solution: a counts collection with the count from one publication denormalized into one row there.

@emmanuelbuah

@tmeasday the setInterval approach can also be improved by keeping track of the previous count and only sending data to the client if the current count (thisCount) differs from the previous one.
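
For instance, the interval callback from the earlier example could track the last value it sent (replacing the first flag with a lastCount variable):

var lastCount = null;
var count = function () {
  var thisCount = Collection.find().count();
  if (thisCount === lastCount) return; // unchanged, skip the DDP message
  if (lastCount === null) {
    self.added('counts', 'X', { count: thisCount });
  } else {
    self.changed('counts', 'X', { count: thisCount });
  }
  lastCount = thisCount;
};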

@emmanuelbuah

Given the current limitations of the oplog in combination with the existing observer API, I think the best solution at scale is to compute and store counts (in a MongoDB collection or on the relevant doc) on insert and remove. E.g. on adding or removing comments on a post, update a comment-counter field (possibly on the post itself, i.e. post.commentsCount). This might look silly, but it works and scales very well.
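
A sketch of that pattern using Mongo's $inc (the Posts/Comments collections and the commentsCount field are illustrative, not from this thread):

// Keep a denormalized counter in sync on every insert/remove; publishing
// the single post document then gives a reactive count for free.
Meteor.methods({
  addComment: function (postId, text) {
    Comments.insert({ postId: postId, text: text, createdAt: new Date() });
    Posts.update(postId, { $inc: { commentsCount: 1 } });
  },
  removeComment: function (commentId) {
    var comment = Comments.findOne(commentId);
    if (!comment) return;
    Comments.remove(commentId);
    Posts.update(comment.postId, { $inc: { commentsCount: -1 } });
  }
});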

@Slava

Slava commented Aug 2, 2015

I think we could make this better if we do the following steps:

  • Separate Oplog-tailing from Mongo-LiveQuery (the one that keeps the working set in memory)
  • Use the new API to stream the data through publish-counts but not store any cache
  • Implement a probabilistic data structure like HyperLogLog that can calculate set cardinality w/o storing the whole set in memory (a toy sketch follows this list)
  • Implement the naive approach with a fallback to HLL on big numbers using the cursor#count method.
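
A toy illustration of the HyperLogLog idea from the third point (plain JS; it assumes the caller supplies a 32-bit hash of each _id, and it omits the paper's small- and large-range corrections):

// Toy HyperLogLog: m = 2^b registers estimate set cardinality in O(m) memory.
function HLL(b) {
  this.b = b;
  this.m = 1 << b;
  this.registers = new Uint8Array(this.m);
}

HLL.prototype.add = function (hash32) {
  var idx = hash32 >>> (32 - this.b);   // first b bits pick a register
  var rest = (hash32 << this.b) >>> 0;  // remaining 32 - b bits
  var rank = 1;                         // leading zeros in rest, plus one
  while (rank <= 32 - this.b && (rest & 0x80000000) === 0) {
    rank += 1;
    rest = (rest << 1) >>> 0;
  }
  if (rank > this.registers[idx]) this.registers[idx] = rank;
};

HLL.prototype.count = function () {
  var sum = 0;
  for (var i = 0; i < this.m; i++) {
    sum += Math.pow(2, -this.registers[i]);
  }
  var alpha = 0.7213 / (1 + 1.079 / this.m); // bias correction
  return Math.round(alpha * this.m * this.m / sum);
};

Each _id is hashed once and fed to add; count then approximates the number of distinct ids in a few kilobytes of memory, regardless of collection size.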

@sean-stanley

Just one point I'm a bit unclear on: if I have a large collection but only want to count a small subset of it (like unread notifications for a particular online user, not all notifications for all users), then I am only caching the documents in the cursor, right, not the entire collection? So this package would work very well for counting small numbers of things.

However, I suppose if I had 500 online users, each with only 10 unread notifications, I'd still be caching 5,000 documents, right?
