
big collections redux #14

Open
javaknight opened this issue Sep 8, 2014 · 21 comments

Comments

@javaknight

I need this to watch a big collection, but I don't want to actually publish the big collection to the client. I just need the count of the collection to be published, and if that count changes, then I need the change to be reactive and sent to the client.

Currently when I fire this up on a large collection, it just crashes my server after showing me the initial count.

@dburles
Contributor

dburles commented Sep 9, 2014

Hey @javaknight, do you have any more info on the crash? And how many records are in the collection?

@tmeasday
Member

If the collection is truly large, it might be better to do something a little different:

Meteor.publish('count', function() {
  var self = this, first = true;

  // Re-run the count on the server and publish a single tiny document,
  // rather than publishing the collection itself.
  var count = function() {
    var thisCount = Collection.find().count();
    if (first) {
      self.added('counts', 'X', {count: thisCount});
    } else {
      self.changed('counts', 'X', {count: thisCount});
    }
    first = false;
  };

  var interval = Meteor.setInterval(count, 1000); // poll every 1s
  count();
  self.ready();

  self.onStop(function() {
    Meteor.clearInterval(interval);
  });
});

Of course, ideally you'd share the interval between multiple users subscribing to the same publication. Sounds like a whole package of its own :)
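
A minimal sketch of what that sharing might look like, with one interval serving every subscriber (the subscribers array and publishCount helper are made-up names, not from this thread):

var subscribers = [];
var interval = null;

// One shared poll pushes the current count to every active subscriber.
function publishCount() {
  var thisCount = Collection.find().count();
  subscribers.forEach(function (sub) {
    sub.changed('counts', 'X', { count: thisCount });
  });
}

Meteor.publish('count', function () {
  var self = this;
  self.added('counts', 'X', { count: Collection.find().count() });
  subscribers.push(self);
  if (!interval) {
    interval = Meteor.setInterval(publishCount, 1000);
  }
  self.ready();

  self.onStop(function () {
    subscribers.splice(subscribers.indexOf(self), 1);
    if (subscribers.length === 0) {
      Meteor.clearInterval(interval); // last subscriber gone, stop polling
      interval = null;
    }
  });
});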

@tmeasday
Member

(This package pulls down and caches the _id from every record. If there are a lot of them, this is a terrible idea. But it allows it to be truly realtime.)

@chhib

chhib commented Sep 17, 2014

@tmeasday: I think you mean Meteor.setInterval instead of Meteor.setTimeout.

@tmeasday
Member

Ahh, thanks @chhib - I updated the code so people aren't confused.

@colllin

colllin commented Nov 24, 2014

@tmeasday Why is it necessary to cache the _id from every record? Compared to just incrementing on added and decrementing on removed?

@tmeasday
Member

@colllin if you are talking about livedata's added and removed, well, it'll need to cache the _id to work properly anyway. The underlying reason is basically timing issues on the oplog: if the server sees an oplog insert message, it needs to check that it hasn't already counted that document (hence the cached _id), otherwise there are edge cases in which double counting could happen.

That's my understanding of it anyway. Possibly someone could figure out a way to use the low-level oplog driver and deal with these issues, not sure.

@colllin

colllin commented Nov 25, 2014

@tmeasday Yes, that's what I was talking about. I didn't realize observe()ing added and removed documents was imperfect (could send duplicate events)... interesting. Thank you for the explanation.

@tmeasday
Member

To be clear it's the oplog that is imperfect (I think there are a bunch of issues around the exact timing of doing your initial query vs where you start observing the oplog from).

.added() and .removed() in livedata are "perfect", but have the aforementioned performance caveat (you don't want to do them on a huge cursor).

@jchristman

@javaknight, I had this same problem because my collection has 100,000+ rows. I am implementing a "scrollbox" that loads a sliding window over a collection to emulate the browser loading the entire collection. I implemented the solution @tmeasday posted above at https://github.com/jchristman/meteor-collection-scroller/blob/master/lib/collections.js if you wanna check it out (also at http://scroller.meteor.com). Atmosphere Link

@faceyspacey

@tmeasday regarding your setInterval example, couldn't you instead base the observer on a cursor limited to one row, sorted newest to oldest, and increment the count only when needed rather than on an interval? And of course call Collection.find().count() just once at the beginning, and set up the removed observer as usual. You'd just need to accept the collection as an argument instead of a cursor, perhaps a collection plus a selector plus a dateColumn.

Counts.publish = function(self, name, collection, selector, dateColumn, options) {
  var sort = {};
  sort[dateColumn] = -1;

  // Take the full count once up front; the limit-1 cursor is only there to
  // observe newly added documents.
  var count = collection.find(selector).count();
  var initializing = true;

  var observers = {
    added: function(id, fields) {
      if (initializing) return; // initial cursor contents are already counted
      count += 1;
      self.changed('counts', name, { count: count });
    },
    removed: function(id, fields) {
      count -= 1;
      self.changed('counts', name, { count: count });
    }
  };

  // etc.
};
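
One plausible completion of the // etc. part (this wiring is assumed, not from the original comment): observe the limit-1 cursor, flip the flag, then publish the starting count.

var handle = collection.find(selector, { sort: sort, limit: 1 })
    .observeChanges(observers);
initializing = false;

self.added('counts', name, { count: count });
self.ready();
self.onStop(function () {
  handle.stop();
});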

@tmeasday
Member

tmeasday commented Mar 3, 2015

@faceyspacey - Seems like a good idea for collections where you do have a date field to work with.

I'm not sure the removed will work, however. What if I remove a document that isn't the latest?

@faceyspacey

Is there really no way for Meteor's observers to skip calling all the added handlers on the first run? Like a way internal to how Meteor's observeChanges works. It seems everyone is doing the !initializing thing. ...I guess another cursor without a limit could be created just for the removed observer. Collection.remove could be overridden to somehow notify this code (obviously that won't address direct changes to the Mongo collection outside Meteor code). The first solution seems fine to me. Whatchu think?

@tmeasday
Member

tmeasday commented Mar 3, 2015

  1. Well you'll have the problem of a huge cursor again. Which means an unacceptably large data set cached on the server.
  2. Nope, doing anything off the oplog isn't going to work if you horizontally scale.

@faceyspacey

Then I guess overriding collection.remove is the only answer, coupled with a REST API endpoint to ping if you remove rows outside of Meteor. You just have to ping that API every time you directly remove rows. For me, and I'm willing to bet for the vast majority of Meteor developers, we wouldn't even need that. Maybe just a simple reset() method to call from time to time.

@faceyspacey

So I guess collection.remove would store the name of the collection in another collection (only if a count was published for that collection). No more than the collection name would need to be stored. Then in Counts.publish we just observe this auxiliary collection for newly added documents (selecting only documents with the appropriate collection name) and decrement the count when one appears. We would also remove the row from the auxiliary collection after decrementing, so it too never grows large (never more than one row lol). A rough sketch follows.
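
A rough sketch of that idea, with hypothetical names (the Removals collection and wrapRemove helper), assuming every removal goes through Meteor:

var Removals = new Mongo.Collection('removals');

// Log one row per removed document, tagged with the source collection's name.
function wrapRemove(collection, name) {
  var originalRemove = collection.remove.bind(collection);
  collection.remove = function (selector, callback) {
    var removedCount = collection.find(selector).count();
    for (var i = 0; i < removedCount; i++) {
      Removals.insert({ collection: name });
    }
    return originalRemove(selector, callback);
  };
}

// Inside the publication: watch the auxiliary collection and decrement.
Removals.find({ collection: name }).observeChanges({
  added: function (id) {
    count -= 1;
    self.changed('counts', name, { count: count });
    Removals.remove(id); // keep the auxiliary collection near-empty
  }
});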

@tmeasday
Member

tmeasday commented Mar 3, 2015

@faceyspacey if you are going to think about wacky solutions like this, I'd suggest just denormalizing the count somewhere.

@faceyspacey

Well then just resetting the count on removes would be the solution: a counts collection with the count from one publication denormalized into one row there.

@emmanuelbuah

@tmeasday the setInterval approach can also be improved by keeping track of the previous count and only sending data to the client if the current count (thisCount) differs from the previous one.
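
For instance, the interval callback from the earlier example could track the last value it sent (replacing the first flag with a lastCount variable):

var lastCount = null;
var count = function () {
  var thisCount = Collection.find().count();
  if (thisCount === lastCount) return; // unchanged, skip the DDP message
  if (lastCount === null) {
    self.added('counts', 'X', { count: thisCount });
  } else {
    self.changed('counts', 'X', { count: thisCount });
  }
  lastCount = thisCount;
};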

@emmanuelbuah

Given the current limitations of the oplog in combination with the existing observer API, I think the best solution at scale is to compute and store counts (in a MongoDB collection or on the relevant doc) on insert and remove. E.g. on adding or removing comments on a post, update a comment-counter field (possibly on the post itself, i.e. post.commentsCount). This might look silly, but it works and scales very well.
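
A sketch of that pattern using Mongo's $inc (the Posts/Comments collections and the commentsCount field are illustrative, not from this thread):

// Keep a denormalized counter in sync on every insert/remove; publishing
// the single post document then gives a reactive count for free.
Meteor.methods({
  addComment: function (postId, text) {
    Comments.insert({ postId: postId, text: text, createdAt: new Date() });
    Posts.update(postId, { $inc: { commentsCount: 1 } });
  },
  removeComment: function (commentId) {
    var comment = Comments.findOne(commentId);
    if (!comment) return;
    Comments.remove(commentId);
    Posts.update(comment.postId, { $inc: { commentsCount: -1 } });
  }
});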

@Slava

Slava commented Aug 2, 2015

I think we could make this better if we do the following steps:

  • Separate Oplog-tailing from Mongo-LiveQuery (the one that keeps the working set in memory)
  • Use the new API to stream the data through publish-counts but not store any cache
  • Implement a probabilistic data structure like HyperLogLog that can calculate set cardinality w/o storing the whole set in memory (a toy sketch follows this list)
  • Implement the naive approach with a fallback to HLL on big numbers using the cursor#count method.
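
A toy illustration of the HyperLogLog idea from the third point (plain JS; it assumes the caller supplies a 32-bit hash of each _id, and it omits the paper's small- and large-range corrections):

// Toy HyperLogLog: m = 2^b registers estimate set cardinality in O(m) memory.
function HLL(b) {
  this.b = b;
  this.m = 1 << b;
  this.registers = new Uint8Array(this.m);
}

HLL.prototype.add = function (hash32) {
  var idx = hash32 >>> (32 - this.b);   // first b bits pick a register
  var rest = (hash32 << this.b) >>> 0;  // remaining 32 - b bits
  var rank = 1;                         // leading zeros in rest, plus one
  while (rank <= 32 - this.b && (rest & 0x80000000) === 0) {
    rank += 1;
    rest = (rest << 1) >>> 0;
  }
  if (rank > this.registers[idx]) this.registers[idx] = rank;
};

HLL.prototype.count = function () {
  var sum = 0;
  for (var i = 0; i < this.m; i++) {
    sum += Math.pow(2, -this.registers[i]);
  }
  var alpha = 0.7213 / (1 + 1.079 / this.m); // bias correction
  return Math.round(alpha * this.m * this.m / sum);
};

Each _id is hashed once and fed to add; count then approximates the number of distinct ids in a few kilobytes of memory, regardless of collection size.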

@sean-stanley

Just one point I'm a bit unclear on: if I have a large collection but only want to count a small subset of it (like unread notifications for a particular online user, not all notifications for all users), then I am only caching the documents in the cursor, right, not the entire collection? So this package would work very well for counting small numbers of things.

However, I suppose if I had 500 online users, each with only 10 unread notifications, I'd still be caching 5,000 documents, right?
