Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace Math.random to prevent duplicate event IDs #499

Open
christoph-buente opened this issue Jun 24, 2016 · 30 comments
Open

Replace Math.random to prevent duplicate event IDs #499

christoph-buente opened this issue Jun 24, 2016 · 30 comments
Labels
category:core About the shared core library. priority:high To fix as soon as possible. status:needs_triage Needs maintainer triage. type:defect Bugs or weaknesses. The issue has to contain steps to reproduce.
Milestone

Comments

@christoph-buente
Copy link

christoph-buente commented Jun 24, 2016

According to the definition of a V4 UUID, event_id is supposed to be unique for every event. Unfortunately we see the same event_id more than once in our collected data.

Our example data set has 13,406,731 events from which 13,235,475 have unique event_ids. This means that there are about 1.3% events that have duplicated event_id. These events may come in completely different times and from different ip's. As an example, there are 15 different events with event_id = "f2d4f27d-077a-4086-8cce-5789b10fbfda". See the graphics below.

pasted image at 2016_06_24 12_10 pm

We even have a larger data set with more recent JS tracker versions (2.5.x and 2.6.x), and the duplicate ratio is even worse ~= 1.8%

@alexanderdean
Copy link
Member

Are you running the event fingerprint enrichment? Could you reshare the above table with the event fingerprint output as well? Trying to understand what type of duplicates these are.

@christoph-buente
Copy link
Author

christoph-buente commented Jun 24, 2016

For this dataset, which is a bit older, we did not have that enrichment enabled. (If it even was available back then.) I'll check for the current SP stack, if we have it running.

But hashing the whole event does not make a difference. If time, IP address and user agent are different. It will most definitively result in a different hash!

@alexanderdean
Copy link
Member

Thanks @christoph-buente !

@christoph-buente
Copy link
Author

Looking at the user agent i came to the suspicion that the actual implementation of the uuid algorithm falls back to Math.random() in certain browsers. Can this be confirmed? That would be a bug in the underlying uuid npm module, right?

@bogaert
Copy link

bogaert commented Jul 5, 2016

Hi @christoph-buente. This is an issue we have written about before: http://discourse.snowplowanalytics.com/t/de-deduplicating-events-in-hadoop-and-redshift-tutorial/248 (in case the solutions outlined there are useful).

@christoph-buente
Copy link
Author

@bogaert thx Chris, i read that article. But i hoped there is a chance to fix that issue instead of working around it later down the data pipeline.

@bogaert
Copy link

bogaert commented Jul 5, 2016

@christoph-buente That makes sense. We'll look into it (cc @jbeemster).

@christoph-buente
Copy link
Author

Bump 😄

@bogaert
Copy link

bogaert commented Nov 18, 2016

Hi @christoph-buente. It was good meeting in Berlin! This is something we'll keep investigating, but in the meantime we're also working on an improved deduplication step in the pipeline: https://github.com/snowplow/snowplow/issues?q=is%3Aopen+is%3Aissue+milestone%3A%22R8x+%5BHAD%5D+Synthetic+dedupe%22

@alexanderdean
Copy link
Member

See snowplow/snowplow#2967

@alexanderdean alexanderdean changed the title Duplicate event_id for apperently different events Replace Math.random to prevent duplicate event IDs Nov 30, 2016
@alexanderdean
Copy link
Member

From @falschparker82:

Hi Snowplowers,

we've got heavy problems here due to some bots (especially Googlebot, googleweblight and others) apparently using a broken version of Math.random(), leading to duplicated event_ids. This gets especially annoying with the resulting huge cartesian joins resulting from significant context use.

I could track down the problem down into the uuid library of the implementation that the Javascript collector uses: https://github.com/kelektiv/node-uuid/blob/master/lib/rng-browser.js#L17 - which happens to fall eventually back down to Math.random() if no better randomness sources are available.

Unfortunately, the implementation of Math.random() is left to the JS interpreter, which is broken on several systems - sources:

http://stackoverflow.com/a/24224089/1281376
https://medium.com/@betable/tifu-by-using-math-random-f1c308c4fd9d#.keeelkt8v

I'd like to propose the following solution for this problem:

  1. Replace Math.random() with a seedable Mersenne twister: https://github.com/pigulla/mersennetwister (probably needs upstream patching of npm uuid package). While this is not suitable for cryptographic uses, all that's needed is good entropy here to prevent event ID duplication. Also, the cycle length of >2^19000 far surpasses the capacity of a UUIDv4 (2^128).

1a) Make it possible to seed the RNG by invoking sp.js with an additional parameter &rnd_seed=a4b120f48... (if the page isn't cached or served via CDN, this should be possible rather easily with templating from multiple languages) - which then gets inserted by a seed vector into the twister.

1b) Alternatively - or as a fallback to 1a) - generate the seed vector from the following entropy sources:

  • Hash of URL value
  • getHour(),getMinutes(),getSeconds(),getMilliseconds()
  • Additionally generated floating point entropy: https://github.com/keybase/more-entropy
  • Difference of Snowplow cookie date to current time
  • Injecting Math.random() at least can't hurt

Thoughts?

Right now I've got a little too much on my plate, so feel free to grab it - but if we happen to do ourselves eventually it we'd love to upstream in case you're interested.

@alexanderdean alexanderdean added this to the Version 2.8.0 milestone Nov 30, 2016
@alexanderdean alexanderdean added type:defect Bugs or weaknesses. The issue has to contain steps to reproduce. priority:high To fix as soon as possible. labels Nov 30, 2016
@alexanderdean
Copy link
Member

Scheduling into 2.8.0 as this is important

@falschparker82
Copy link

re-hi!

(was a little confused at first since the issue link was cross-project, so #499 ended up linking to the wrong issue on snowplow/snowplow).

As said, so much for the thoughts, can't do it myself right now due to lack of time unfortunately. Just a quick clarification regarding:

I like 1b) - I have nothing against 1a) but I don't know of many (any?) users who don't serve sp.js via CDN...

I was actually thinking about templating the tag itself when rendering a web page, not the Javascript tracker. So more something along the lines of (thinking in PHP here for simplicity / demonstration purposes):

<script type="text/javascript">
    ;(function(p,l,o,w,i,n,g,r){if(!p[i]){p.GlobalSnowplowNamespace=p.GlobalSnowplowNamespace||[];
    p.GlobalSnowplowNamespace.push(i);p[i]=function(){(p[i].q=p[i].q||[]).push(arguments)
    };p[i].q=p[i].q||[];n=l.createElement(o);g=l.getElementsByTagName(o)[0];n.async=1;
    n.src=w;g.parentNode.insertBefore(n,g)}}(window,document,"script","https://static.justwatch.com/static/compile_jw/assets/sp-2.6.2.js","snowplow","<?php uniqid(); ?>"));
    </script>

@falschparker82
Copy link

Addendum: I'd be happy to review though - a few implementation hints:

  • First off, definitely needs to initialize the RNG via a larger vector/array, otherwise it's the same limited state problem all over again when using a single int
  • As the floating point entropy generation package seems pretty taxing on the CPU (especially on mobile) and thus increasing tracker latency, I'd go for not much more than 16 bits of entropy and use it only when needed (which is hard to prove)
  • I'd be a super happy camper if the fully initialized RNG or UUID functions could be somehow exposed from the module, as we've got some more things we need UUIDv4's, for example survey IDs (where we've got the same problem). Spares us initializing two different RNGs for doing the same.

@falschparker82
Copy link

Any update on this one? Can I help?

@alexanderdean
Copy link
Member

Hey @falschparker82 - we pushed it back because we are assigning a new (to Snowplow trackers) engineer to the 2.8.0 release and this is a complex ticket. To bring it back into 2.8.0 again, I think we would need a PR from you...

@falschparker82
Copy link

Unfortunately, we're super low on Frontend power right now and JavaScript isn't exactly my hometurf. I'll see if we can make a slot for that somewhere in the next weeks but no high hopes right now. Will touch base in a month at the latest.

@christoph-buente
Copy link
Author

@alexanderdean do you have capacity now, to tackle that issue? It did not dissapear, unfortunately. 😭

@alexanderdean
Copy link
Member

Potentially @christoph-buente - we should know within the next month how much bandwidth we have for JS Tracker work...

@g13013
Copy link

g13013 commented Mar 29, 2018

Much needed, any news on this issue?

@anagytherealone
Copy link

anagytherealone commented Jun 1, 2018

For the record I've put a comment on the uuid package this repo uses to be able to create more secure id-s: uuidjs/uuid#173

@alexanderdean
Copy link
Member

Thanks @anagytherealone !

@smalory
Copy link

smalory commented Feb 7, 2020

Is there any more news on this issue? What's the plan? How are things progressing?

We've noticed the same thing- clashes in snowplow-generated ids. We've also seen clashes across ids: e.g. event_ids clashing with session_ids.

There seems to be a general feeling here that with a good enough random number generator you'll get an essentially unique UUID. While this is true when using a single stream of random numbers it is not true when you have several streams of random numbers not sharing state. Unfortunately, since there are many, many trackers running in parallel, any guarantees on the uniqueness of the ids falls over and is at the whim of how the seed is determined.

Bloating the codebase with a more sophisticated random number generator is overkill. Random number generators like the Mersenne Prime Twister are designed to be robust to a whole suite of statistical tests; the goal being that the generated random numbers will be used in simulations. For UUID, all that's needed is a long periodicity; even a basic random number generator would suffice.

In my opinion, the correct approach is to simply hash a whole bunch of context relating to the event, user, session etc. and also inject some randomness through a random number generator. Then, the only way the ids would be is if the random number clashed and the entirety of the context clashed. The chances of this are essentially 0.

@paulboocock
Copy link
Contributor

Thanks for your comments @smalory
This is certainly on our radar still. We have a considerable release on the horizon in regards to v3. We're currently in the process of triaging open issues and identifying what we want to think about changing for v3 and I think this issue might make the cut in some shape or form.
Keep your eye on it and our milestones as we will be chopping and changing the milestones over the coming weeks as decide what our near term roadmap starts looking like.

@paulboocock paulboocock added category:core About the shared core library. status:needs_triage Needs maintainer triage. labels Feb 7, 2020
@smalory
Copy link

smalory commented Feb 7, 2020

Great; thanks for the info Paul!

@patmmccann
Copy link

I am investigating a similar issue. Has anyone tested if crypto.getRandomValues() works better for googleweblight or the googlebot? We also see an issue when the user agent is simply 'Mozilla 5.0' or contains adsbot

@paulboocock
Copy link
Contributor

Time for an update on this one given v3 is now available. Unfortunately little has changed at the moment, the issue arises due to falling back to Math.random() when crypto.getRandomValues() isn't available. As we continued to support older browsers with v3, this fallback has stayed in place. Until we drop support for IE9 and IE10 then we can't do too much to fix this (unless anyone has any ideas that don't add too much bloat?).

As for dropping IE9 and IE10, they have a depressingly high level of events (particularly IE9) in the data we assessed. I'd hope we can drop them when v3 is mature and move to v4 - those that want IE9 and IE10 support can then stay on the mature v3.

For the time being, deduplicating the events downstream is the best solution for this (as described earlier in this issue).

@schmkr
Copy link

schmkr commented Sep 22, 2022

Hi @paulboocock,

In light of your last comment from almost 1.5 years ago, is there nowadays still a need/requirement to support IE9 and IE10 (and even IE11)?

Are there already plans for the mentioned v4?

Alternatively, would it be an idea to offer the "older browsers" support as a separate NPM package that can be installed like the plugins when people have a need to support those older browsers?

@AlexBenny
Copy link

Hi @schmkr, thanks for raising this question.

Are there already plans for the mentioned v4?

Yes. We have plans for the v4. Indicatively we plan to release the v4 before the end of the year.
That version will drop the support to IE10. Also we will take care to reduce the issues related to Math.random.

@AlexBenny AlexBenny added this to the Version 4.0.0 milestone Sep 22, 2022
@davidher-mann
Copy link

Hi @AlexBenny, since Version 3.9 was just released: are you still sticking to the plan that the Math.random issue will be addressed in Version 4.0 ?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
category:core About the shared core library. priority:high To fix as soon as possible. status:needs_triage Needs maintainer triage. type:defect Bugs or weaknesses. The issue has to contain steps to reproduce.
Projects
None yet
Development

No branches or pull requests