Javascript Tracker: Duplicated event IDs from broken Math.random, especially bots #2967

Closed
falschparker82 opened this issue Nov 29, 2016 · 2 comments


@falschparker82

Hi Snowplowers,

We're having serious problems caused by some bots (especially Googlebot, googleweblight and others) that apparently use a broken version of Math.random(), which leads to duplicated event_ids. This gets especially painful because duplicated IDs produce huge cartesian joins when events carry many contexts.

I tracked the problem down to the uuid library that the JavaScript tracker uses: https://github.com/kelektiv/node-uuid/blob/master/lib/rng-browser.js#L17 - it eventually falls back to Math.random() if no better randomness source is available.
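To make the failure mode concrete, here is a minimal sketch of a "best available source" byte generator with a Math.random() fallback. It is not the code from node-uuid; the names makeRng, uuidV4 and brokenRng are hypothetical and only illustrate the pattern described above.

```javascript
// Hypothetical sketch, not the node-uuid implementation.
// Returns a function that yields 16 random bytes per call.
function makeRng(cryptoObj) {
  if (cryptoObj && typeof cryptoObj.getRandomValues === "function") {
    // Preferred path: the platform's cryptographic generator.
    return function cryptoRng() {
      return cryptoObj.getRandomValues(new Uint8Array(16));
    };
  }
  // Fallback path: quality depends entirely on the host's Math.random().
  return function mathRng() {
    const bytes = new Uint8Array(16);
    for (let i = 0; i < 16; i++) {
      bytes[i] = Math.floor(Math.random() * 256);
    }
    return bytes;
  };
}

// Formats 16 bytes as a version 4 UUID string.
function uuidV4(rng) {
  const b = rng();
  b[6] = (b[6] & 0x0f) | 0x40; // version 4
  b[8] = (b[8] & 0x3f) | 0x80; // RFC 4122 variant
  const hex = Array.from(b, (x) => x.toString(16).padStart(2, "0"));
  return (
    hex.slice(0, 4).join("") + "-" +
    hex.slice(4, 6).join("") + "-" +
    hex.slice(6, 8).join("") + "-" +
    hex.slice(8, 10).join("") + "-" +
    hex.slice(10, 16).join("")
  );
}

// Stand-in for a host whose random source always repeats itself.
const brokenRng = () => new Uint8Array(16).fill(7);
```

In a browser or a recent Node.js this could be wired up as `uuidV4(makeRng(globalThis.crypto))`. With a source that repeats, such as brokenRng, every call returns the same ID, which is the duplication seen with these bots.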

Unfortunately, the implementation of Math.random() is left to the JS interpreter, and it is broken on several systems. Sources:

http://stackoverflow.com/a/24224089/1281376
https://medium.com/@betable/tifu-by-using-math-random-f1c308c4fd9d#.keeelkt8v

I'd like to propose the following solution for this problem:

  1. Replace Math.random() with a seedable Mersenne twister: https://github.com/pigulla/mersennetwister (this probably needs an upstream patch to the npm uuid package). While a Mersenne twister is not suitable for cryptographic use, all that's needed here is enough well-distributed randomness to prevent event ID duplication. Also, its period of 2^19937 - 1 far exceeds the number of possible UUIDv4 values (2^122, since 6 of the 128 bits are fixed).
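As a rough illustration of point 1, the sketch below writes out a compact MT19937 generator and uses it to fill the 16 bytes of a version 4 UUID. It does not use the API of the pigulla/mersennetwister package or of the npm uuid package; the class MT19937 and the function uuidV4From are hypothetical names, and seeding is limited to a single 32-bit integer to keep the example short.

```javascript
// Illustrative MT19937 (32-bit Mersenne twister), seeded by one integer.
class MT19937 {
  constructor(seed) {
    this.mt = new Uint32Array(624);
    this.index = 624; // forces a twist before the first output
    this.mt[0] = seed >>> 0;
    for (let i = 1; i < 624; i++) {
      const prev = this.mt[i - 1] ^ (this.mt[i - 1] >>> 30);
      this.mt[i] = (Math.imul(1812433253, prev) + i) >>> 0;
    }
  }

  // Regenerates the whole 624-word state in place.
  twist() {
    const mt = this.mt;
    for (let i = 0; i < 624; i++) {
      const y = (mt[i] & 0x80000000) | (mt[(i + 1) % 624] & 0x7fffffff);
      mt[i] = mt[(i + 397) % 624] ^ (y >>> 1) ^ (y & 1 ? 0x9908b0df : 0);
    }
    this.index = 0;
  }

  // Returns the next tempered 32-bit unsigned integer.
  nextUint32() {
    if (this.index >= 624) this.twist();
    let y = this.mt[this.index++];
    y ^= y >>> 11;
    y ^= (y << 7) & 0x9d2c5680;
    y ^= (y << 15) & 0xefc60000;
    y ^= y >>> 18;
    return y >>> 0;
  }
}

// Builds a version 4 UUID string from four 32-bit outputs of the generator.
function uuidV4From(prng) {
  const b = new Uint8Array(16);
  for (let i = 0; i < 16; i += 4) {
    const w = prng.nextUint32();
    b[i] = w >>> 24;
    b[i + 1] = (w >>> 16) & 0xff;
    b[i + 2] = (w >>> 8) & 0xff;
    b[i + 3] = w & 0xff;
  }
  b[6] = (b[6] & 0x0f) | 0x40; // version 4
  b[8] = (b[8] & 0x3f) | 0x80; // RFC 4122 variant
  const hex = Array.from(b, (x) => x.toString(16).padStart(2, "0")).join("");
  return [
    hex.slice(0, 8), hex.slice(8, 12), hex.slice(12, 16),
    hex.slice(16, 20), hex.slice(20),
  ].join("-");
}
```

Two generators created with the same seed produce the same sequence of IDs, so the twister only helps if each page view gets a different seed. That is what 1a) and 1b) are about.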

1a) Make it possible to seed the RNG by invoking sp.js with an additional parameter &rnd_seed=a4b120f48... (if the page isn't cached or served via CDN, this should be fairly easy with server-side templating in most languages). The value would then be fed into the twister as a seed vector.
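One way 1a) might look on the client side is sketched below: read the rnd_seed query parameter from the script URL and split the hex string into 32-bit words. The function name seedVectorFromScriptUrl, the example host and the seed value are all assumptions for illustration; how the tracker would actually locate its own script URL is left open.

```javascript
// Hypothetical helper: turns ?rnd_seed=<hex> on the script URL into an
// array of unsigned 32-bit words. Returns null if the parameter is missing
// or is not a plain hex string.
function seedVectorFromScriptUrl(scriptUrl) {
  const hex = new URL(scriptUrl).searchParams.get("rnd_seed");
  if (!hex || !/^[0-9a-f]+$/i.test(hex)) {
    return null;
  }
  const words = [];
  for (let i = 0; i < hex.length; i += 8) {
    // Each chunk of up to 8 hex digits becomes one 32-bit word.
    words.push(parseInt(hex.slice(i, i + 8), 16) >>> 0);
  }
  return words;
}

// Example with a made-up URL and seed value.
const seedWords = seedVectorFromScriptUrl(
  "https://cdn.example.com/sp.js?v=1&rnd_seed=a4b120f400000001"
);
```

A null result would be the signal to fall back to locally gathered entropy as in 1b).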

1b) Alternatively - or as a fallback to 1a) - generate the seed vector from the following entropy sources:

  • Hash of URL value
  • getHours(), getMinutes(), getSeconds(), getMilliseconds()
  • Additionally generated floating point entropy: https://github.com/keybase/more-entropy
  • Difference of Snowplow cookie date to current time
  • Injecting Math.random() at least can't hurt
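The entropy sources listed above could be combined into a seed vector roughly as follows. This is only a sketch under stated assumptions: the function names fnv1a32 and buildSeedVector are hypothetical, FNV-1a is just one convenient choice for the URL hash, the cookie timestamp is passed in as milliseconds since the epoch, and the more-entropy source is left out because its API is not shown in this thread.

```javascript
// 32-bit FNV-1a hash over the UTF-16 code units of a string.
function fnv1a32(str) {
  let hash = 0x811c9dc5;
  for (let i = 0; i < str.length; i++) {
    hash ^= str.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return hash >>> 0;
}

// Hypothetical seed builder: one unsigned 32-bit word per entropy source.
function buildSeedVector(url, now, cookieTimeMs) {
  // Local time of day in milliseconds, from the Date getters.
  const timeOfDay =
    ((now.getHours() * 60 + now.getMinutes()) * 60 + now.getSeconds()) * 1000 +
    now.getMilliseconds();
  return [
    fnv1a32(url),                                   // hash of URL value
    timeOfDay >>> 0,                                // clock components
    (now.getTime() - cookieTimeMs) >>> 0,           // cookie age, wrapped to 32 bits
    Math.floor(Math.random() * 0x100000000) >>> 0,  // Math.random() can't hurt
  ];
}
```

A call might look like `buildSeedVector(location.href, new Date(), cookieTimeMs)`, where cookieTimeMs is an assumed value read from the Snowplow cookie. Even if Math.random() repeats across bot sessions, the other words should usually differ between page views.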

Thoughts?

Right now I've got a little too much on my plate, so feel free to grab it. If we end up implementing it ourselves, we'd love to upstream it in case you're interested.

@alexanderdean
Member

Hey @falschparker82 - many thanks for the super-detailed and thoughtful ticket.

I like 1b) - I have nothing against 1a) but I don't know of many (any?) users who don't serve sp.js via CDN...

Thoughts from the community?

@chuwy changed the title from "Javascript Collector: Duplicated event IDs from broken Math.random, especially bots" to "Javascript Tracker: Duplicated event IDs from broken Math.random, especially bots" on Nov 30, 2016
@alexanderdean
Member

alexanderdean commented Nov 30, 2016

Closing - have copied @falschparker82's great post into snowplow/snowplow-javascript-tracker#499...
