Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Serialisation/Deserialisation #11

Open
wants to merge 13 commits into
base: master
Choose a base branch
from

Conversation

domhauton
Copy link

In order to aggregate tdigests from multiple services it's necessary to serialise and deserialise the data to be sent.

This change implements the serialisation from AVLTree in tdunning's original implementation with some added safety features.

Both libraries can now process and swap data using the small and large encoding types.

https://github.com/tdunning/t-digest/blob/e8ca8479b7c98deb4bfe381a4e465e33e7063262/core/src/main/java/com/tdunning/math/stats/AVLTreeDigest.java#L340

The raw data used to generate the test data with tdunning's library is provided.

@welch
Copy link
Owner

welch commented Dec 10, 2018

very cool, I will have some time later this week to look this over.
serial interop with another library is very appealing.

as with the previous PR, I'm not a fan of defending against bad input
when doing so embeds new assumptions, so the overflow protection
will be a hard sell.

@domhauton
Copy link
Author

I have changed the assumptions to be based on the input buffer length. This will always be true unless the data is corrupted, in which case the protection is valid and it requires no extra configuration from the user.

It's valuable to have this to prevent OOM problems and wasted compute on receiving corrupted data. It's fairly likely to happen within t-digest because of the different encoding between histogram types. If you put an encoded MergingDigest into the AVLTreeDigest decoder it reads an incorrect (usually very high) centroid count and proceed to fail or OOM. If collecting digests from thousands of services written in varying languages it makes this kind of mistake very likely.

It would be great to save people from production outages due to crossed wires if we can.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants