KSUIDs do not sort properly. #77

Open
ShawnMilo opened this issue Jul 20, 2023 · 1 comment

@ShawnMilo

On the Linux command line and in PostgreSQL, ksuids do not sort properly.

  1. 2SmasRGkif9qETAHwssB4GGYwdi can be converted to the timestamp 2023-07-19 04:13:36 -0400 EDT
  2. 2STcKMQ7MGgoDxpX9A5WuaJlx8V can be converted to the timestamp 2023-07-12 10:59:06 -0400 EDT
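For anyone who wants to check the two timestamps above, here is a minimal Python sketch of the decoding. It assumes the usual KSUID layout (27 base62 characters encoding 20 bytes, with the first 4 bytes holding seconds since the KSUID epoch of 1400000000); the function name `ksuid_timestamp` is made up for this example and is not part of any library.

```python
from datetime import datetime, timezone

# Base62 alphabet in ASCII order: digits, then upper case, then lower case.
BASE62 = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
KSUID_EPOCH = 1_400_000_000  # seconds; offset added to the embedded timestamp


def ksuid_timestamp(ksuid: str) -> datetime:
    """Return the UTC time embedded in a KSUID string (hypothetical helper)."""
    n = 0
    for ch in ksuid:
        n = n * 62 + BASE62.index(ch)
    raw = n.to_bytes(20, "big")  # 4-byte timestamp followed by 16-byte payload
    seconds = int.from_bytes(raw[:4], "big") + KSUID_EPOCH
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


for ksuid in ("2SmasRGkif9qETAHwssB4GGYwdi", "2STcKMQ7MGgoDxpX9A5WuaJlx8V"):
    print(ksuid, ksuid_timestamp(ksuid).isoformat())
```

The times are printed in UTC, so they are four hours ahead of the EDT values quoted above.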

However, PostgreSQL and the Linux sort command both sort 2SmasRGkif9qETAHwssB4GGYwdi before 2STcKMQ7MGgoDxpX9A5WuaJlx8V, evidently because lower-case letters sort before capital letters there.

I'm assuming that the ksuid algorithm uses the logic that capital letters have a lower ASCII value:

Python 3.11.2 (main, May 30 2023, 17:45:26) [GCC 12.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> ord('A')
65
>>> ord('a')
97
>>> 
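The same two IDs can be compared directly in Python. This is only a sketch: plain `sorted` compares by code point, which matches ASCII byte order for these strings, while `str.casefold` is used here as a rough stand-in for a case-insensitive, locale-style comparison. It is an assumption for illustration, not what PostgreSQL or glibc actually implement.

```python
ids = ["2SmasRGkif9qETAHwssB4GGYwdi", "2STcKMQ7MGgoDxpX9A5WuaJlx8V"]

# Code-point order: 'T' (84) sorts before 'm' (109), so the older ID comes first.
by_code_point = sorted(ids)

# Case-insensitive approximation: 'm' sorts before 't', so the newer ID comes first.
case_insensitive = sorted(ids, key=str.casefold)

print(by_code_point)
print(case_insensitive)
```

The first list is in generation order; the second reproduces the reversed order described in this issue.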

I love the project and use ksuids as much as possible, but this issue has bitten me at work and been the cause of some subtle bugs.
Is it impossible to fix in a backwards-compatible way?

@fabiolimace
Contributor

fabiolimace commented Jul 20, 2023

There is an interesting Gist document related to the problem: PostgreSQL collation is a massive footgun.

Also check this Gist comment: Functions for generating Segment's KSUIDs on PostgreSQL. They had to change the collation on their databases from en_US.UTF-8 to C.UTF-8.

The README says that "running a set of KSUIDs through the UNIX sort command will result in a list ordered by generation time". However, the UNIX sort behaviour is affected by the locale variable, according to the sort man page:

*** WARNING *** The locale specified by the environment affects sort order. Set LC_ALL=C to get the traditional sort order that uses native byte values.
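As a rough illustration of that warning, the sketch below pipes the two IDs from this issue through the system `sort` with `LC_ALL=C` set for the child process only. It assumes a `sort` binary is on the PATH; the helper name `sort_with_c_locale` is made up for this example.

```python
import os
import subprocess


def sort_with_c_locale(lines):
    """Run the system `sort` with LC_ALL=C (hypothetical helper)."""
    env = {**os.environ, "LC_ALL": "C"}  # override the locale for the child only
    result = subprocess.run(
        ["sort"],
        input="\n".join(lines) + "\n",
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.splitlines()


print(sort_with_c_locale([
    "2SmasRGkif9qETAHwssB4GGYwdi",
    "2STcKMQ7MGgoDxpX9A5WuaJlx8V",
]))
```

With byte-value ordering, the ID beginning `2ST` comes out ahead of the one beginning `2Sm`, which matches generation time.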

I think adding a sentence to the README that specifies the correct collation could help users avoid this pitfall. Perhaps a warning would be more effective, like the one above from the sort manual page.
