Change `versions` table layout for performance #1457

fatkodima · 2024-01-27T16:57:47Z

We currently use paper_trail and have billions of rows in the versions table, so the table is huge.

One easy improvement I noticed is the versions table layout. Currently, its layout is not optimal and wastes space inside each row on alignment padding. There are good articles on the topic, like one and two. Basically, we need to put the fixed-size fields first in the table, ordered so as to minimize padding.

With the current layout, if the user decides to use bigint for whodunnit (see #1456), then whodunnit, item_id and created_at must each be positioned on an 8-byte boundary (because each of them is 8 bytes in size), so padding may be added after the field that precedes each of them. This can be as much as 7 bytes of padding per field.
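To make the alignment argument concrete, here is a minimal sketch that adds up the padding for two column orders. The column order, the value lengths and the `padded_size` helper are illustrative assumptions, not taken from this PR. It also simplifies things: short strings are treated as unaligned values with a 1-byte header, and the tuple header and end-of-row padding are ignored.

```python
def padded_size(columns):
    """Return (total_bytes, padding_bytes) for columns given as (name, size, align)."""
    offset = 0
    padding = 0
    for _name, size, align in columns:
        pad = (-offset) % align  # bytes needed to reach the next aligned offset
        padding += pad
        offset += pad + size
    return offset, padding


# Hypothetical row in an assumed "strings interleaved" column order.
original = [
    ("id", 8, 8),
    ("item_type", 5, 1),   # assumed: "Post" plus a 1-byte length header
    ("item_id", 8, 8),
    ("event", 7, 1),       # assumed: "update" plus a 1-byte length header
    ("whodunnit", 8, 8),   # bigint, as discussed in #1456
    ("object", 41, 1),     # assumed: a short 40-byte payload plus header
    ("created_at", 8, 8),
]

# Same values with the fixed-size 8-byte columns moved to the front.
reordered = [
    ("id", 8, 8),
    ("item_id", 8, 8),
    ("whodunnit", 8, 8),
    ("created_at", 8, 8),
    ("item_type", 5, 1),
    ("event", 7, 1),
    ("object", 41, 1),
]

print(padded_size(original))   # (96, 11)
print(padded_size(reordered))  # (85, 0)
```

With these made-up lengths the interleaved order spends 11 bytes per row on padding, while the reordered one spends none. The exact figure depends on the actual string lengths in each row.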

For example, if we have 4 billion records in the database and each row has 21 bytes of wasted padding (the worst case of 7 bytes before each of the three fields), then we can save 4 * 10^9 * 21 / 10^9 = 84 GB, on the order of 100 GB 🔥, of disk space by just doing this simple table layout change.
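The estimate above can be checked with a few lines of arithmetic. The `wasted_gigabytes` helper is a hypothetical name used only for this illustration, and it uses decimal gigabytes (10^9 bytes).

```python
def wasted_gigabytes(rows, padding_per_row):
    """Padding overhead in decimal gigabytes (10**9 bytes)."""
    return rows * padding_per_row / 10**9


# Worst case from the description: 7 bytes of padding before each of 3 columns.
worst_case_padding = 3 * 7

print(wasted_gigabytes(4 * 10**9, worst_case_padding))  # 84.0
```

So the worst case works out to about 84 GB for 4 billion rows. Real savings would be lower wherever the preceding values happen to end on or near an 8-byte boundary.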

Also, afaik, Postgres precalculates offsets for the fixed-size columns at the start of the row (the prefix of columns with static types) and can then jump directly to a specific column using those offsets when reading the row, instead of walking through the dynamically sized columns one by one to reach the needed column. So this will also speed up reading of the whodunnit, item_id and created_at columns.
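As a rough illustration of that idea, the toy model below computes which columns have an offset that is the same in every row. It is only a sketch of the principle, not PostgreSQL's actual code. The `fixed_prefix_offsets` helper and both column orders are assumptions, and NULL values are ignored.

```python
def fixed_prefix_offsets(columns):
    """Offsets of the leading fixed-width columns.

    columns: list of (name, size, align); size is None for variable-width types.
    Stops at the first variable-width column, because every offset after it
    depends on the data stored in that particular row.
    """
    offsets = {}
    offset = 0
    for name, size, align in columns:
        if size is None:
            break
        offset += (-offset) % align  # align the start of this column
        offsets[name] = offset
        offset += size
    return offsets


# Assumed column orders, for illustration only.
interleaved = [
    ("id", 8, 8), ("item_type", None, 1), ("item_id", 8, 8),
    ("event", None, 1), ("whodunnit", 8, 8), ("object", None, 1),
    ("created_at", 8, 8),
]
fixed_first = [
    ("id", 8, 8), ("item_id", 8, 8), ("whodunnit", 8, 8),
    ("created_at", 8, 8), ("item_type", None, 1), ("event", None, 1),
    ("object", None, 1),
]

print(fixed_prefix_offsets(interleaved))  # {'id': 0}
print(fixed_prefix_offsets(fixed_first))
# {'id': 0, 'item_id': 8, 'whodunnit': 16, 'created_at': 24}
```

In this model only `id` has a known offset in the interleaved order, while all four 8-byte columns do once they are moved to the front.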

I believe this will improve the situation for MySQL too.

Wrote good commit messages.
Feature branch is up-to-date with master (if not - rebase it).
Squashed related commits together.
Added tests.
Added an entry to the Changelog if the new code introduces user-observable changes.
The PR relates to only one subject with a clear title and description in grammatically correct, complete sentences.

jonatas · 2024-04-12T20:02:40Z

Hey @fatkodima, that looks so cool! Have you looked at adding timescaledb to also partition the data by time? It would bring massive storage gains by using compression with dictionary algorithms over all the repeated values.

Happy to have a chat and introduce it.

Change versions table layout for performance

00827b9
