Change versions table layout for performance #1457

Open
wants to merge 1 commit into base: master

fatkodima

We currently use paper_trail and have billions of rows in the versions table, so the table is huge.

One easy improvement I noticed is the layout of the versions table. The current column order is not optimal and causes unnecessary alignment padding inside each row. There are good articles on the topic, like one and two. Basically, we need to put the fixed-size fields first in the table, ordered so as to minimize padding.

With the current layout, if the user decides to use bigint for whodunnit (see #1456), then whodunnit, item_id and created_at must each be positioned on an 8-byte boundary (because each of them is 8 bytes in size), and the field that precedes each of them may get padding added at its end to make that happen. This can be as much as 7 bytes of padding per field.
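To make the padding argument concrete, here is a rough sketch of how alignment padding accumulates. It is a simplified model, not PostgreSQL's actual tuple-forming code: it assumes 8-byte types align to 8 bytes and that short variable-length values carry a 1-byte header and need no alignment. The column order in `current` is my assumption about the default migration, and the string lengths are made-up example values.

```python
def aligned(offset, alignment):
    """Round offset up to the next multiple of alignment."""
    return (offset + alignment - 1) // alignment * alignment


def row_data_size(columns):
    """columns: list of (name, size_in_bytes, alignment).

    Returns (total_bytes, padding_bytes) for the column data only;
    the tuple header and trailing row alignment are ignored.
    """
    offset = 0
    padding = 0
    for _name, size, alignment in columns:
        start = aligned(offset, alignment)
        padding += start - offset
        offset = start + size
    return offset, padding


# Hypothetical example row: sizes of the string columns include a
# 1-byte header (e.g. item_type "Order" = 5 + 1 bytes).
current = [
    ("id", 8, 8),
    ("item_type", 6, 1),
    ("item_id", 8, 8),
    ("event", 7, 1),
    ("whodunnit", 8, 8),   # bigint, as proposed in #1456
    ("object", 41, 1),
    ("created_at", 8, 8),
]

# Fixed-size columns first, variable-length columns last.
reordered = [
    ("id", 8, 8),
    ("item_id", 8, 8),
    ("whodunnit", 8, 8),
    ("created_at", 8, 8),
    ("item_type", 6, 1),
    ("event", 7, 1),
    ("object", 41, 1),
]

print(row_data_size(current))    # (96, 10)
print(row_data_size(reordered))  # (86, 0)
```

In this example the padding happens to be 2 + 1 + 7 = 10 bytes; the actual amount depends on the lengths of the variable-length values in each row, up to 7 bytes per 8-byte column.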

For example, if we have 4 billion records in the database and each row has 21 bytes of wasted padding (3 fields with up to 7 bytes each), then we can save 4 * 10^9 * 21 / 10^9 = 84 GB 🔥, so on the order of 100 GB of space, just by making this simple table layout change.
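The estimate can be checked with a few lines of arithmetic. This uses the worst case of 7 bytes of padding for each of the three 8-byte columns; real savings depend on the data.

```python
rows = 4 * 10**9                 # 4 billion rows
wasted_bytes_per_row = 3 * 7     # three 8-byte columns, up to 7 bytes of padding each
saved_gb = rows * wasted_bytes_per_row / 10**9

print(saved_gb)  # 84.0
```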

Also, afaik, postgres precalculates offsets for the fixed-size columns at the start of the row (that is, for the leading run of columns with static types) and can then jump straight to a specific column using those offsets when reading the row, instead of walking through the preceding variable-size columns to reach it. So this should also speed up reading the whodunnit, item_id and created_at columns.

I believe this will improve the situation for MySQL too.

  • Wrote good commit messages.
  • Feature branch is up-to-date with master (if not - rebase it).
  • Squashed related commits together.
  • Added tests.
  • Added an entry to the Changelog if the new
    code introduces user-observable changes.
  • The PR relates to only one subject with a clear title
    and description in grammatically correct, complete sentences.

jonatas commented Apr 12, 2024

Hey @fatkodima, that looks so cool! Have you looked into adding TimescaleDB to also partition the data by time? It would bring massive storage gains by compressing all the repeated values with dictionary algorithms.

Happy to have a chat and introduce it.
