
safekeeper: address scaling concerns around partial segment uploads #7732

Open
arpad-m opened this issue May 13, 2024 · 0 comments
Labels
a/tech_debt Area: related to tech debt c/storage/safekeeper Component: storage: safekeeper

Comments

@arpad-m
Member

arpad-m commented May 13, 2024

Right now, the safekeeper writes the full partial WAL segment to S3 each time it does an upload. This interacts badly with S3's versioning support, which keeps old versions of these partial segments around.

Assuming a linear amount of WAL traffic, we upload half of our WAL segment size on average each time there is any new content at all. In the worst case, there is a small but nonzero number of bytes of new WAL every 10 minutes, and we write the same partial segment over and over again. A steady trickle of 64 bytes every 10 minutes turns into an upload of 8 MB every 10 minutes, a blowup factor of 125 thousand. We then store these 8 MB files as noncurrent versions for 30 days.
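The arithmetic above can be sanity-checked with a small sketch. The 16 MB segment size is an assumption here (the issue only states the 8 MB half-segment average), and the decimal-MB convention matches the "125 thousand" figure:

```python
# Back-of-the-envelope check of the blowup factor described above.
# Assumptions: 16 MB (decimal) WAL segments, uploads every 10 minutes,
# and a partial segment that is on average half full (8 MB).

SEGMENT_SIZE = 16 * 1000 * 1000          # decimal MB, matching the issue
avg_partial_upload = SEGMENT_SIZE // 2   # 8 MB uploaded per interval
new_wal_per_interval = 64                # bytes of genuinely new WAL

blowup = avg_partial_upload // new_wal_per_interval
print(blowup)  # 125000: each new byte costs ~125,000 uploaded bytes

# With 30-day retention of noncurrent versions, those 8 MB objects pile up:
intervals_per_day = 24 * 6
retained = avg_partial_upload * intervals_per_day * 30
print(retained / 1e9)  # ~34.56 GB of noncurrent versions per busy timeline
```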

There are several possible ways to deal with this:

  • Turn off versioning. Versioning can only be turned off on a per-bucket basis, so we'd need a separate bucket for the partial WAL segments. However, this harms our DR abilities and violates our internal policies, so it's probably not going to happen.
  • Keep versioning, but retain versions for a shorter amount of time. This can be done on a per-tag basis. The blowup is still there though, and if we wanted to reduce the interval from 10 minutes to say 1 minute or less, the blowup would scale accordingly. It also hurts the DR case.
  • While Azure has append blobs, S3 has no analog. It does have multipart uploads, which have no expiry, so you can start one and assemble it piece by piece. But the minimum size of a part is 5 MB, so we wouldn't gain much.
  • Split partial WAL segments into parts of their own. This would result in fewer overwrites.
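To illustrate why the multipart route is a dead end: at the worst-case trickle rate from above, buffering enough data for even one minimum-size part takes well over a year (this uses S3's 5 MiB part minimum and the 64-bytes-per-10-minutes rate assumed earlier):

```python
# S3 multipart uploads require every part except the last to be at
# least 5 MiB. At a trickle rate, filling one such part takes ages.

MIN_PART = 5 * 1024 * 1024   # S3's minimum part size, in bytes
trickle = 64                 # bytes of new WAL per 10-minute interval

intervals = -(-MIN_PART // trickle)   # ceiling division
minutes = intervals * 10
print(minutes / (60 * 24))  # ~569 days to accumulate a single part
```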

I think the last alternative is the best way to address this. There are multiple possible strategies, exposing a tradeoff between the number of operations, the number of overwrites, and the maximum number of files stored in the timeline prefix:

  • At each 10-minute interval, write the delta into a new object if there is additional data, and nothing if there is none. If the safekeeper nodes disagree about what the last state is, the one with the newest data can just write yet another file, or overwrite the existing one. This means a potentially unlimited number of files, but it has the minimum number of overwrites.
  • Coalesce data into smaller blocks of 1 KB, and if there is new data but not enough to create a new block, append to the existing block and upload the entire new block. The blowup is then at most 500, and there are at most 16 million files.
  • Same as before, but with a background job that groups (completed) blocks into larger blocks. We can do this in an exponential fashion, keeping only one copy per doubling in size until we reach the segment size. The number of overwrites introduced by this scheme is then limited to log(n), and so is the number of files.
  • A combination of the first approach with a background coalescing job.
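The exponential grouping in the third strategy can be sketched as a binary-counter style merge policy, similar to LSM tree compaction: whenever two blocks of the same size exist, merge them into one of double the size, so the set of block sizes mirrors the binary representation of the total and stays O(log n). This is an illustrative sketch, not safekeeper code; `coalesce` and the in-memory size list are hypothetical:

```python
# Hypothetical sketch: keep at most one block per power-of-two size.
# Whenever two blocks of equal size exist, merge them (one overwrite)
# into a block of double the size, cascading like a binary counter.

def coalesce(blocks: list[int], new_block: int) -> list[int]:
    """blocks: sizes of stored blocks, largest first; new_block: its size."""
    blocks = sorted(blocks + [new_block], reverse=True)
    merged = True
    while merged:
        merged = False
        for i in range(len(blocks) - 1):
            if blocks[i] == blocks[i + 1]:
                blocks[i] *= 2          # one overwrite: merge two blocks
                del blocks[i + 1]
                merged = True
                break
    return blocks

# Appending 16 one-KB blocks leaves a single 16 KB block:
state: list[int] = []
for _ in range(16):
    state = coalesce(state, 1024)
print(state)  # [16384]

# 13 blocks (binary 1101) leave 8 KB, 4 KB and 1 KB -- O(log n) files:
state = []
for _ in range(13):
    state = coalesce(state, 1024)
print(state)  # [8192, 4096, 1024]
```

Each merge overwrites one object, and a block of size 2^k is only ever rewritten once per doubling, which is where the log(n) overwrite bound comes from.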

We can also think about reducing the segment size, as the blowup factor is related to it. At the same time, we want the system to scale to TBs of WAL, so I'm not sure we want even more files in the directory.
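A quick file-count check makes the tension concrete. The 16 MiB segment size is an assumption (the Postgres default), not stated in the issue:

```python
# Rough file-count check for the "TBs of WAL" concern: halving the
# segment size halves the partial-upload blowup, but doubles the
# number of segment files per TiB of WAL. Assumes 16 MiB segments.

SEGMENT = 16 * 1024 * 1024
total_wal = 1 * 1024**4            # 1 TiB of WAL

print(total_wal // SEGMENT)        # 65536 segment files per TiB
print(total_wal // (SEGMENT // 2)) # 131072 files if segments are halved
```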

Related Slack thread.

@arpad-m arpad-m added the c/storage/safekeeper Component: storage: safekeeper label May 13, 2024
@jcsp jcsp changed the title address partial wal scaling safekeeper: address scaling concerns around partial segment uploads May 14, 2024
@jcsp jcsp added the a/tech_debt Area: related to tech debt label May 14, 2024