Right now, the safekeeper writes the full partial WAL segment to S3 each time it does an upload. This interacts badly with S3's versioning support, which stores old versions of these partial segments.
Assuming a linear rate of WAL traffic, we upload half of our WAL segment size on average each time there is any new content at all. In the worst case, there is a small but nonzero number of bytes of new WAL every 10 minutes, and we write the same partial segment over and over again. A continuous trickle of 64 bytes every 10 minutes turns into an upload of 8 MB every 10 minutes, a blowup factor of 125 thousand. We then store these 8 MB files as noncurrent versions for 30 days.
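For concreteness, the arithmetic behind that worst-case factor (numbers taken from the scenario above; "8 MB" read as 8 × 10⁶ bytes):

```python
# Worst-case blowup: even a tiny trickle of new WAL re-uploads the
# whole partial segment. Numbers are from the scenario above.
new_wal_bytes = 64              # fresh WAL per 10-minute interval
uploaded_bytes = 8_000_000      # average partial segment re-uploaded
blowup = uploaded_bytes // new_wal_bytes
print(blowup)  # 125000
```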
There are several possible ways to deal with this:
- Turn off versioning. Versioning can only be disabled on a per-bucket basis, so we'd need a separate bucket for the partial WAL segments. However, this harms our DR abilities and violates our internal policies, so it is probably not going to happen.
- Keep versioning, but retain the versions for a shorter amount of time. This can be done on a per-tag basis. The blowup is still there, though, and if we wanted to reduce the upload interval from 10 minutes to, say, 1 minute or less, the blowup would scale accordingly. It also hurts the DR case.
- While Azure has append blobs, S3 has no analog. It does have multipart uploads, which have no expiry, so you can start one and assemble the object piece by piece. But the minimum size of a part is 5 MB, so we wouldn't gain much.
- Split partial WAL segments into parts of their own. This would result in fewer overwrites.
I think the last alternative is the best way to address this. There are multiple possible strategies, exposing a tradeoff between the number of operations, the number of overwrites, and the maximum number of files stored in the timeline prefix:
- At each 10-minute interval, write the delta into a new object if there is additional data, and nothing if there is none. If the safekeeper nodes disagree on what the last state is, the one with the newest data can write yet another file, or overwrite the existing one. This means a potentially unlimited number of files, but the minimum number of overwrites.
- Coalesce data into smaller blocks of 1 KB, and if there is new data but not enough to complete a new block, append to the existing one and upload the entire block again. Then the blowup can be at most 500, and there are at most 16 million files.
- Same as before, but with a background job that groups (completed) blocks into larger blocks. We can do this in an exponential fashion, keeping only one copy for each doubling in size until we reach the segment size. The number of overwrites introduced by this scheme is then limited to log(n), and so is the number of files.
- A combination of the first approach with a background coalescing job.
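A minimal sketch of the exponential coalescing idea from the third strategy, assuming the background job merges equal-sized blocks pairwise like a binary counter (the `coalesce` helper and the size-list representation are hypothetical, not the safekeeper's actual code):

```python
# Hypothetical sketch: each upload adds one base-size block; whenever
# two adjacent blocks have equal size, the background job merges them
# into one block of double the size. This behaves like incrementing a
# binary counter, so after n base blocks at most ~log2(n) block files
# remain in the timeline prefix.

def coalesce(blocks, new_block_size):
    blocks = blocks + [new_block_size]
    # Merge trailing equal-sized blocks until all sizes are distinct.
    while len(blocks) >= 2 and blocks[-1] == blocks[-2]:
        merged = blocks.pop() + blocks.pop()
        blocks.append(merged)
    return blocks

blocks = []
for _ in range(1000):            # simulate 1000 uploaded 1 KB blocks
    blocks = coalesce(blocks, 1024)

print(len(blocks))               # popcount of 1000, far below 1000
print(sum(blocks))               # no data is lost by merging
```

The number of files equals the popcount of the block count, bounded by log2(n) + 1, and each byte is rewritten at most log2(n) times as it migrates into ever-larger blocks.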
We can also think about reducing the segment size, since the blowup factor is related to it. At the same time, we want the system to scale to terabytes of WAL, so it's not clear we want even more files in the directory.
Related Slack thread.