
storage: commit to some ordering between partitions in UPSERT #26965

Open
guswynn opened this issue May 7, 2024 · 3 comments · May be fixed by #27092
Labels
A-storage Area: storage

Comments

@guswynn
Contributor

guswynn commented May 7, 2024

A key changing its partition (right now, specifically Kafka partitions) is not supported by UPSERT sources, for good reason: there is no defined order between partitions, so the order in which the updates are applied is undefined.

Previous behavior

Before #24663, if you DID create an UPSERT source where a key's partition changed, we had the following behavior:

  • If the updates occur within the same mz time (i.e., they are reclocked the same) then choose the one with the higher partition.
  • If the updates occur in different mz times, choose the later one.

This is (afaiui) a non-definite collection: depending on how the exact same upstream data is reclocked, we could end up with collections that accumulate differently. However, once the data is reclocked, things work out fine, and we are able to resume an UPSERT ingestion correctly.
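For concreteness, a minimal sketch of that tie-breaking rule (hypothetical types, not the actual upsert operator code):

```rust
use std::cmp::max;

// Hypothetical stand-in for the per-key ordering data.
#[derive(PartialEq, Eq, PartialOrd, Ord, Clone, Copy, Debug)]
struct Candidate {
    mz_time: u64,   // the reclocked Materialize timestamp
    partition: i32, // the Kafka partition the update arrived on
}

/// Pick the winning update for a key: a later mz time wins, and within
/// the same mz time the higher partition wins. Deriving `Ord` on
/// (mz_time, partition) encodes exactly that lexicographic rule.
fn winner(a: Candidate, b: Candidate) -> Candidate {
    max(a, b)
}

fn main() {
    let older = Candidate { mz_time: 1, partition: 5 };
    let newer = Candidate { mz_time: 2, partition: 0 };
    assert_eq!(winner(older, newer), newer); // later mz time wins regardless of partition
}
```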

Current behavior

After #24663, we now always choose the update from the larger partition, with one caveat: if we are resuming an ingestion, updates after the resumption frontier are chosen over anything that came before.
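Roughly sketched (hypothetical types again, and assuming the caveat means post-resumption updates always beat whatever was in state before the restart):

```rust
// Hypothetical representation of a candidate update for a key.
#[derive(Clone, Copy, Debug)]
struct Update {
    partition: i32,
    after_resume: bool, // ingested after the resumption frontier?
}

/// Choose between the value currently in upsert state and an incoming
/// update for the same key.
fn winner(current: Update, incoming: Update) -> Update {
    match (current.after_resume, incoming.after_resume) {
        // Anything read after the resumption frontier beats pre-resumption state.
        (false, true) => incoming,
        (true, false) => current,
        // Otherwise the larger partition wins (ties go to the incoming update).
        _ => {
            if incoming.partition >= current.partition {
                incoming
            } else {
                current
            }
        }
    }
}
```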

This is definite in the steady-state, but resumption can cause arbitrary orders, depending on the exact moment of resumption.

Committing to a decision

I think we should commit to the current behavior; while it is not strictly correct, for Kafka sources (or Kafka-like sources) where a key's partition CAN change, but only ever increase, it is the ideal behavior.

@petrosagg
Contributor

This is definite in the steady-state, but resumption can cause arbitrary orders, depending on the exact moment of resumption.

Noting it here too. This is a bug: there should be no discrepancies in the produced collection no matter how many restart points you insert. We need to store the mz timestamp in the upsert state and use that too when comparing updates.

@guswynn
Contributor Author

guswynn commented May 20, 2024

@petrosagg for keys whose partitions are only-increasing, we ARE definite. Handling keys whose partitions can go down, and handling resumptions, requires that we hold the partition of each message in upsert state, which can't be done (currently) unless the partition is part of the output relation, because during hydration we only have the output shard.

@petrosagg
Contributor

for keys whose partitions are only-increasing, we ARE definite;

Indeed. I didn't claim there aren't cases where we are definite. We're also definite when we process an empty topic /s But in all seriousness, given that there is a way to fix this, we should.

handling resumptions requires that we hold the partition of each message in upsert state,

I don't think so. The timestamp associated with each key in upsert state should be the tuple (mz_timestamp, partition, offset). This is because the (partition, offset) pair is only needed to disambiguate between updates that happen at the same mz timestamp, and so on rehydration we can initialize the upsert state with (resume_ts, 0, 0) without problems because we will only ever process data for timestamps greater than resume_ts.
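A rough sketch of that scheme (hypothetical names, not a concrete implementation):

```rust
// The per-key "timestamp" in upsert state becomes this tuple, compared
// lexicographically: mz timestamp first, (partition, offset) only as a tie-breaker.
#[derive(PartialEq, Eq, PartialOrd, Ord, Clone, Copy, Debug)]
struct UpsertOrder {
    mz_timestamp: u64,
    partition: u64,
    offset: u64,
}

impl UpsertOrder {
    /// On rehydration we only know the resume timestamp, so seed every key
    /// with (resume_ts, 0, 0). All data processed afterwards has
    /// mz_timestamp > resume_ts, so the zeroed (partition, offset) can never
    /// decide a comparison.
    fn rehydrated(resume_ts: u64) -> Self {
        UpsertOrder { mz_timestamp: resume_ts, partition: 0, offset: 0 }
    }
}

/// An incoming update replaces the stored value iff its order is at least
/// the order recorded in state.
fn keep_incoming(current: UpsertOrder, incoming: UpsertOrder) -> bool {
    incoming >= current
}
```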

@bosconi added the A-storage Area: storage label Jun 10, 2024