dolt diff does not display added and corresponding deleted rows next to each other #7808

Open
liuliu-dev opened this issue May 1, 2024 · 4 comments

@liuliu-dev
Contributor

This affects how we display diff rows on DoltHub.

repro:

  • Clone this db and check out a new branch.
  • Change pk2 in two rows:
UPDATE `change_in_pks` SET `pk2` = "4" WHERE `pk1` = "2" AND `pk2` = "3";
UPDATE `change_in_pks` SET `pk2` = "5" WHERE `pk1` = "2" AND `pk2` = "6";
  • Run dolt diff:
    Screenshot 2024-05-01 at 2 32 52 PM

related dolthub pull: https://www.dolthub.com/repositories/liuliu/test1/pulls/11/compare

@nicktobey
Contributor

I'm not sure this is even possible to fix without breaking history-independence.

Commits store the current state of the table, but they don't track the commands that led to that state. This is on purpose: two tables with the same state should have the same hash regardless of the sequence of commands that were run.

This means that there's no way for the commit to know that the added and deleted rows came from the same update operation.
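To make the point concrete, here is a minimal Python sketch. It is an illustration only and does not reflect Dolt's actual storage format; the `state_hash` helper, the table contents, and the use of SHA-256 over sorted rows are all assumptions chosen for the example. It shows that a hash computed purely from table contents cannot distinguish an in-place key update from a delete followed by an insert.

```python
import hashlib
import json


def state_hash(rows):
    """Hash a table purely from its contents (primary key -> value).

    Hypothetical helper: rows are sorted by key first, so the order in
    which they were written has no effect on the result.
    """
    canonical = json.dumps(sorted(rows.items()), separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()


# Made-up starting table: (pk1, pk2) -> a single non-key value.
start = {(2, 3): "a", (2, 6): "b"}

# History A: change the primary key of one row in place.
via_update = dict(start)
via_update[(2, 4)] = via_update.pop((2, 3))

# History B: delete the old row, then insert a new row with the same value.
via_delete_insert = dict(start)
del via_delete_insert[(2, 3)]
via_delete_insert[(2, 4)] = "a"

# Both histories end in the same state, so they hash identically.
same = state_hash(via_update) == state_hash(via_delete_insert)
print(same)
```

Since both histories produce the same hash, anything derived only from the stored state, including the diff, has to be the same for both.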

@liuliu-dev
Contributor Author

liuliu-dev commented May 14, 2024

@nicktobey It seems this is not about the sequence of commands? I might be wrong, since I'm not familiar with Dolt's implementation...

A modified row is displayed as a deleted row plus an added row, and we hope to display the two next to each other. But when a commit changes primary key values, that ordering can break, since the diff rows are sorted by primary key.
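The ordering problem can be sketched in a few lines of Python. This is a simplified model, not Dolt's diff code: the `key_sorted_diff` function and the non-key values "x" and "y" are made up for illustration, while the key changes mirror the repro above.

```python
def key_sorted_diff(before, after):
    """Return (marker, key, value) entries ordered by primary key."""
    entries = []
    for key in sorted(before.keys() | after.keys()):
        if key in before and key not in after:
            entries.append(("-", key, before[key]))
        elif key in after and key not in before:
            entries.append(("+", key, after[key]))
        elif before[key] != after[key]:
            # Same key, different values: shown as an adjacent -/+ pair.
            entries.append(("-", key, before[key]))
            entries.append(("+", key, after[key]))
    return entries


# pk2 changes 3 -> 4 in one row and 6 -> 5 in the other.
before = {(2, 3): "x", (2, 6): "y"}
after = {(2, 4): "x", (2, 5): "y"}

diff = key_sorted_diff(before, after)
for marker, key, value in diff:
    print(marker, key, value)
# - (2, 3) x
# + (2, 4) x
# + (2, 5) y
# - (2, 6) y
```

The second row's added entry sorts before its deleted entry, and with more rows in between the two halves of a pair can end up far apart.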

Here is another example that Tim found, if that's clearer: https://www.dolthub.com/repositories/dolthub/hospital-price-transparency-v3/pulls/169/compare, where the values of column code changed from 51079- 0667- to 51079- 0667 in this commit, which makes all the added rows move up.

Screenshot 2024-05-14 at 9 33 20 AM

@nicktobey
Contributor

The reason I mentioned the sequence of commands is because, for instance, the diff you linked to could have happened in multiple ways:

  • An update command like UPDATE prices SET code = "51079- 0667" WHERE code = "51079- 0667-"
  • A deletion of the old rows followed by inserting the new rows.

The diff operation only sees the start and end state of the table, so it can't tell the difference between these two possibilities. It has to display the same diff for both.

One thing we could do is attempt to detect when a change was likely an update to a primary key, match each row-add to its corresponding row-remove, and display them next to each other. It means that we would do this even if the change wasn't the result of changing a primary key, as long as it could have been... but that's probably rare enough that we don't mind.

The question is how we would detect these pairs of rows and whether we can do it efficiently.
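One possible heuristic, sketched in Python purely as a starting point: treat an added row and a removed row as a pair when all of their non-key columns are identical, and index the removed rows by those values so each lookup is a hash-table probe. The `pair_key_changes` function, the exact-match rule, and the sample data are assumptions for illustration and not an existing Dolt API.

```python
from collections import defaultdict, deque


def pair_key_changes(removed, added):
    """Pair added rows with removed rows that have identical non-key values.

    removed / added: dict mapping primary-key tuple -> tuple of non-key values.
    Returns (pairs, unmatched_removed, unmatched_added).
    """
    # Index removed rows by their non-key values, in key order.
    by_values = defaultdict(deque)
    for key in sorted(removed):
        by_values[removed[key]].append(key)

    pairs, unmatched_added = [], []
    for key in sorted(added):
        candidates = by_values.get(added[key])
        if candidates:
            # Consume one removed row so it cannot be paired twice.
            pairs.append((candidates.popleft(), key))
        else:
            unmatched_added.append(key)

    unmatched_removed = sorted(k for q in by_values.values() for k in q)
    return pairs, unmatched_removed, unmatched_added


# Made-up diff: two likely key updates plus one plain delete and one plain insert.
removed = {(2, 3): ("x",), (2, 6): ("y",), (9, 9): ("z",)}
added = {(2, 4): ("x",), (2, 5): ("y",), (7, 7): ("w",)}

pairs, only_removed, only_added = pair_key_changes(removed, added)
print(pairs)         # [((2, 3), (2, 4)), ((2, 6), (2, 5))]
print(only_removed)  # [(9, 9)]
print(only_added)    # [(7, 7)]
```

Apart from the sorting, this runs in time roughly linear in the number of changed rows, but it needs the non-key values of every removed row in memory. When several removed rows share the same non-key values, the pairing is arbitrary, and a delete followed by an insert of identical values gets paired as well, as noted above.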

@liuliu-dev
Contributor Author

I see now, thanks for the clarification :)
