Smallest Edit Distance #50

FHantke opened this issue Jul 27, 2022 · 4 comments

FHantke commented Jul 27, 2022

Hi, thank you for this great tool.
I wonder whether it is somehow possible to get the smallest edit distance?
For instance, in the following example, one could just add the missing node instead of changing all the children.
Json A:

{
  "list": [
    {"a": [{"x": "y"}]},
    {"b": [{"x": "y"}]},
    {"c": [{"x": "y"}]},
    {"d": [{"x": "y"}]},
    {"e": [{"x": "y"}]}
  ]
}

Json B:

{
  "list": [
    {"a": [{"x": "y"}]},
    {"b": [{"x": "y"}]},
    {"d": [{"x": "y"}]},
    {"e": [{"x": "y"}]}
  ]
}

Output:

@ ["list",4]
- {"e":[{"x":"y"}]}
@ ["list",3,"d"]
- [{"x":"y"}]
@ ["list",3,"e"]
+ [{"x":"y"}]
@ ["list",2,"c"]
- [{"x":"y"}]
@ ["list",2,"d"]
+ [{"x":"y"}]

Thank you,
FHantke

josephburnett (Owner) commented

This is something I've been thinking about for a while in the background. There are a few interesting choices to make in the design and implementation. One key design input is our working definition of "minimal". Are we trying to minimize /1/ the number of diff hunks (unique paths with + and - after), /2/ the size of the JSON payload in each hunk (line width), or /3/ the number of "nodes" we touch in the overall structure?

To minimize the number of diff hunks we could simply replace the entire document. One hunk with a root path, a big - and a big +. But that's not the point of taking a diff and applying a patch. So we need some other aspect of "minimal".

I think minimizing the JSON payload and minimizing the number of "nodes" we touch are roughly the same thing, but I might be wrong. Jd already does this by recursing as deep as possible into the structure to find differences. But as you point out, it misses the big picture. Sometimes it's easier (more "minimal") to replace a higher node. E.g. [1,2,3] to [3,1,2] should be one hunk rather than three. And sometimes describing an "insert" or "removal" is easier than describing the rest of the array. The same goes for objects. When is it easier to replace the whole object than to replace its elements one-by-one?
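
For illustration, today jd renders [1,2,3] to [3,1,2] as three hunks, roughly:

@ [0]
- 1
+ 3
@ [1]
- 2
+ 1
@ [2]
- 3
+ 2

whereas replacing the whole array at the higher node would be a single hunk:

@ []
- [1,2,3]
+ [3,1,2]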

Let's start with arrays and see how far that gets us. We can change our minds later as long as we consistently produce valid diffs.

Jd operates in a "strict" mode by default where it asserts "context" when producing diffs. That's why a deletion includes the exact data to delete, and updates include both - and + to assert what is being replaced. Jd will produce a strict diff by default with no additions or deletions in the middle of an array. We can add and delete from the middle of an array, shifting the contents (added as part of JSON Patch RFC 6902 support). But that would not provide strict context and would not be idempotent (you could accidentally apply a patch twice!). So we need a concept of "context" for mid-array insertions.
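
For example, a context-free mid-array insertion like this (illustrative syntax) shifts the contents when applied:

@ ["list",2]
+ {"c":[{"x":"y"}]}

Apply it once and {"c":...} lands at index 2; apply it again by accident and you get a second copy, because nothing in the hunk asserts what should already be at that position.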

I propose we include (part of) the 8-byte hash codes of the values before and after the change as metadata on the numerical index. This would restore idempotence while producing backward compatible diffs (unknown metadata is ignored).
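
As a strawman rendering (syntax and hash values entirely made up):

@ ["list",{"before":"8c3f09ab","after":"1b9a77e4"},2]
+ {"c":[{"x":"y"}]}

A patching tool would refuse to insert unless the neighboring values still hash to those codes, so applying the same patch a second time fails cleanly instead of inserting a duplicate.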

The next question would be when to produce such context. A standard way to produce a minimal diff of line-oriented data is to calculate the longest common subsequence (LCS) of lines (treating the entire line as a token) and then emit a diff of additions and deletions outside that sequence. We could do the same for arrays (in list mode; sets and multisets have different semantics) by calculating hash codes for all values in the array. This would produce a much nicer diff in your example, inserting an object into the middle of an array of objects.

However, it also means recursion stops at arrays. Small edits to objects in arrays will replace the entire object (the hash codes won't match). So we may choose to further optimize by emitting pure additions and deletions by LCS, but recursing when an object is replaced by another object.
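
Roughly, the array part could look like this throwaway sketch (not jd code; element hashes stand in for full value comparison, indices are naive per-side positions, and the object-recursion refinement is left out):

package main

import (
	"crypto/sha256"
	"encoding/json"
	"fmt"
)

// hashToken reduces an arbitrary JSON value to a short comparable token.
// encoding/json sorts map keys, so equal objects marshal (and hash) equally.
func hashToken(v interface{}) [8]byte {
	b, _ := json.Marshal(v)
	sum := sha256.Sum256(b)
	var t [8]byte
	copy(t[:], sum[:8])
	return t
}

// lcsTable fills the standard O(n*m) longest-common-subsequence length table.
func lcsTable(a, b [][8]byte) [][]int {
	t := make([][]int, len(a)+1)
	for i := range t {
		t[i] = make([]int, len(b)+1)
	}
	for i := 1; i <= len(a); i++ {
		for j := 1; j <= len(b); j++ {
			switch {
			case a[i-1] == b[j-1]:
				t[i][j] = t[i-1][j-1] + 1
			case t[i-1][j] >= t[i][j-1]:
				t[i][j] = t[i-1][j]
			default:
				t[i][j] = t[i][j-1]
			}
		}
	}
	return t
}

// diffArrays walks the table backwards, keeping common elements and
// emitting "-" for elements only in a and "+" for elements only in b.
func diffArrays(a, b []interface{}) []string {
	ha := make([][8]byte, len(a))
	hb := make([][8]byte, len(b))
	for i, v := range a {
		ha[i] = hashToken(v)
	}
	for i, v := range b {
		hb[i] = hashToken(v)
	}
	t := lcsTable(ha, hb)
	var ops []string
	i, j := len(a), len(b)
	for i > 0 || j > 0 {
		switch {
		case i > 0 && j > 0 && ha[i-1] == hb[j-1]:
			i, j = i-1, j-1 // common element, no hunk
		case j > 0 && (i == 0 || t[i][j-1] >= t[i-1][j]):
			j--
			v, _ := json.Marshal(b[j])
			ops = append([]string{fmt.Sprintf("@ [%d]\n+ %s", j, v)}, ops...)
		default:
			i--
			v, _ := json.Marshal(a[i])
			ops = append([]string{fmt.Sprintf("@ [%d]\n- %s", i, v)}, ops...)
		}
	}
	return ops
}

func main() {
	var a, b []interface{}
	json.Unmarshal([]byte(`[{"a":1},{"b":2},{"c":3},{"d":4}]`), &a)
	json.Unmarshal([]byte(`[{"a":1},{"b":2},{"d":4}]`), &b)
	for _, op := range diffArrays(a, b) {
		fmt.Println(op) // prints a single "- {"c":3}" hunk at index 2
	}
}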

What do you think so far?

FHantke (Author) commented Aug 8, 2022

Thanks for the detailed answer. Let me describe my use case first, before I go into your answer. In my current project, I need to find difference categories between numerous large JSON objects. Since similar differences appear in various JSON objects multiple times and I cannot manually evaluate all of them, I use jd to show me the differences so I can label them. When I compare other objects and the same difference appears again, the plan is that my tool labels them correctly by using the data I labeled before. Of course, I would like to have as few unnecessary differences to label as possible, which is why I asked about the edit distance. (If you have an idea how to best label differences to distinguish them later, let me know.)

Now, back to your questions. The first question is indeed interesting and one I had not thought about yet. For my use case, I think finding the solution with the smallest JSON payload per hunk is the fitting one, as it provides more detail on the various differences than one big hunk. One big hunk could (in my case) contain two difference categories, which would mean I lose information. If two payloads have the same length, the one with the fewest diff hunks should be preferred. Maybe a paper I found during my research could help with this question. I only skimmed it, yet it sounds like they had similar problems.

I am not quite sure I understand your definition of "nodes". Would option 3 not be the same as option 1? Every node we modify in the JSON tree ends up as a diff hunk in the output, right? This would mean that minimizing diff hunks is similar to minimizing the number of nodes we touch (which is the opposite of minimizing the payload in each hunk)?

Your proposal with the hashes sounds fine, I guess (but I'm not good with algorithms). However, if the recursion stops at arrays, a better optimization is needed. Perhaps the algorithm could check whether the array contains objects or not? Otherwise, an array of large objects with one small difference in a leaf node would result in the replacement of the entire large object.

One other thing I have in mind, which I cannot estimate, is the complexity of the algorithm. When I was researching JSON comparison tools, I decided to use jd since it was able to compare my large objects; other tools never finished. How does the hashing approach influence the performance?

josephburnett (Owner) commented

If the same difference (edit) appears in multiple objects in an array, the new and old values should be identical, so you could just compare those. Are you using jd as a golang library, or a command-line tool?
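
For reference, using jd as a Go library looks roughly like this (a minimal sketch; see the README for the exact API):

package main

import (
	"fmt"

	jd "github.com/josephburnett/jd/lib"
)

func main() {
	a, _ := jd.ReadJsonString(`{"list":[{"a":[{"x":"y"}]}]}`)
	b, _ := jd.ReadJsonString(`{"list":[{"a":[{"x":"z"}]}]}`)
	// Rendered hunks with identical paths and -/+ values can be
	// compared as plain strings to cluster repeated differences.
	fmt.Print(a.Diff(b).Render())
}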

Maybe we minimize the number of hunks produced from an array (minimum edit distance) but minimize the size of hunks within objects. Thanks for the pointer to that paper! They aren't using the same edit operations as jd, but it might be useful.

Yeah, options 1 and 3 are the same, you're right. So it's basically size of hunks versus number of hunks. I'll try to strike a balance as described above ^^^.

Implementing the longest common subsequence algorithm will go above O(n); the classic dynamic-programming LCS is O(n*m) in the lengths of the two arrays. I'll have to measure it in practice to see how much that matters, and performance on real data sets will matter too. Can you tell me how many objects you are processing in a single array? And how big those objects are? What are your time constraints?

FHantke (Author) commented Sep 26, 2022

Hi, I am currently using the command-line tool, since my pipeline is coded in Python and I just call jd as a subprocess.

I attached three example files (two that are similar and one additional one). My time constraints are not super critical, I just don't want to wait more than a minute for one comparison, as I'm comparing many thousands of objects in parallel :)
