Challenge getting delta from two document change allowing split/merge/move text inside json nodes #46

Hideman85 · 2021-07-14T13:36:20Z

Hello community,

I'm trying to find a way to compute delta on a json tree with the minimal size possible (so with the ability to split/move/match nodes)

I first tried to implement an algorithm from scratch using the slate operations that I gather each time the editor got an update and try to optimize them in a way to get only the relevant operations at the end. It is pretty easy to combine/merge text operations, but when you come to nodes operations, the task become a real nightmare.

I first get to know google diff-match-patch for plain text that works well on planar data to find out what has been moved or replaced, but like explained it do not work well for structured data.
Continuing the research by looking json diffing libraries, but you are facing libraries that do not try to find moving/splitting nodes and results with remove everything and insert everything.
Then you can find interesting work made on structured data with the homologous format XML and understand you are facing a NP-hard problem with a lot of academic research on the subject and lot of different libs with pros and cons (XMLDiff, XyDiff, graphtage, ...).

My issue I do not found yet a library that express my need and make simple changes that could be displayed as redlining.

Simple json example from a document tree

const initial = [
  {
    "type": "p",
    "children": [
      {
        "text": "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Proin "
      },
      {
        "text": "efficitur",
        "bold": true
      },
      {
        "text": " sit amet sem a feugiat. Sed finibus eleifend elit a ultrices. Curabitur ac massa at mauris mollis varius."
      }
    ]
  }
]

const final = [
  {
    "type": "p",
    "children": [
      {
        "text": "Lorem ipsum dolor sit "
      },
      {
        "text": "amet",
        "bold": true
      },
      {
        "text": ", consectetur Hello World adipiscing elit. Proin "
      },
      {
        "text": "efficitur",
        "bold": true
      },
      {
        "text": " sit amet sem. Sed finibus "
      },
      {
        "text": "eleifend",
        "underline": true
      },
      {
        "text": " elit a ultrices. Curabitur ac massa at mauris mollis varius."
      }
    ]
  }
]

// Resulting deltas
const deltas = [
  { "op": "split", "path": [0, 0], "position": 26 },
  { "op": "split", "path": [0, 0], "position": 22 },
  { "op": "addProps", "path": [0, 1], "key": "bold", "value": true }
  ...
]

XML transposed of the json version

# initial
<document><node type="p"><text>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Proin </text><text bold="true">efficitur</text><text> sit amet sem a feugiat. Sed finibus eleifend elit a ultrices. Curabitur ac massa at mauris mollis varius.</text></node></document>

# final
<document><node type="p"><text>Lorem ipsum dolor sit </text><text bold="true">amet</text><text>, consectetur Hello World adipiscing elit. Proin </text><text bold="true">efficitur</text><text> sit amet sem. Sed finibus </text><text underline="true">eleifend</text><text> elit a ultrices. Curabitur ac massa at mauris mollis varius.</text></node></document>

Results when using XyDiff is replace everything (very old library)

<?xml version="1.0" encoding="UTF-8" standalone="no" ?>
<xy:unit_delta xmlns:xy="urn:schemas-xydiff:xydelta">
  <xy:t fromXidMap="(9-14;3-4;15-20;7-8|21)">
    <xy:d par="7" pos="3" xm="(5-6)">
      <text> sit amet sem a feugiat. Sed finibus eleifend elit a ultrices. Curabitur ac massa at mauris mollis varius.</text>
    </xy:d>
    <xy:d par="7" pos="1" xm="(1-2)">
      <text>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Proin </text>
    </xy:d>
    <xy:i par="7" pos="1" xm="(9-10)">
      <text>Lorem ipsum dolor sit </text>
    </xy:i>
    <xy:i par="7" pos="2" xm="(11-12)">
      <text bold="true">amet</text>
    </xy:i>
    <xy:i par="7" pos="3" xm="(13-14)">
      <text>, consectetur Hello World adipiscing elit. Proin </text>
    </xy:i>
    <xy:i par="7" pos="5" xm="(15-16)">
      <text> sit amet sem. Sed finibus </text>
    </xy:i>
    <xy:i par="7" pos="6" xm="(17-18)">
      <text underline="true">eleifend</text>
    </xy:i>
    <xy:i par="7" pos="7" xm="(19-20)">
      <text> elit a ultrices. Curabitur ac massa at mauris mollis varius.</text>
    </xy:i>
  </xy:t>
</xy:unit_delta>

Results when using graphtage (both json & xml version is looping infinitely):

...
WARNING:graphtage.bounds:The most recent call to <function FixedLengthSequenceEdit.tighten_bounds at 0x7f5756c38790> on <graphtage.sequences.FixedLengthSequenceEdit object at 0x7f5730e519d0> returned bounds [191, 191] when the previous bounds were [212, 457]
WARNING:graphtage.bounds:The most recent call to <function FixedLengthSequenceEdit.tighten_bounds at 0x7f5756c38790> on <graphtage.sequences.FixedLengthSequenceEdit object at 0x7f5730e519d0> returned bounds [191, 191] when the previous bounds were [212, 457]
Diffing:  31%|███████████████████████████████████████████████████▊                                                                                                                  | 111/356 [00:02<00:04, 156.74it/s]

But it works when using another sample

# initial
<catalog>
  <product type="A">
    <status>Available</status>
    <name>Ipod Mp3 Player</name>
    <description>10G HD – Windows</description>
    <price>$450</price>
  </product>
  <product>
    <status>Not available</status>
    <name>Digital Camera</name>
    <description>Canon Ixus II</description>
  </product>
</catalog> 

# final
<catalog>
  <product type="B">
    <status>Available</status>
    <name>Ipod Mp3</name>
    <truc>Player</truc>
    <description>10G HD – Windows</description>
    <price>$460</price>
  </product>
  <product>
    <status>Available</status>
    <name>Digital Camera</name>
    <description>Canon Ixus II</description>
    <price>$300</price>
  </product>
</catalog> 

# delta
<catalog>                                                                                                                                                                                                             
  <product type="̶A̶"̶ -> "̟B̟"̟>                                                                                                                                                                                           
    <status>Available</status>                                                                                                                                                                                        
        <name>Ipod Mp3 ̶P̶l̶a̶y̶e̶r̶</name>
        <̟t̟r̟u̟c̟>̟P̟l̟a̟y̟e̟r̟<̟/̟t̟r̟u̟c̟>̟
        <description>10G HD – Windows</description>
        <price>$46̟5̶0</price>
    </product>
    <product>
    <status>A̟N̶o̶t̶ ̶a̶vailable</status>
        <name>Digital Camera</name>
        <description>Canon Ixus II</description>
        <̟p̟r̟i̟c̟e̟>̟$̟3̟0̟0̟<̟/̟p̟r̟i̟c̟e̟>̟
    </product>
</catalog>

I'm wondering how can I improve the results of the diffing? What if I'm preprocessing the data in a way? Is there other libraries with a better suitable algorithm?
At the end I found this topic related to many amount of studies and academic research but do not find anything suiting my case.

I would enjoy getting some help to figuring out this complex challenge.

The text was updated successfully, but these errors were encountered:

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Challenge getting delta from two document change allowing split/merge/move text inside json nodes #46

Challenge getting delta from two document change allowing split/merge/move text inside json nodes #46

Hideman85 commented Jul 14, 2021

Challenge getting delta from two document change allowing split/merge/move text inside json nodes #46

Challenge getting delta from two document change allowing split/merge/move text inside json nodes #46

Comments

Hideman85 commented Jul 14, 2021