Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Challenge getting delta from two document change allowing split/merge/move text inside json nodes #46

Open
Hideman85 opened this issue Jul 14, 2021 · 0 comments

Comments

@Hideman85
Copy link

Hello community,

I'm trying to find a way to compute delta on a json tree with the minimal size possible (so with the ability to split/move/match nodes)

I first tried to implement an algorithm from scratch using the slate operations that I gather each time the editor got an update and try to optimize them in a way to get only the relevant operations at the end. It is pretty easy to combine/merge text operations, but when you come to nodes operations, the task become a real nightmare.

I first get to know google diff-match-patch for plain text that works well on planar data to find out what has been moved or replaced, but like explained it do not work well for structured data.
Continuing the research by looking json diffing libraries, but you are facing libraries that do not try to find moving/splitting nodes and results with remove everything and insert everything.
Then you can find interesting work made on structured data with the homologous format XML and understand you are facing a NP-hard problem with a lot of academic research on the subject and lot of different libs with pros and cons (XMLDiff, XyDiff, graphtage, ...).

My issue I do not found yet a library that express my need and make simple changes that could be displayed as redlining.

Simple json example from a document tree
const initial = [
  {
    "type": "p",
    "children": [
      {
        "text": "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Proin "
      },
      {
        "text": "efficitur",
        "bold": true
      },
      {
        "text": " sit amet sem a feugiat. Sed finibus eleifend elit a ultrices. Curabitur ac massa at mauris mollis varius."
      }
    ]
  }
]

const final = [
  {
    "type": "p",
    "children": [
      {
        "text": "Lorem ipsum dolor sit "
      },
      {
        "text": "amet",
        "bold": true
      },
      {
        "text": ", consectetur Hello World adipiscing elit. Proin "
      },
      {
        "text": "efficitur",
        "bold": true
      },
      {
        "text": " sit amet sem. Sed finibus "
      },
      {
        "text": "eleifend",
        "underline": true
      },
      {
        "text": " elit a ultrices. Curabitur ac massa at mauris mollis varius."
      }
    ]
  }
]

// Resulting deltas
const deltas = [
  { "op": "split", "path": [0, 0], "position": 26 },
  { "op": "split", "path": [0, 0], "position": 22 },
  { "op": "addProps", "path": [0, 1], "key": "bold", "value": true }
  ...
]
XML transposed of the json version
# initial
<document><node type="p"><text>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Proin </text><text bold="true">efficitur</text><text> sit amet sem a feugiat. Sed finibus eleifend elit a ultrices. Curabitur ac massa at mauris mollis varius.</text></node></document>

# final
<document><node type="p"><text>Lorem ipsum dolor sit </text><text bold="true">amet</text><text>, consectetur Hello World adipiscing elit. Proin </text><text bold="true">efficitur</text><text> sit amet sem. Sed finibus </text><text underline="true">eleifend</text><text> elit a ultrices. Curabitur ac massa at mauris mollis varius.</text></node></document>
Results when using XyDiff is replace everything (very old library)
<?xml version="1.0" encoding="UTF-8" standalone="no" ?>
<xy:unit_delta xmlns:xy="urn:schemas-xydiff:xydelta">
  <xy:t fromXidMap="(9-14;3-4;15-20;7-8|21)">
    <xy:d par="7" pos="3" xm="(5-6)">
      <text> sit amet sem a feugiat. Sed finibus eleifend elit a ultrices. Curabitur ac massa at mauris mollis varius.</text>
    </xy:d>
    <xy:d par="7" pos="1" xm="(1-2)">
      <text>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Proin </text>
    </xy:d>
    <xy:i par="7" pos="1" xm="(9-10)">
      <text>Lorem ipsum dolor sit </text>
    </xy:i>
    <xy:i par="7" pos="2" xm="(11-12)">
      <text bold="true">amet</text>
    </xy:i>
    <xy:i par="7" pos="3" xm="(13-14)">
      <text>, consectetur Hello World adipiscing elit. Proin </text>
    </xy:i>
    <xy:i par="7" pos="5" xm="(15-16)">
      <text> sit amet sem. Sed finibus </text>
    </xy:i>
    <xy:i par="7" pos="6" xm="(17-18)">
      <text underline="true">eleifend</text>
    </xy:i>
    <xy:i par="7" pos="7" xm="(19-20)">
      <text> elit a ultrices. Curabitur ac massa at mauris mollis varius.</text>
    </xy:i>
  </xy:t>
</xy:unit_delta>

Results when using graphtage (both json & xml version is looping infinitely):

...
WARNING:graphtage.bounds:The most recent call to <function FixedLengthSequenceEdit.tighten_bounds at 0x7f5756c38790> on <graphtage.sequences.FixedLengthSequenceEdit object at 0x7f5730e519d0> returned bounds [191, 191] when the previous bounds were [212, 457]
WARNING:graphtage.bounds:The most recent call to <function FixedLengthSequenceEdit.tighten_bounds at 0x7f5756c38790> on <graphtage.sequences.FixedLengthSequenceEdit object at 0x7f5730e519d0> returned bounds [191, 191] when the previous bounds were [212, 457]
Diffing:  31%|███████████████████████████████████████████████████▊                                                                                                                  | 111/356 [00:02<00:04, 156.74it/s]
But it works when using another sample
# initial
<catalog>
  <product type="A">
    <status>Available</status>
    <name>Ipod Mp3 Player</name>
    <description>10G HD – Windows</description>
    <price>$450</price>
  </product>
  <product>
    <status>Not available</status>
    <name>Digital Camera</name>
    <description>Canon Ixus II</description>
  </product>
</catalog> 

# final
<catalog>
  <product type="B">
    <status>Available</status>
    <name>Ipod Mp3</name>
    <truc>Player</truc>
    <description>10G HD – Windows</description>
    <price>$460</price>
  </product>
  <product>
    <status>Available</status>
    <name>Digital Camera</name>
    <description>Canon Ixus II</description>
    <price>$300</price>
  </product>
</catalog> 

# delta
<catalog>                                                                                                                                                                                                             
  <product type="̶A̶"̶ -> "̟B̟"̟>                                                                                                                                                                                           
    <status>Available</status>                                                                                                                                                                                        
        <name>Ipod Mp3 ̶P̶l̶a̶y̶e̶r̶</name>
        <̟t̟r̟u̟c̟>̟P̟l̟a̟y̟e̟r̟<̟/̟t̟r̟u̟c̟>̟
        <description>10G HD – Windows</description>
        <price>$46̟5̶0</price>
    </product>
    <product>
    <status>A̟N̶o̶t̶ ̶a̶vailable</status>
        <name>Digital Camera</name>
        <description>Canon Ixus II</description>
        <̟p̟r̟i̟c̟e̟>̟$̟3̟0̟0̟<̟/̟p̟r̟i̟c̟e̟>̟
    </product>
</catalog>

I'm wondering how can I improve the results of the diffing? What if I'm preprocessing the data in a way? Is there other libraries with a better suitable algorithm?
At the end I found this topic related to many amount of studies and academic research but do not find anything suiting my case.

I would enjoy getting some help to figuring out this complex challenge.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant