Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Try and preserve the structure of the html during a diff #350

Open
wants to merge 6 commits into
base: master
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Jump to
Jump to file
Failed to load files.
Diff view
Diff view
33 changes: 19 additions & 14 deletions src/lxml/html/diff.py
Expand Up @@ -6,6 +6,7 @@
from lxml import etree
from lxml.html import fragment_fromstring
import re
from collections import namedtuple

__all__ = ['html_annotate', 'htmldiff']

Expand Down Expand Up @@ -236,7 +237,7 @@ def merge_insert(ins_chunks, doc):
here we add <ins>ins_chunks</ins> to the end of that. """

ins_chunks = list(ins_chunks)
unbalanced_start, balanced, unbalanced_end = split_unbalanced(ins_chunks)
unbalanced_start, balanced, unbalanced_end = split_unbalanced(ins_chunks, with_idx=True)
unbalanced_tags = unbalanced_start + unbalanced_end

if doc and not doc[-1].endswith(' '):
Expand All @@ -245,10 +246,8 @@ def merge_insert(ins_chunks, doc):
doc[-1] += ' '
doc.append('<ins>')

for chunk in ins_chunks:
if chunk in balanced:
doc.append(chunk)
elif chunk in unbalanced_tags:
for idx, chunk in enumerate(ins_chunks):
if (idx, chunk) in unbalanced_tags:
leading_space = '' if chunk.endswith(' ') else ' '
trailing_space = '' if chunk.startswith(' ') else ' '
if doc[-1].strip() == '<ins>':
Expand All @@ -259,6 +258,9 @@ def merge_insert(ins_chunks, doc):
chunk,
('%s<ins>' % leading_space)
])
elif (idx, chunk) in balanced:
doc.append(chunk)


if doc[-1].strip() == '<ins>':
doc.pop()
Expand Down Expand Up @@ -326,7 +328,9 @@ def cleanup_delete(chunks):
chunks = doc
return chunks

def split_unbalanced(chunks):
IndexedChunk = namedtuple('IndexedChunk', ('idx', 'chunk'))

def split_unbalanced(chunks, with_idx=False):
"""Return (unbalanced_start, balanced, unbalanced_end), where each is
a list of text and tag chunks.

Expand All @@ -338,31 +342,32 @@ def split_unbalanced(chunks):
end = []
tag_stack = []
balanced = []
for chunk in chunks:
for idx, chunk in enumerate(chunks):
chunk_to_add = IndexedChunk(idx, chunk) if with_idx else chunk
if not chunk.startswith('<'):
balanced.append(chunk)
balanced.append(chunk_to_add)
continue
endtag = chunk[1] == '/'
name = chunk.split()[0].strip('<>/')
if name in empty_tags:
balanced.append(chunk)
balanced.append(chunk_to_add)
continue
if endtag:
if tag_stack and tag_stack[-1][0] == name:
balanced.append(chunk)
balanced.append(chunk_to_add)
name, pos, tag = tag_stack.pop()
balanced[pos] = tag
elif tag_stack:
start.extend([tag for name, pos, tag in tag_stack])
tag_stack = []
end.append(chunk)
end.append(chunk_to_add)
else:
end.append(chunk)
end.append(chunk_to_add)
else:
tag_stack.append((name, len(balanced), chunk))
tag_stack.append((name, len(balanced), chunk_to_add))
balanced.append(None)
start.extend(
[chunk for name, pos, chunk in tag_stack])
[chunk_to_add for name, pos, chunk_to_add in tag_stack])
balanced = [chunk for chunk in balanced if chunk is not None]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The condition in this line probably doesn't work any more if with_idx=True.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Umm, wouldn't it ? Here balanced would be something like [(1, '<div>'), None, ...] if with_idx==True, so the if chunk is not None check still does the right thing.

return start, balanced, end

Expand Down
14 changes: 11 additions & 3 deletions src/lxml/html/tests/test_diff.txt
Expand Up @@ -87,12 +87,20 @@ Whitespace is generally ignored for the diff but preserved during the diff:
second
<ins>third</ins> </pre>

Ensure we don't preserve the structure of the html during the diff:
Ensure we preserve the html structure on doing the diff:

>>> a = "<div id='first'>some old text</div><div id='last'>more old text</div>"
>>> b = "<div id='first'>some old text</div><div id='middle'>and new text</div><div id='last'>more old text</div>"
>>> print(htmldiff(a, b))
<div id="first"><ins>some old text</ins></div> <div id="middle"> <ins>and new</ins> <del>some old</del> text</div><div id="last">more old text</div>
>>> pdiff(a, b)
<div id="first"><ins>some old text</ins></div> <div id="middle">
<ins>and new</ins> <del>some old</del> text</div><div id="last">more
old text</div>
>>> a = "<div><p>Some text that will change</p><p>Some tags will be added</p></div>"
>>> b = "<div><div><p>Some text that has changed a bit</p><p>All of this is new</p></div></div>"
>>> pdiff(a, b)
<div><div><p>Some text that <ins>has changed a bit</ins> </p>
<p><ins>All of this is new</ins></p> </div> <del>will
change</del><p><del>Some tags will be added</del></p> </div>
Comment on lines +98 to +103
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This diff looks weird. Is it getting confused by the nested <div>s ?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's correct. In this specific case, the rendered diff looks 'ok' to me (with the insertions preceding the deletions) but yeah, I imagine in larger context it might be less than ideal. I'm going to take another stab at this.

Copy link
Author

@lonetwin lonetwin Dec 15, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

btw, fwiw, this isn't very different from the diff generated by the current code for the same input. The only difference being the closing </p> tag of the first input paragraph is placed after the </ins> instead of at the end of the changes for that paragraph.

current lxml diff:

<div><div><p>Some text that <ins>has changed a bit</ins><p><ins>All of
this is new</ins></p> </p></div> <del>will change</del><p><del>Some
tags will be added</del></p> </div>

diff generated with this patchset

<div><div><p>Some text that <ins>has changed a bit</ins> </p>
<p><ins>All of this is new</ins></p> </div> <del>will
change</del><p><del>Some tags will be added</del></p> </div>


The sixteen combinations::

Expand Down