Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Try and preserve the structure of the html during a diff #350

Open
wants to merge 6 commits into
base: master
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Jump to
Jump to file
Failed to load files.
Diff view
Diff view
61 changes: 41 additions & 20 deletions src/lxml/html/diff.py
Expand Up @@ -6,6 +6,7 @@
from lxml import etree
from lxml.html import fragment_fromstring
import re
from collections import namedtuple

__all__ = ['html_annotate', 'htmldiff']

Expand Down Expand Up @@ -234,22 +235,39 @@ def expand_tokens(tokens, equal=False):
def merge_insert(ins_chunks, doc):
""" doc is the already-handled document (as a list of text chunks);
here we add <ins>ins_chunks</ins> to the end of that. """
# Though we don't throw away unbalanced_start or unbalanced_end
# (we assume there is accompanying markup later or earlier in the
# document), we only put <ins> around the balanced portion.
unbalanced_start, balanced, unbalanced_end = split_unbalanced(ins_chunks)
doc.extend(unbalanced_start)

ins_chunks = list(ins_chunks)
unbalanced_start, balanced, unbalanced_end = split_unbalanced(ins_chunks, with_idx=True)
unbalanced_tags = unbalanced_start + unbalanced_end

if doc and not doc[-1].endswith(' '):
# Fix up the case where the word before the insert didn't end with
# a space
doc[-1] += ' '
doc.append('<ins>')
if balanced and balanced[-1].endswith(' '):
# We move space outside of </ins>
balanced[-1] = balanced[-1][:-1]
doc.extend(balanced)
doc.append('</ins> ')
doc.extend(unbalanced_end)

for idx, chunk in enumerate(ins_chunks):
if (idx, chunk) in unbalanced_tags:
leading_space = '' if chunk.endswith(' ') else ' '
trailing_space = '' if chunk.startswith(' ') else ' '
if doc[-1].strip() == '<ins>':
doc[-1:] = [chunk, ('%s<ins>' % leading_space)]
else:
doc.extend([
('</ins>%s' % trailing_space),
chunk,
('%s<ins>' % leading_space)
])
elif (idx, chunk) in balanced:
doc.append(chunk)


if doc[-1].strip() == '<ins>':
doc.pop()
else:
if doc[-1].endswith(' '):
doc[-1] = doc[-1][:-1]
doc.append('</ins> ')

# These are sentinels to represent the start and end of a <del>
# segment, until we do the cleanup phase to turn them into proper
Expand Down Expand Up @@ -310,7 +328,9 @@ def cleanup_delete(chunks):
chunks = doc
return chunks

def split_unbalanced(chunks):
IndexedChunk = namedtuple('IndexedChunk', ('idx', 'chunk'))

def split_unbalanced(chunks, with_idx=False):
"""Return (unbalanced_start, balanced, unbalanced_end), where each is
a list of text and tag chunks.

Expand All @@ -322,31 +342,32 @@ def split_unbalanced(chunks):
end = []
tag_stack = []
balanced = []
for chunk in chunks:
for idx, chunk in enumerate(chunks):
chunk_to_add = IndexedChunk(idx, chunk) if with_idx else chunk
if not chunk.startswith('<'):
balanced.append(chunk)
balanced.append(chunk_to_add)
continue
endtag = chunk[1] == '/'
name = chunk.split()[0].strip('<>/')
if name in empty_tags:
balanced.append(chunk)
balanced.append(chunk_to_add)
continue
if endtag:
if tag_stack and tag_stack[-1][0] == name:
balanced.append(chunk)
balanced.append(chunk_to_add)
name, pos, tag = tag_stack.pop()
balanced[pos] = tag
elif tag_stack:
start.extend([tag for name, pos, tag in tag_stack])
tag_stack = []
end.append(chunk)
end.append(chunk_to_add)
else:
end.append(chunk)
end.append(chunk_to_add)
else:
tag_stack.append((name, len(balanced), chunk))
tag_stack.append((name, len(balanced), chunk_to_add))
balanced.append(None)
start.extend(
[chunk for name, pos, chunk in tag_stack])
[chunk_to_add for name, pos, chunk_to_add in tag_stack])
balanced = [chunk for chunk in balanced if chunk is not None]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The condition in this line probably doesn't work any more if with_idx=True.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Umm, wouldn't it ? Here balanced would be something like [(1, '<div>'), None, ...] if with_idx==True, so the if chunk is not None check still does the right thing.

return start, balanced, end

Expand Down
19 changes: 17 additions & 2 deletions src/lxml/html/tests/test_diff.txt
Expand Up @@ -54,8 +54,8 @@ As a special case, changing the href of a link is displayed, and
images are treated like words:

>>> pdiff('<a href="http://yahoo.com">search</a>', '<a href="http://google.com">search</a>')
<a href="http://google.com">search <ins> Link: http://google.com</ins>
<del> Link: http://yahoo.com</del> </a>
<a href="http://google.com">search <ins> Link: http://google.com
</ins> <del> Link: http://yahoo.com</del> </a>
Comment on lines -57 to +58
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was this a necessary change? It doesn't feel right.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tbh, didn't feel right to me either but this test breaks without it, and I didn't have much luck trying to figure out why. Initial instinct was that it has something to do with the re.sub in the pwrapped definition at the top of the test file (possibly the + in the re over matches ?). Apologies, I didn't try digging deeper but I'll give it a shot now.

Copy link
Author

@lonetwin lonetwin Dec 15, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the underlying cause for this appears to be an extra trailing whitespace that appears to get added to the href_token which pushes the line length of the string to over 70 (the line length limit of textwrap, used by pdiff). Here are the diffs with the current lxml code and one taken with this patchset:

current lxml diff:

<a href="http://google.com">search <ins> Link: http://google.com</ins> <del> Link: http://yahoo.com</del> </a>

diff with this patchset:
<a href="http://google.com">search <ins> Link: http://google.com </ins> <del> Link: http://yahoo.com</del> </a>

What I'm not entirely sure about is why the trailing whitespace which ideally should also be present in the current lxml diff (according to call to href_token in fixup_chunks), isn't.

In any case, would you consider this as a regression in behavior worthy of a fix or does this fix of the test case suffice ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What strikes me is that there is a space before the closing </ins> but not before the closing </del>. Given that there is a space after both of the opening <ins> and <del>, it's debatable whether the new behaviour isn't really better than the old. But I would expect both tag types to be consistent, at least.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. I'll investigate and attempt to make them consistent.

>>> pdiff('<p>Print this <img src="print.gif"></p>', '<p>Print this</p>')
<p>Print this <del><img src="print.gif"></del> </p>
>>> pdiff('<a href="http://yahoo.com">search</a>', '<a href="http://yahoo.com">search</a>')
Expand Down Expand Up @@ -87,6 +87,21 @@ Whitespace is generally ignored for the diff but preserved during the diff:
second
<ins>third</ins> </pre>

Ensure we preserve the html structure on doing the diff:

>>> a = "<div id='first'>some old text</div><div id='last'>more old text</div>"
>>> b = "<div id='first'>some old text</div><div id='middle'>and new text</div><div id='last'>more old text</div>"
>>> pdiff(a, b)
<div id="first"><ins>some old text</ins></div> <div id="middle">
<ins>and new</ins> <del>some old</del> text</div><div id="last">more
old text</div>
>>> a = "<div><p>Some text that will change</p><p>Some tags will be added</p></div>"
>>> b = "<div><div><p>Some text that has changed a bit</p><p>All of this is new</p></div></div>"
>>> pdiff(a, b)
<div><div><p>Some text that <ins>has changed a bit</ins> </p>
<p><ins>All of this is new</ins></p> </div> <del>will
change</del><p><del>Some tags will be added</del></p> </div>
Comment on lines +98 to +103
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This diff looks weird. Is it getting confused by the nested <div>s ?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's correct. In this specific case, the rendered diff looks 'ok' to me (with the insertions preceding the deletions) but yeah, I imagine in larger context it might be less than ideal. I'm going to take another stab at this.

Copy link
Author

@lonetwin lonetwin Dec 15, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

btw, fwiw, this isn't very different from the diff generated by the current code for the same input. The only difference being the closing </p> tag of the first input paragraph is placed after the </ins> instead of at the end of the changes for that paragraph.

current lxml diff:

<div><div><p>Some text that <ins>has changed a bit</ins><p><ins>All of
this is new</ins></p> </p></div> <del>will change</del><p><del>Some
tags will be added</del></p> </div>

diff generated with this patchset

<div><div><p>Some text that <ins>has changed a bit</ins> </p>
<p><ins>All of this is new</ins></p> </div> <del>will
change</del><p><del>Some tags will be added</del></p> </div>


The sixteen combinations::

First "insert start" (del start/middle/end/none):
Expand Down