New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Try and preserve the structure of the html during a diff #350
base: master
Are you sure you want to change the base?
Changes from 5 commits
ba831cf
c4c504c
b340f0a
66e89a8
f4c38c6
2c9aca8
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -54,8 +54,8 @@ As a special case, changing the href of a link is displayed, and | |
images are treated like words: | ||
|
||
>>> pdiff('<a href="http://yahoo.com">search</a>', '<a href="http://google.com">search</a>') | ||
<a href="http://google.com">search <ins> Link: http://google.com</ins> | ||
<del> Link: http://yahoo.com</del> </a> | ||
<a href="http://google.com">search <ins> Link: http://google.com | ||
</ins> <del> Link: http://yahoo.com</del> </a> | ||
Comment on lines
-57
to
+58
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Was this a necessary change? It doesn't feel right. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. tbh, didn't feel right to me either but this test breaks without it, and I didn't have much luck trying to figure out why. Initial instinct was that it has something to do with the There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. So the underlying cause for this appears to be an extra trailing whitespace that appears to get added to the current lxml diff:
diff with this patchset: What I'm not entirely sure about is why the trailing whitespace which ideally should also be present in the current lxml diff (according to call to In any case, would you consider this as a regression in behavior worthy of a fix or does this fix of the test case suffice ? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. What strikes me is that there is a space before the closing There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I agree. I'll investigate and attempt to make them consistent. |
||
>>> pdiff('<p>Print this <img src="print.gif"></p>', '<p>Print this</p>') | ||
<p>Print this <del><img src="print.gif"></del> </p> | ||
>>> pdiff('<a href="http://yahoo.com">search</a>', '<a href="http://yahoo.com">search</a>') | ||
|
@@ -87,6 +87,13 @@ Whitespace is generally ignored for the diff but preserved during the diff: | |
second | ||
<ins>third</ins> </pre> | ||
|
||
Ensure we don't preserve the structure of the html during the diff: | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. "don't" ? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Ah sorry, that was a brain fart ! Good catch. I think I might have wanted to add something like 'Ensure we don't mess with the structure ...' and instead decided to go with the opposite phrasing 'Ensue we preserve ...' but obviously messed up. I'll fix in the next commit. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Done |
||
|
||
>>> a = "<div id='first'>some old text</div><div id='last'>more old text</div>" | ||
>>> b = "<div id='first'>some old text</div><div id='middle'>and new text</div><div id='last'>more old text</div>" | ||
>>> print(htmldiff(a, b)) | ||
<div id="first"><ins>some old text</ins></div> <div id="middle"> <ins>and new</ins> <del>some old</del> text</div><div id="last">more old text</div> | ||
|
||
The sixteen combinations:: | ||
|
||
First "insert start" (del start/middle/end/none): | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems risky. I'm not sure that we can assume chunks to be unique enough to just do an "in" test here.
Maybe it could help to put the three parts back together, but give each chunk a marker like
(chunk, "balanced")
and(chunk, "unbalanced")
, and then do something based on the marker. Or look at the index boundaries where we switch between the unbalanced and balanced parts.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, agreed that we can't assume the chunks to be unique enough to just do an 'in' test here. I'll think about your suggestion and give this another go. Thanks for your review.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We ran into an issue rising out of precisely the problem you caught ! The visible effect being that the
<ins>
tags were not being added at places where they ought to be.I've updated the approach by changing
split_unbalanced
to optionally return a list of tuples containing the(<index-in-the-input=chunks-list>, <chunk>)
. This then serves to uniquely identify the chunk across the balanced/unbalanced lists ...which then makes thein
operator reliable.I've defaulted to retain the existing behavior of
split_unbalanced
since it is also called fromcleanup_delete
and I didn't want to touch that function.Hopefully, you find the changes suitable.