Stop breaking surrogate pairs in toDelta()/fromDelta() #80

dmsnell · 2019-11-09T05:07:44Z

Fixes #10, #59, #68
Alternate to #69

Fixes for Java, JavaScript, Objective C, Python2, Python3

Status

Please scrutinize the code and think of any test cases not covered already in this patch.
~~We've deployed these changes in Simplenote and are uncovering any issues we can find.~~ We've been running the code in this patch for years now and no issues or regressions have appeared.
Please let us know your thoughts on the direction here. This is a pragmatic patch in that it's addressing the problem where it's at and tries to impose no new constraints on the library where possible.
Since we haven't received any direction it's my intention to merge this into my fork when done and start referencing dmsnell/diff-match-patch instead of google/diff-match-patch. I'd prefer merging these fixes upstream and would happily do the work required if @NeilFraser provides guidance or opinions or a simple 👍.
Everything in this patch improves current problems. Even though this doesn't resolve all of the problems breaking surrogate pairs everywhere it's only a net win for many use cases. I'm working on fixing this everywhere though, but don't see any reason to

Todo:

Major changes

in toDelta() we no longer attempt to create URI-encoded strings with split surrogates. this was either crashing client applications or corrupting the data, depending on which platform was in use.
in fromDelta() we are recovering when possible from corrupted delta strings produced by unpatched versions of this library. there are two fixes:
- sometimes we encode surrogate-pair halves as if they were valid Unicode (they aren't). we'll now decode those anyway because after applying a diff with these invalid code points we should re-join the surrogate halves and get valid Unicode out
- in unpatched ObjectiveC libraries we might have created invalid diffs where we send (null) in between two surrogate halves. this is caused by the original bug and we might receive these sequences in a delta string: (high surrogate)(null)(low surrogate). If we leave these in here then diff-match-patch will operate fine but it will send invalid Unicode that it created onto the consuming application. in this patch I've added guards against this very specific pattern so that we can remove the unintentionally injected (null) and re-join the surrogate halves. this means that we're mangling the input but that input would have already been mangled anyway and presumably it'd be "impossible" to send the same sequence of code units to diff-match-patch since something would have had to send invalid Unicode in the first place.

Working on this bug has surfaced a bigger question about what diff-match-patch expects. All of the Simplenote applications work at the level of UTF-16 code units and a large part of that is because this is how most of the diff-match-patch platforms work.

Something seems ideal about referencing Unicode code points in diff-match-patch instead of any particular encoding, but to change now seems like it would break way more than it would fix. I'd like to propose that we think about creating a new output version which could be selected to specify that diffs do that.

diff_toDelta(
  diff_main( "🅰🅲", "🅰🅱🅲" )
  // no specification means existing behavior
) // "=2	+%F0%9F%85%B1	=2";

diff_toDelta(
  diff_main( "🅰🅲", "🅰🅱🅲" ),
  { unit: 'native' } // existing behavior - whatever the unit of a string is
) // "=?	+%F0%9F%85%B1	=?";

diff_toDelta(
  diff_main( "🅰🅲", "🅰🅱🅲" ),
  { unit: 'ucs2' } // make UCS2 the "default" since most libraries used it already, or something
) // "=2	+%F0%9F%85%B1	=2";

diff_toDelta(
  diff_main( "🅰🅲", "🅰🅱🅲" ),
  { unit: 'codePoint' } // empty addition group at the beginning to denote Unicode indices
) // "+	=1	+%F0%9F%85%B1	=1"; // empty group ignored by old clients

Sometimes we can find a common prefix that runs into the middle of a
surrogate pair and we split that pair when building our diff groups.

This is fine as long as we are operating on UTF-16 code units. It
becomes problematic when we start trying to treat those substrings as
valid Unicode (or UTF-8) sequences.

When we pass these split groups into toDelta() we do just that and the
library crashes. In this patch we're post-processing the diff groups
before encoding them to make sure that we un-split the surrogate pairs.

The post-processed diffs should produce the same output when applying
the diffs. The diff string itself will be different but shouldn't change
that much - only by a single character at surrogate boundaries.

Notes

This is my first contribution to this repo and I'm obviously not familiar with the style and this patch is obviously incomplete. My purpose is to propose code with working tests to demonstrate the change and I expect to clean up any remaining details that need resolving.
I've pulled in the surrogate tests that @josephrocca added and they are all passing. In JS: Handle surrogate pairs correctly #69 the issue causing these failed tests was using isSurrogate() for all cases when we should have been using isHighSurrogate() and isLowSurragate() depending on the context.

Thanks for any and all help and feedback.

See google/diff-match-patch#80 There have been certain cases where consecutive surrogate pairs crash `diff-match-patch` when it's building the _delta string_. This is because the diffing algorithm finds a common prefix up to and including half of the surrogate pair match when three consecutive code points share the same high surrogate. ``` // before - \ud83d\ude4b\ud83d\ue4b // after - \ud83d\ude4b\ud83d\ue4c\ud83d\ude4b // deltas - eq \ud83d\ude4b\ud83d -> add \ude4c -> eq \ud83d \ude4b ``` This crashes when trying to encode the invalid sequence of UTF-16 code units into `URIEncode`, which expects a valid Unicode string. After the fix the delta groups are normalized to prevent this situation and the `node-simperium` library should be able to process those problematic changesets.

See: Simperium/node-simperium#91 See: google/diff-match-patch#80 There have been certain cases where consecutive surrogate pairs crash `diff-match-patch` when it's building the _delta string_. This is because the diffing algorithm finds a common prefix up to and including half of the surrogate pair match when three consecutive code points share the same high surrogate. ``` // before - \ud83d\ude4b\ud83d\ue4b // after - \ud83d\ude4b\ud83d\ue4c\ud83d\ude4b // deltas - eq \ud83d\ude4b\ud83d -> add \ude4c -> eq \ud83d \ude4b ``` This crashes when trying to encode the invalid sequence of UTF-16 code units into `URIEncode`, which expects a valid Unicode string. After the fix the delta groups are normalized to prevent this situation and the `node-simperium` library should be able to process those problematic changesets. This patch updates the dependency on the patched version of `node-simperium`.

See google/diff-match-patch#80 There have been certain cases where consecutive surrogate pairs crash `diff-match-patch` when it's building the _delta string_. This is because the diffing algorithm finds a common prefix up to and including half of the surrogate pair match when three consecutive code points share the same high surrogate. ``` // before - \ud83d\ude4b\ud83d\ue4b // after - \ud83d\ude4b\ud83d\ue4c\ud83d\ude4b // deltas - eq \ud83d\ude4b\ud83d -> add \ude4c -> eq \ud83d \ude4b ``` This crashes when trying to encode the invalid sequence of UTF-16 code units into `URIEncode`, which expects a valid Unicode string. After the fix the delta groups are normalized to prevent this situation and the `node-simperium` library should be able to process those problematic changesets.

* Fix: No more crashing/skipped changes for certain changes See: Simperium/node-simperium#91 See: google/diff-match-patch#80 There have been certain cases where consecutive surrogate pairs crash `diff-match-patch` when it's building the _delta string_. This is because the diffing algorithm finds a common prefix up to and including half of the surrogate pair match when three consecutive code points share the same high surrogate. ``` // before - \ud83d\ude4b\ud83d\ue4b // after - \ud83d\ude4b\ud83d\ue4c\ud83d\ude4b // deltas - eq \ud83d\ude4b\ud83d -> add \ude4c -> eq \ud83d \ude4b ``` This crashes when trying to encode the invalid sequence of UTF-16 code units into `URIEncode`, which expects a valid Unicode string. After the fix the delta groups are normalized to prevent this situation and the `node-simperium` library should be able to process those problematic changesets. This patch updates the dependency on the patched version of `node-simperium`.

dmsnell · 2019-12-05T16:58:45Z

@NeilFraser is there anything I can do to help this patch/issue move along?

dmsnell · 2019-12-13T06:56:29Z

Quick update for those following: we identified issues when the surrogate pairs were in DIFF_DELETE groups which led to crashing behaviors.

I've updated the PR with an improved patch to the JS/Java libraries and hope to port the changes to the other libraries tomorrow.

In this update I've included a custom decodeURI. The Python library happily URI-encodes a string that isn't valid UTF-8 and that causes other libraries to crash. Well, it's fine by us if we treat half of a surrogate pair as its own code point. The illegitimate part comes because those code points are reserved for surrogate pairs, not because there's no way to decode the value as if it were a real code point.

This custom decodeURI makes it possible for clients to handle diff deltas produced by libraries that don't have the fixes in this PR.

NeilFraser · 2019-12-17T05:00:19Z

Q: What happens in the different languages when diffing "人" vs "中"? JavaScript should just say delete one character, insert another. But what about Python?

dmsnell · 2019-12-17T19:47:19Z

Thank you so much @NeilFraser for the response!

Q: What happens in the different languages when diffing "人" vs "中"? JavaScript should just say delete one character, insert another. But what about Python?

Right now it appears like diff-match-patch leaves this unspecified. The Lua library of course makes this plain and clear.

It doesn't even need to be a question about language though. Python2 is internally inconsistent because if Python is compiled in narrow mode it returns different numbers than if Python is compiled in wide mode.

In my patch I'm normalizing doDelta on UTF-16 code units because that seems most consistent with most libraries in most situations. Fundamentally I don't see a way to provide full backwards compatibility on this change because there was no backwards behavior - or the behavior was already backwards 🙃

In other words before my patch it's more different across libraries and platforms than after it. After this patch the languages which have been implemented here should all return the same delta - delete one character, insert another.

In tests across JavaScript, Java, Objective-C, Python2, Python3

dmsnell · 2019-12-20T23:42:14Z

@NeilFraser is there a way to run the suite of relative performance diffs or is that simply generated from running speedtest in each language/platform?

I've been running speedtest on this PR and I haven't seen anything significant in terms of performance. I think I've buttoned up about as much as we can with the bug.

the only real major issue is the question of what the indices should be, and I've tried to not change that other than in Python2 compiled in wide mode and in Python 3. we can back out the Python 3 change if we need to preserve the behavior. the other languages/platforms in this patch all previously returned UCS-2 code units.

salzhrani · 2020-01-14T12:24:53Z

We also faced an issue with surrogate pairs here where the substring breaking doesn't account for surrogate pairs.

In 1.0.2 we applied an update from google/diff-match-patch#80 to resolve the surrogate pair encoding issue. Since that time we found a bug in the fix and resolved that and therefore this patch brings those additional updates to Simeprium. There previously remained issues when decoding broken patches that already existed or which came from unpatched versions of the iOS library and also remained issues when unexpectedly receiving empty diff groups in `toDelta`

dmsnell · 2020-03-13T02:02:22Z

@NeilFraser what can we do to be helpful here? We've been running with these changes in Simplenote for more than a month and it eliminated all of the encoding issues we were experiencing.

I know that there is still a "better option" of adjusting the way that we break strings up so that we never get broken surrogate halves, but from a practicality and simplicity standpoint I think this is solid (maybe there are bugs we've silently missed).

dmsnell · 2020-03-29T06:55:23Z

When reading through a Rust implementation I found that the author created a separate cleanup_char_boundary(diffs) function to adjust the output when the diff groups would otherwise split. that's a method very close to this one except it occurs in the diffing stage instead of the delta production.

https://github.com/dtolnay/dissimilar/blob/master/src/lib.rs#L402

mustafamunawar · 2020-12-21T00:24:22Z

@dmsnell What's the status of this? Does Unicode work correctly yet in master?

dmsnell · 2020-12-21T03:41:46Z

@mustafamunawar we are waiting feedback or direction from @NeilFraser. We have been using this branch successfully in Simplenote for the past nine months without issue. You are welcome to try it out and watch this PR for an eventual merge.

dmsnell · 2021-05-22T13:13:54Z

@dmak

first doing the diff on byte level and then mutating (extending / shrinking) a bit the resulting diffs so that they do not eventually stop in the middle of multi-byte sequence - @dmak

this is the basic procedure in this PR as well as has been taken in the dissimilar library linked above. it's a good idea and one that solves the problem of crashing when splitting Unicode boundaries.

linking here to avoid clouding your Maven PR (#2) with noise about Unicode boundaries

dmsnell · 2022-05-26T21:16:02Z

@NeilFraser I ran into this problem again today through a dependency-of-a-dependency pulling in diff-match-patch (via jsondiffpatch) and wondering what I can do to help move this forward.

For Simplenote we're able to pull in this branch and that works, but with the transitive dependencies in other projects we don't get as much of a luxury. Would rather avoid forking the repo especially since everything in the world basically refers to this one (and you've put so much work into it), but I feel like if we don't fix master we're going to continue to spot this Unicode violation over and over again and I don't know how to get there without your help.

Again, I'm happy to put more energy into the fix. We've been going for more than two years with this branch in Simplenote and as far as we can infer the issue is entirely eliminated and no side-effects have crept in. Our users write some very long texts too and we've had no indication of performance issues.

The approach in the Rust library I think is more elegant than my solution but I think mine here is the least intrusive on the overall library, addressing the problem of splitting surrogate pairs only where it becomes a problem, that is, when URL-encoding the patch.

Really appreciate all you've built here and I assume you're busy; just trying to find a way to keep this from being a constant headache.

gamedevsam · 2022-08-24T04:17:48Z

I've tried all 3 solutions (#13, #69, #80) to this problem and #13 is the only one that solved the encodingURI exception for me. I created a minimal repo which clearly reproduces the issue: https://github.com/gamedevsam/diff-match-patch-emoji-issue

Hope it's helpful to make this solution more robust.

dmsnell · 2022-09-10T20:59:47Z

@gamedevsam I think it's possible you confused diff-match-patch with jsondiffpatch. This patch has been working fine with Simplenote for coming up on three years now and has been fully reliable.

It does limit itself to toDelta, as noted, and jsondiffpatch is using the toText functions. If you'd like you can propose a commit on this PR to add support there as the process should be the same as with toDelta. I'm busy on other things so I probably won't do this any time soon without other contributions and since @NeilFraser has been silent on the issue (if @NeilFraser could give us some direction or indication that this would merge I'd go ahead and clean up the patch).

Unfortunately #13 has a massive performance problem particularly for long texts and likely that approach (converting strings to arrays of integers) won't be viable for a generic library like this, especially since this patch (#80) has the fix to get to the same result.

You can check with the author of jsondiffpatch and look at these lines. Get that library to use this patch and then apply the same updates to toText as we have with toDelta and it should work for you 👍

michal-kurz · 2022-10-04T15:12:52Z

@dmsnell Thank you so much for all the work you've put into this accross all the threads! I initially tried fixing this myself, and you helped me realize my approach was way too naive! :)

We use jsondiffpatch, and are running into this problem. Since this repo seems to be dead, I would love to make a public jsondiffpatch fork that fixes this issue. I came up with a POC fix of toText() function which builds on top of your PR, Here it is. Would you please care to review it? It seems to fix our use-case, but it's very difficult for me predict if it could introduce any new problems (the tests pass, but that doesn't give me much confidence) - maybe you will see something I don't :)

About my solution:

I pretty much just separated your surrogate-pair-joining logic into a separate function, as it seems to me like a concern separate from any other - it may make sense to use it in toDelta() also, but I didn't do that for now to keep the diff minimal
I'm also wondering whether it's optimal to be fixing this problem at point of toDelta()/toText() conversion - have you maybe considered fixing at patch creation, so this problem would never occur in the first place? Maybe inside patch_make, or something. Although it may very well be that other functions like diff_cleanupSemantic could re-break the patch... I haven't tried this, just wondering if there could be a central way to fix this once for good :)
It's quite clear that dpm prioritizes performance above all else, and it's hard for me to reason about such micro-optimizations.
- Do you think there is any value in not creating a separate newDiffs array, but writing into the existing one instead (and popping the difference in length in the end, since your algo skips empty patches)?
- Likewise, it might be preferable not to use my function inside toDelta(), as it would require an additional iteration of all diffs for each patch

🙏

dmsnell · 2022-10-04T20:05:13Z

Nice work @michal-kurz. If you want to propose a PR with that code against this branch we can merge it in. I'd like to explore using your function in toDelta, and I'd like to look at removing the extra iteration, as I don't believe it offers any value above the cloning method you took (though I'm happy to be shown something I'm overlooking!)

michal-kurz · 2022-10-04T23:22:39Z

@dmsnell Thank you! I made a PR to your branch. Unfortunately there are some unwanted changes in whitespaces comming with it - I spent the last 25 minutes failing to undo them, and the Github diff keeps looking different from my local, no idea why. I'm way too frustrated to continue with that at this point 😠

As for incorporating the splitSurrotagePairs into toDelta: yes, I don't think there is anything to gain other than proper separation of that concern, which I tend to generally value much higher than preventing an additional iteration - I really don't think there is a clean way to retain both. But I guess you could find some middle ground (i.e. sending a callback inside splitSurrogatePairs to do the bidding of the invoker rather than returning corrected diffs).

dmsnell · 2022-10-04T23:34:47Z

Unfortunately there are some unwanted changes in whitespaces comming with it - I spent the last 25 minutes failing to undo them, and the Github diff keeps looking different from my local

oh the pain; I feel you: been there, done that. typically I find that some IDE is adding or removing the extra whitespace. did you try using nano or vim or pico or some editor that won't muck around with anthing?

might also verify you don't have custom pre-commit hooks that are running formatting.

by the way I suspect you meant to poke me, @dmsnell, and not dmaclach

michal-kurz · 2022-10-04T23:55:06Z

@dmsnell Solved, I ended up soft-resetting, re-commiting only the desired lines and force-pushing the result.

And yes, I did mess up the mention tag, also fixed.

michal-kurz · 2022-10-06T22:50:00Z

So unfortunately I found my solution for toText to be incomplete.

I sometimes end up with a single surrogate on the very edge of the whole patch (first char of first diff, or last char of last diff). For example:

str1 = '. 🔧🚗\n\--X'
str2 = '. 🔧🚗\n\--Y'

dmp.patch_make(str1, str2)
// Results in three diffs:
// [
//    {0: 0, 1: '\uDE97\n--'},
//    {0: -1, 1: 'X'},
//    {0: 1, 1: 'Y'}
// ]

Such case cannot be fully "losslessly" shifted with @dmsnell 's approach - I have to decide to either expand or shrink the whole diff scope.

In my case, the broken diffs were created as prefix and suffix generated inside patch_addContext_ - those happen to get sampled at a fixed distance from the actual change.

I patched those by extending them one character outward in case it's an unpaired surrogate, and this seems to have solved the problem (here is a PR targeted at our fork of dmp). I decided for this approach, because I'm worried that shrinking the scope of the diff could cause some unforseen problems (mainly when coverting back to patch via fromText and then using inside other functions). And I unfortunately couldn't do this inside toText, because I don't have access to the full text there (and thus don't know what character to expand my diff with).

I hope this is the only potential source of such broken patch. If so, the toDelta solution in this PR should be safe, as prefix and suffix are both DIFF_EQUAL, and toDelta is only concerned about diff's length in that case, not its contents.

dmsnell · 2023-05-10T15:48:00Z

@michal-kurz I added a patch that's almost identical to yours here (should have read yours first 🙃) but I make a couple extra bounds checks to avoid extending the patch context before or after the document boundaries.

had to apply both your fix though and the _toDelta fix at the same time else we were still getting split surrogates in the patch_ family of functions. by moving my fix into a new cleanupSplitSurrogates function though that seems to cover the bases.

one thing I wish were better is that when I moved cleanupSplitSurrogates into the same chain of fixup commands through the stages of diffing, it seemed to break. as is, it appears to be necessary as the last cleanup in the chain, reserved for output functions that encode URIs.

that being said, we're still without regression in Simplenote and other places at Automattic where we've deployed this patch and it's been a few years now, so I think whatever practical concerns there may be with this approach, it's clearly better than master and clearly not introducing surprise gotchas that will amount to anything near the existing problems. in other words, we haven't found any problems with it even though there could be something in this I've missed.

if there are problems with the approach in this PR, they are so rare that it's not comparable to the wildly-common bugs present in master.

Resolves google#69 for JavaScript Sometimes we can find a common prefix that runs into the middle of a surrogate pair and we split that pair when building our diff groups. This is fine as long as we are operating on UTF-16 code units. It becomes problematic when we start trying to treat those substrings as valid Unicode (or UTF-8) sequences. When we pass these split groups into `toDelta()` we do just that and the library crashes. In this patch we're post-processing the diff groups before encoding them to make sure that we un-split the surrogate pairs. The post-processed diffs should produce the same output when applying the diffs. The diff string itself will be different but should change that much - only by a single character at surrogate boundaries.

Resolves google#69 for Java Sometimes we can find a common prefix that runs into the middle of a surrogate pair and we split that pair when building our diff groups. This is fine as long as we are operating on UTF-16 code units. It becomes problematic when we start trying to treat those substrings as valid Unicode (or UTF-8) sequences. When we pass these split groups into `toDelta()` we do just that and the library crashes. In this patch we're post-processing the diff groups before encoding them to make sure that we un-split the surrogate pairs. The post-processed diffs should produce the same output when applying the diffs. The diff string itself will be different but should change that much - only by a single character at surrogate boundaries.

Resolves google#69 for Objective-C Sometimes we can find a common prefix that runs into the middle of a surrogate pair and we split that pair when building our diff groups. This is fine as long as we are operating on UTF-16 code units. It becomes problematic when we start trying to treat those substrings as valid Unicode (or UTF-8) sequences. When we pass these split groups into `toDelta()` we do just that and the library crashes. In this patch we're post-processing the diff groups before encoding them to make sure that we un-split the surrogate pairs. The post-processed diffs should produce the same output when applying the diffs. The diff string itself will be different but should change that much - only by a single character at surrogate boundaries.

Resolves google#69 for Python2 Sometimes we can find a common prefix that runs into the middle of a surrogate pair and we split that pair when building our diff groups. This is fine as long as we are operating on UTF-16 code units. It becomes problematic when we start trying to treat those substrings as valid Unicode (or UTF-8) sequences. When we pass these split groups into `toDelta()` we do just that and the library crashes. In this patch we're post-processing the diff groups before encoding them to make sure that we un-split the surrogate pairs. The post-processed diffs should produce the same output when applying the diffs. The diff string itself will be different but should change that much - only by a single character at surrogate boundaries.

Resolves google#69 for Python3 Sometimes we can find a common prefix that runs into the middle of a surrogate pair and we split that pair when building our diff groups. This is fine as long as we are operating on UTF-16 code units. It becomes problematic when we start trying to treat those substrings as valid Unicode (or UTF-8) sequences. When we pass these split groups into `toDelta()` we do just that and the library crashes. In this patch we're post-processing the diff groups before encoding them to make sure that we un-split the surrogate pairs. The post-processed diffs should produce the same output when applying the diffs. The diff string itself will be different but should change that much - only by a single character at surrogate boundaries.

dmsnell · 2024-01-31T00:06:12Z

All: I have merged this into my fork and it's now in the master branch.
I'd welcome help in porting the fix to the other libraries, or in rewriting the fix to use the approach taken by dissimilar.

Unfortunately I won't have a lot of time to actively maintain the fork, but I will do my best to be responsive and help merge in updates.

dmsnell mentioned this pull request Nov 9, 2019

JS: Handle surrogate pairs correctly #69

Open

dmsnell changed the title ~~JavaScript: Stop breaking surrogate pairs in toDelta()~~ Stop breaking surrogate pairs in toDelta() Nov 9, 2019

dmsnell force-pushed the issues/69-broken-surrogate-pairs branch from 01426e2 to 71646fb Compare November 9, 2019 21:07

dmsnell mentioned this pull request Nov 12, 2019

Fix: Stop splitting surrogate-pairs in diff-match-patch deltas Simperium/node-simperium#91

Merged

dmsnell mentioned this pull request Nov 13, 2019

Fix: No more crashing/skipped changes for certain changes Automattic/simplenote-electron#1714

Merged

vicmeow mentioned this pull request Dec 1, 2019

[studio] Emoji in string field sometimes crashes document sanity-io/sanity#1643

Closed

dmsnell force-pushed the issues/69-broken-surrogate-pairs branch 2 times, most recently from fbb8c73 to 55a8779 Compare December 13, 2019 06:47

dmsnell changed the title ~~Stop breaking surrogate pairs in toDelta()~~ Stop breaking surrogate pairs in toDelta()/fromDelta() Dec 13, 2019

theck13 mentioned this pull request Dec 27, 2019

Diff Match Patch Simperium/simperium-android#215

Merged

dmsnell mentioned this pull request Jan 7, 2020

Specify indexing/length units #83

Open

dmsnell force-pushed the issues/69-broken-surrogate-pairs branch from bdc4740 to fa122d3 Compare January 14, 2020 01:49

dmsnell mentioned this pull request Feb 5, 2020

Fix: Second update to diff-match-patch fix Simperium/node-simperium#97

Merged

valr added a commit to valr/website that referenced this pull request May 3, 2021

include google/diff-match-patch#80

846f6c3

This was referenced May 24, 2021

Use UTF-32 Aware String in JavaScript #13

Open

Diff breaks unicode characters for emojis #59

Open

dmsnell mentioned this pull request Jul 19, 2021

Port diff_match_patch into Typescript #74

Open

3 tasks

michal-kurz mentioned this pull request Oct 4, 2022

Stop breaking surrogate pairs in toDelta()/fromDelta() feedyou-ai/diff-match-patch#1

Merged

dmsnell mentioned this pull request Oct 4, 2022

Fix surrogate pairs splitting in toText()/fromText() feedyou-ai/diff-match-patch#2

Merged

michal-kurz mentioned this pull request Oct 4, 2022

Library throws "URI malformed" error when creating patch with emojis JackuB/diff-match-patch#22

Open

dmsnell mentioned this pull request Mar 21, 2023

Suggested fix for NullReferenceException in C# #140

Open

stanley2058 mentioned this pull request May 11, 2023

fix/stop breaking surrogate pairs hackmdio/diff-match-patch#2

Merged

dmsnell mentioned this pull request May 16, 2023

Project abandoned / dead / maintained? #143

Open

carlgieringer mentioned this pull request Jul 2, 2023

Ensure that anchors work with multibyte Unicode Howdju/howdju#441

Open

dmsnell mentioned this pull request Jan 4, 2024

Patch margin index splits Unicode surrogate pairs unexpectedly #149

Open

dmsnell added 5 commits January 30, 2024 16:51

dmsnell force-pushed the issues/69-broken-surrogate-pairs branch from d681ef6 to 50f1542 Compare January 30, 2024 23:54

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Stop breaking surrogate pairs in toDelta()/fromDelta() #80

Stop breaking surrogate pairs in toDelta()/fromDelta() #80

dmsnell commented Nov 9, 2019 •

edited

dmsnell commented Dec 5, 2019

dmsnell commented Dec 13, 2019 •

edited

NeilFraser commented Dec 17, 2019

dmsnell commented Dec 17, 2019 •

edited

dmsnell commented Dec 20, 2019

salzhrani commented Jan 14, 2020

dmsnell commented Mar 13, 2020

dmsnell commented Mar 29, 2020

mustafamunawar commented Dec 21, 2020 •

edited

dmsnell commented Dec 21, 2020

dmsnell commented May 22, 2021

dmsnell commented May 26, 2022

gamedevsam commented Aug 24, 2022

dmsnell commented Sep 10, 2022

michal-kurz commented Oct 4, 2022 •

edited

dmsnell commented Oct 4, 2022

michal-kurz commented Oct 4, 2022 •

edited

dmsnell commented Oct 4, 2022

michal-kurz commented Oct 4, 2022

michal-kurz commented Oct 6, 2022 •

edited

dmsnell commented May 10, 2023

dmsnell commented Jan 31, 2024

Stop breaking surrogate pairs in toDelta()/fromDelta() #80

Are you sure you want to change the base?

Stop breaking surrogate pairs in toDelta()/fromDelta() #80

Conversation

dmsnell commented Nov 9, 2019 • edited

Status

Todo:

Major changes

dmsnell commented Dec 5, 2019

dmsnell commented Dec 13, 2019 • edited

NeilFraser commented Dec 17, 2019

dmsnell commented Dec 17, 2019 • edited

dmsnell commented Dec 20, 2019

salzhrani commented Jan 14, 2020

dmsnell commented Mar 13, 2020

dmsnell commented Mar 29, 2020

mustafamunawar commented Dec 21, 2020 • edited

dmsnell commented Dec 21, 2020

dmsnell commented May 22, 2021

dmsnell commented May 26, 2022

gamedevsam commented Aug 24, 2022

dmsnell commented Sep 10, 2022

michal-kurz commented Oct 4, 2022 • edited

dmsnell commented Oct 4, 2022

michal-kurz commented Oct 4, 2022 • edited

dmsnell commented Oct 4, 2022

michal-kurz commented Oct 4, 2022

michal-kurz commented Oct 6, 2022 • edited

dmsnell commented May 10, 2023

dmsnell commented Jan 31, 2024

dmsnell commented Nov 9, 2019 •

edited

dmsnell commented Dec 13, 2019 •

edited

dmsnell commented Dec 17, 2019 •

edited

mustafamunawar commented Dec 21, 2020 •

edited

michal-kurz commented Oct 4, 2022 •

edited

michal-kurz commented Oct 4, 2022 •

edited

michal-kurz commented Oct 6, 2022 •

edited