Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement Myers Heuristics #15

Open
mitsuhiko opened this issue Feb 17, 2021 · 4 comments
Open

Implement Myers Heuristics #15

mitsuhiko opened this issue Feb 17, 2021 · 4 comments

Comments

@mitsuhiko
Copy link
Owner

mitsuhiko commented Feb 17, 2021

GNU diff and others have some internal heuristics to bail if there are too many changes. There are basically two optimizations:

  • Discard lines which are completely distinct between two files: this has the advantage that they are entirely ignored by the diffing algorithm which helps to reduce how much has to be diffed greatly
  • Bail if the snake becomes too long

https://github.com/reviewboard/reviewboard/blob/master/reviewboard/diffviewer/myersdiff.py

To be more aligned with git it might make sense to implement the heuristics in the current Algorithm::Myers variant and have a secondary Algorithm::MyersMinimal which has these heuristics disabled (git calls the variants myers and minimal).

These heuristics are likely needed as currently lcs outperform myers greatly if used on completely distinct files.

@potocpav
Copy link

potocpav commented Dec 8, 2021

I found a way to easily implement the distinct-line heuristic. It doesn't modify the Myers algorithm at all, and rather builds on top of it. The algorithm works in three steps:

1. Encode the input lists of type Vec<T> into an "optimized" representation:

enum Elems<T> {
    UniqueRun(Vec<T>),
    NormalElem(T),
}

In this representation, consecutive unique lines are concatenated into UniqueRun, and non-unique lines are just inside NormalElem. The whole sequence is then Vec<Elems<T>>.

2. Perform Meyers diff on Vec<Elems<T>>, instead of Vec<T> directly.

3. Decode Vec<Elems<T>> back into Vec<T> to get the result.

This solves the super common pathological case of nearly-distinct files. However, if unique and non-unique lines are mixed together, it still fails.

I implemented this algorithm for the Pijul crate. I don't know how easy/hard it would be to adapt for this crate. Just leaving this here in case it's useful.

@mitsuhiko
Copy link
Owner Author

Thank you for that @potocpav. I will have a look and evaluate this. The underlying design is still somewhat similar to pijul so it should be easy enough to adapt.

@P-E-Meunier
Copy link
Collaborator

More specifically, this project started as a fork of "diffs", right?

@mitsuhiko
Copy link
Owner Author

Yep. See also #1

Wilfred added a commit to Wilfred/difftastic that referenced this issue Jan 15, 2023
This substantially improves performance on text files where there are
few lines in common.

For example, 10,000 line files with no lines in common is more than 10x
faster (8.5 seconds to 0.49 seconds on my machine), and
sample_files/huge_cpp_before.cpp is nearly 2% faster.

Fixes the case mentioned by @quackenbush in #236.

This is inspired by the heuristics discussions at
mitsuhiko/similar#15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants