Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rethink parsing, caching, and refreshing #2985

Open
tarsius opened this issue Feb 3, 2017 · 5 comments
Open

Rethink parsing, caching, and refreshing #2985

tarsius opened this issue Feb 3, 2017 · 5 comments
Labels
area: abstraction enhancement New feature or request

Comments

@tarsius
Copy link
Member

tarsius commented Feb 3, 2017

For a variety of reasons, not least in order to improve performance, it is necessary to rethink how Magit parses and caches complex git output, and how sections that are created from such output are being updated and cached.

Here I will discuss this with a focus on git diff output. Other "complex git output" includes that of git log and git blame. I am excluding git log from the discussion here because I am undecided on whether I want to continue to parse the output of that sub-command or switch to using libgit2 for logs.

"Non-complex git output" is output that can easily be turned into a list of items or a single value, i.e. lists and values that we in the future want to get by calling a libgit2 function. Some of the things said below also apply to such "non-complex" output, for example because the output is simple, it might never-the-less be expensive to calculate it, so doing it asynchronously would be desirable here also.


These are some things I want to change:

  1. All parsing of git output should happen in a separate buffer, not in the buffer into which the parsed output will ultimately be inserted (a process currently known as "washing").

This has a few benefits:

  • We know whether there actually is anything to be inserted before we start doing so. Currently it can happen that a section has to be "canceled", after its heading was already inserted.

  • Once we also start updating sections asynchronously, not doing the parsing in a separate buffer would mean that multiple processes may be inserting into the same buffer at the same time.

  • Once we start using Emacs' upcoming thread support, we can only take full advantage of that if the various threads operate in separate buffers.

  • The output of multiple git invocations, as well as other information, can more easily be combined into one amalgamation. Doing it in separate buffers will also make it easier to delay the "refinement" steps.

    For example we might want to combine the output of git diff and git diff --word-diff (see {TODO separate issue}), and also the syntax-highlighting extracted from the corresponding file-/blob-visiting buffers (see Diffs should optionally support syntax highlighting #2942).

  • The parsing buffers can be used as a cache.

  • Using a separate buffer that the user never sees means that it can contain partially parsed output, which doesn't even have to be hidden.

  1. All parsing should happen asynchronously.

On the command line git diff gives immediate results, even for huge changes. That's because only enough differences have to be formatted to fill the screen. The rest is calculated asynchronously and by the time the user starts scrolling down, it is ready.

Magit on the other hand currently runs git diff synchronously, which means that all differences have to be calculated and formatted by git up-front. Magit already delays parsing the git output, by leaving hunk bodies "unwashed" until they are expanded, but that is not enough. The "washing" is slow and it helps to delay that, but for large changes it also takes git itself a long time to calculate and format the differences.

  1. Allow outdated or partially formatted information.

We can't really do (3) without this.

For example allow parts of a diff to only list the changed files, but not yet the hunk headings and their content. Do so by first running git diff --raw first, then git diff -- FILE for the parts that are visible and expanded, then for parts that are expanded but not visible, and finally for the parts that are not visible and not expanded.

Here "invisible" means part of the buffer that don't fall into the region from the window beginning to the window end.

  • Do this on the buffer level also. If the status buffer is not visible, then it does not have to be updated immediately. But when it does become visible then it should immediately be updated (as described above).

  • This mainly concerns diffs, lists of untracked files (which can be very expensive to calculate in large repositories with bad ignore rules) and maybe logs. Other information, mainly things presented in the header, but also stashes (which are cheap to list, and you really don't want to drop the wrong one because the numbering has changed, but not in the ui) should still be updated synchronously. In rare cases that could lead to a noticeable delays, but better safe than sorry (and complicated).

  • Mark sections that are outdated, maybe even using a spinner (unless that turns out to be too costly). E.g. if a section ought to be expanded (because it was before the refresh), but isn't because git diff hasn't been run without --raw for the respective file yet, then this has to be visualized somehow.

  • In practice I expect users to not notice that the buffer has not been fully updated yet in most cases. The yet to be updated parts should usually not be visible in the window and unless the user starts to immediately scroll, they will be updated by the time s/he does.

  1. Define sections as objects.

    • Currently Magit uses CL structs, in the future we will use CLOS (Eieio) objects. Structs are insufficient because, among other things, no sub-classing is possible. Currently all section types are represented using the same struct type, which has a type slot. That makes dispatching much harder and we currently use some ugly special purpose dispatch macros to do so. Using CLOS will allow the use of generic functions as a dispatch mechanism.
  2. Because there currently is no sub-classing, each sub-type has to handled explicitly by each and every function or command that deals with a certain "section type". As a result we have been avoiding sub-types.

    • This is an even bigger issue for third-party extensions, as they cannot just add a sub-type without coordinating with us. We would have to adjust each an every function that deals with the base type, just because some extension needs a specialized variant.

    • Using generic functions will also make it easier for users to customize behavior, but only in certain contexts. Eieio is actually more flexible than CLOS in that it can dispatch on things other than arguments. It would e.g. be possible to dispatch based on the major-mode of the current buffer. Or the parent section.

  3. Cache sections and buffers.

    • Currently Magit caches calls to git during a refresh. This cache is mainly useful because it reduces a very large amount of calls to git to a manageable amount of calls. This cache can likely be removed once we start using libgit2. Note that calls to git diff and git log are not cached, and those are the most expensive ones. Also this caches git output, not the sections creates from that output, and that's also expensive, especially for diffs.

    • Caching sections avoids more work than caching output. It should also not be confused with not updating sections that don't have to be updated (see below).

    • An example where buffer caching is useful is showing one rev in the rev buffer, then another, and then the previous one again. Currently showing the same rev again requires doing it from scratch again. In this particular case it is best to cache the complete buffer. I have to investigate whether buffer-swap-text could be used for that purpose.

    • Currently a section is only associated with the buffer in which it is shown. Actually it is text in that buffer which is associated with the section object. Section objects also contain slots for their parents and children. But there is no association between a section and the raw data it represents.

    • That has some drawbacks, for example when applying a hunk, the actual hunk text has to be reconstructed from text in the display buffer as well as information stored in the object (the header part, including line numbers but also permission changes and such).

    • If the raw diff were preserved in a parsing buffer and a hunk object contained a reference to the relevant part of that buffer, then things would get easier.

  4. Don't recreate sections that don't need to be recreated.

    • For example log sections don't have to be updated after staging a change - staging doesn't create commits, so logs stay the same.

    • When a certain section has to be updated and when not, has to be defined somewhere. I tend to think that this is the responsibility of the section classes, not the commands that use them.

  5. I have not yet decided on a design pattern. I will have to read up the various patterns and am actually quite excited to do that. Above I mostly describe issues and very generic ideas on how to tackle those. Now I have to do some research to avoid known pitfalls. All I know for sure is that I want to use the currently used "dump text into the visible buffer and bang it into shape in place" pattern that is currently being used ;-) Oh, and I am rather certain that this will involve some OO approach, because, as hinted at above, almost completely avoiding that leads to some very serious issues.

    • I will start with a basic poc implementation, which does not support all the features needed by Magit. This will allow experimenting with various approaches before settling on one, and reduces the risk of sticking to one because a lot of work already went into it.
  6. If an unexpected error occurs when running git to produce some output, then display that error instead of failing silently.

  7. A DOM is on the table.

@alphapapa
Copy link
Contributor

Hi Jonas,

Forgive the noise, but after making magit-todos, I was thinking that the way the section code works could be very useful in other packages. For example, imagine if the Org Agenda used Magit sections: it would be easy to collapse them, count items, etc. From what I've seen so far, Magit sections are much nicer to use than manually working with text properties. I could also imagine using it in our matrix-client package to display things.

So, have you thought about factoring out the sections into some kind of library that could be used by other packages? Forgive me if you've discussed this elsewhere, but this seems like an appropriate issue to mention it on.

Thanks.

@brotzeit
Copy link
Contributor

There's already an open issue #2956

@alphapapa
Copy link
Contributor

Thank you! I thought I had seen something like it before, but I couldn't remember...

@marc-h38
Copy link

marc-h38 commented Sep 28, 2021

Once we also start updating sections asynchronously, not doing the parsing in a separate buffer would mean that multiple processes may be inserting into the same buffer at the same time.

Can you confirm "at the same time" implies running multiple git commands concurrently? Asynchronous != concurrent.

I'm using magit over tramp the vast majority of the time and after some basic tracing I came to the conclusion that (when using tramp) network latency is the main performance issue - as it most often is. This is of course especially true with remote systems located on other continents ( 100-200ms) but even for closer ones network latency is most likely the bottleneck. Moreover close systems can often "appear" far away because of https://en.wikipedia.org/wiki/Bufferbloat (especially on wifi) and other semi-hidden latency issues that (highly parallelized) web browsers don't notice.

Depending on your magit configuration, the status buffer seems to require between 10 and 20 git commands. When each git command must pay a 200ms network round trip, each status update takes at least 2 seconds.

The only way to solve network latency issues like this is to run multiple git commands in parallel (or fewer of them). As already alluded to, this is exactly what web browsers and others do. Besides increasing the speed of light, there is simply no other way. EDIT: and it leverages multiple cores too; even locally.

This and #2982 describe a switch to libgit2 but I understand this won't apply to tramp, correct?

PS: reproducing latency issues even when you don't have any is straight forward with Linux's netem driver:
https://www.pico.net/kb/how-can-i-simulate-delayed-and-dropped-packets-in-linux/
https://unix.stackexchange.com/questions/432925/netem-how-to-delay-packets-sent-to-received-from-some-host This is also a simple way to "amplify" the cost of running too many git commands and notice it immediately without any actual tracing. It works on the loopback interface / localhost too.

PPS: of course I'm not masochistic and I'm trying to use far away systems as rarely as possible but sometimes I have to + again latency happens with closer systems too; looking at you bufferbloated wifi driver.

@tarsius
Copy link
Member Author

tarsius commented Sep 28, 2021

Can you confirm "at the same time" implies running multiple git commands concurrently?

Confirmed.

This and #2982 describe a switch to libgit2 but I understand this won't apply to tramp, correct?

Correct.

Tramp isn't a priority. It's nice that it works and all (even if slow), but improving local "performance" is the main focus.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area: abstraction enhancement New feature or request
Development

No branches or pull requests

4 participants