Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

tree -> index diff for status #1363

Merged
merged 9 commits into from
May 14, 2024
Merged

tree -> index diff for status #1363

merged 9 commits into from
May 14, 2024

Conversation

Byron
Copy link
Owner

@Byron Byron commented May 7, 2024

Based on #1350


diff-correctness → gix-status → gix reset


Improve gix status to the point where it's suitable for use in reset functionality.
Leads to a proper worktree reset implementation, eventually leading to a high-level reset similar to how git supports it.

Architecture

The reason this PR deals quite a bit with gix status is that for a safe implementation of reset() we need to be sure that the files we would want to touch don't don't carry modifications or are untracked files. In order to know what would need to be done, we have to diff the current-index with target-index. The set of files to touch can then be used to lookup information provided by git-status, like worktree modifications, index modifications, and untracked files, to know if we can proceed or not. Here is also where the reset-modes would affect the outcome, i.e. what to change and how.

This is a very modular approach which facilitates testing and understanding of what otherwise would be a very complex algorithm. Having a set of changes as output also allows to one day parallelize applying these changes.

This leaves us in a situation where the current checkout() implementation wants to become a fastpath for situations where the reset involves an empty tree as source (i.e. create everything and overwrite local changes).

Extra Tasks

Out-of-band tasks that just should finally be done, with potential for great impact.

  • support for hasconfig as part of resolve_includes() without actual lookahead.

Tasks

  • diff index with index to learn what we would want to do in the worktree, or alternatively,
    diff tree with index (with reverse-diff functionality to simulate diff of index with tree), for better performance as it
    would avoid having to allocate a whole index even though we are only interested in a diff.
    • Must include rename tracking.
  • how to make diff results available from status with all transformations applied, to allow user to obtain diffs of any kind? Compare to tree-tree-diff which has that functionality already.
  • update is_dirty() and Submodule::status() to do full status.

Status Enables

  • cargo package and its use of complete status information.
  • Add gitoxide backend built#1
  • starship native dirty check (but needs diffstats of worktree-vs-index)
  • built can get fully-functional is-dirty flag for 'describe()'

Inbetween

Next PR: Reset

  • reset() that checks if it's allowed to perform a worktree modification is allowed, or if an entry should be skipped. That way we can postpone safety checks like --hard

Postponed

What follows is important for resets, but won't be needed for cargo worktree resets.

  • a way to expand sparse dirs (but figure out if this is truly always necessary) - probably not, unless sparse dirs can be empty, but even then no expansion is needed
    • wire it up in gix index entries to optionally expand sparse entries
  • gix status with implemented 'porcelain-v2` display mode

Research

  • Ignored files are considered expandable and can be overwritten on reset
  • How to integrate submodules - probably easy to answer once gix status can deal a little better with submodules. Even though in this case a lot of submodule-related information is needed for a complete reset, probably only doable by a higher-level caller which orchestrates it.
  • How to deal with various modes like merge and keep? How to control refresh? Maybe partial (only the files we touch), and full, to also update the files we don't touch as part of status? Maybe it's part of status if that is run before.
  • Worthwhile to make explicit the difference between git reset and git checkout in terms of HEAD modifications. With the former changing HEADs referent, and the latter changing HEAD itself.
  • figure out how this relates to the current checkout() method as technically that's a reset --hard with optional overwrite check. Could it be rolled into one, with pathspec support added?
    • just keep them separate until it's clear that reset() performs just as well, which is unlikely as there is more overhead. But maybe it's not worth to maintain two versions over it. But if so, one should probably rename it.
  • for git status: what about rename tracking? It's available for tree-diffs and quite complex on its own. Probably only needs HEAD-vs-index rename tracking. No, also can have worktree rename tracking, even though it's hard to imagine how this can be fast unless it's tightly integrated with untracked-files handling. This screams for a generalization of the tracking code though as the testing and implementation is complex, but should be generalisable.

Re-learn

  • pathspecs normalize themselves to turn from any kind of specification into repo-root relative patterns.
  • attribute/ignore file sources are naturally relative to the root of the repo, which remains relative (i.e. can be .. and that root will be always be used to open files like ../.gitignore, which is useful for display to the user)

turns out that Powershell is used by default there, and variable substitutions
are different there.
What's worse is that it doesn't seem to fail if something doesn't work as one
would expect.
@EliahKagan
Copy link
Contributor

Regarding 91b6549, is it required to use PowerShell? Otherwise, shell: bash can be used, even on Windows, which gives Git Bash, so anything that is due to PowerShell rather than a POSIX-compatible shell being used may be fixable that way.

@Byron
Copy link
Owner Author

Byron commented May 8, 2024

Thanks for the reminder! Yes, that would have worked too. Overall, I think using the ${{ env.var }} syntax is more portable and should be preferred. There might be other cases where it's definitely better to change the shell though.

@EliahKagan
Copy link
Contributor

Yes, there is nothing wrong with ${{ env.var }}. The only reason I mentioned shell: bash was this bit from the commit message:

What's worse is that it doesn't seem to fail if something doesn't work as one
would expect.

Achieving the effect of set -e or set -u is less obvious and automatic in PowerShell than in POSIX-compatible shells, so I figured the ability to use bash might be relevant there.

@Byron
Copy link
Owner Author

Byron commented May 8, 2024

Achieving the effect of set -e or set -u is less obvious and automatic in PowerShell than in POSIX-compatible shells, so I figured the ability to use bash might be relevant there.

And after writing the bit about powershell I also realized that I am so used to set -euo pipefail that maybe I forgot how bash would have done the same in that situation. But I am not sure how the defaults are set on CI, where one would hope a failing command indeed triggers failure by default.

Previousoly it would make an additional syscall to change permissions, which
is slower than necessary.
@Byron Byron force-pushed the status branch 5 times, most recently from 9ca81df to 7c9186a Compare May 11, 2024 19:39
…ng a single index.

The motivating example is here: praetorian-inc/noseyparker#179

Previously, it was possible for a trampling herd of threads to consolidate the
disk state. Most of them would be 'needs-init' threads which could notice that
the initialization already happened, and just use that.

But a thread might be late for the party and somehow manages to not get any
newly loaded index, and thus tries to consolidate with what's on disk again.
Then it would again determine no change, and return nothing, causing the caller
to abort and not find objects it should find because it wouldn't see the index
that it should have seen.

The reason the thread got into this mess is that the 'is-load-ongoing' flagging
was racy itself, so it would not wait for ongoing loads and just conclude nothing
happened. An extra delay (by yielding) now assures it either seees the loading state
and waits for it, sees the newly loaded indices.

Note that this issue can be reproduced with:

```
'./target/release/gix -r repo-with-one-pack -t10 --trace odb stats --extra-header-lookup'
```
Related to helix-editor/helix#10660
which runs into object types that are unknown.

I have looked into this and [couldn't find evidence of a new pack-entry type](https://github.com/git/git/blob/0f3415f1f8478b05e64db11eb8aaa2915e48fef6/packfile.c#L1303-L1358)
in the Git codebase.

It also looks like that Git [will never write packs that aren't V2](https://github.com/git/git/blob/0f3415f1f8478b05e64db11eb8aaa2915e48fef6/pack-write.c#L352)
 - initially I thought it might write V3 based on some other criteria.

For now, the thesis is that one would have to be able to mark bad objects
to be able to handle this more gracefully, but all we can do is try to fail.
@Byron Byron force-pushed the status branch 5 times, most recently from 25e43ec to db2acf9 Compare May 13, 2024 12:29
@Byron Byron linked an issue May 13, 2024 that may be closed by this pull request
@Byron Byron merged commit 04ef31e into main May 14, 2024
19 checks passed
@Byron Byron mentioned this pull request May 14, 2024
9 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants