
add cleanup sub-command that remove local bugs and identities #933

Merged
merged 3 commits into MichaelMure:master from feat-cleanup-subcommand on Jan 14, 2023

Conversation

zinderic
Contributor

Related to #676

I've added a subcommand called "cleanup" that removes local bugs and identities.

@MichaelMure
Owner

Not too bad really, but it needs some adjustments.
Thanks for doing this.

Resolved review threads (outdated):
  • entity/dag/entity_actions.go
  • entity/dag/entity_actions_test.go
  • commands/cleanup.go (2 threads)
@MichaelMure
Owner

Looking at your PR, I realized that it should be up to the cache layer to update or clear the full-text index (Bleve), meaning that those commands should go through the cache, not directly to the repository.

As it turns out, I'm in the process of refactoring that cache (#938), so I'll come back to your PR when I'm done with that. Don't worry it will be merged, it just will require a bit more time.

@zinderic
Contributor Author

zinderic commented Dec 8, 2022

Okay, thanks for letting me know.

@MichaelMure
Owner

PR rebased and adjusted:

  • follow my previous changes in the cache
  • also cleanup the git config (user identity, bridge ...)
  • also cleanup the cache files

Thanks for your help :-)

@MichaelMure
Owner

Hmm, it seems I unearthed an issue around deleting files on Windows. The CI fails with errors like:

    repo_cache_test.go:159: 
        	Error Trace:	D:\a\git-bug\git-bug\cache\repo_cache_test.go:159
        	Error:      	Received unexpected error:
        	            	CreateFile C:\Users\RUNNER~1\AppData\Local\Temp\TestCache1445993478\001\.git\refs\bugs\51932f5d173e9965e67256a5deb8300e91b70f4b1f029331268f66088289086a: The system cannot find the file specified.
        	Test:       	TestCache

The code does look good though, including in go-git, where things seem to be idempotent ...

@zinderic
Contributor Author

Yeah, I don't think it's specific to Windows. I don't have access to a Windows machine right now, but I managed to reproduce the error on my Mac:

kstaykov:git-bug/ (feat-cleanup-subcommand✗) $ go test -v -run TestCache -count=1 ./cache                                                      [16:31:32]
=== RUN   TestCache
    repo_cache_test.go:159: 
                Error Trace:    /Users/kstaykov/proj/git-bug/cache/repo_cache_test.go:159
                Error:          Received unexpected error:
                                open /var/folders/d6/y7xbyvtj337f8njxpkq7kk9c0000gn/T/TestCache3212261930/001/.git/refs/bugs/e87867963cb6e73d40fe9ab0cee9942d1827839610dbbf76019ae395e30c4c41: no such file or directory
                Test:           TestCache
--- FAIL: TestCache (0.24s)
=== RUN   TestCachePushPull
--- PASS: TestCachePushPull (0.41s)
=== RUN   TestCacheEviction
--- PASS: TestCacheEviction (0.14s)
FAIL
FAIL    github.com/MichaelMure/git-bug/cache    1.262s
FAIL
kstaykov:git-bug/ (feat-cleanup-subcommand✗) $  

It's not failing every time, but it fails often, about half the time. I still haven't had much luck identifying the cause, but it looks like some sort of race condition where something tries to access the file before it is there. It feels like the file is created by the logic but not yet synced to disk, so it is unavailable for reading at the time it is accessed. But that's just a theory right now.

@zinderic
Contributor Author

I see what's going on. It's trying to remove the Identity and Bug subcaches. Since that's done using multierr.ErrWaitGroup, each removal runs in its own goroutine, so there is no guaranteed order between them. I don't think the ordering itself is a problem, but when they run concurrently there is a chance of the cache ending up "broken", or in a state where it will be rebuilt.

I tried patching my function to drop multierr.ErrWaitGroup, run the removals sequentially, and return the first error. With that change the test no longer fails, even when executed many times. But that patch might just be hiding a problem in the cache. What do you think?

@MichaelMure
Owner

So you are suggesting that there is concurrent access to something somewhere? If so, that's indeed a bug: concurrently calling RemoveAll() on different entities should work. multierr.ErrWaitGroup should stay.

@zinderic
Contributor Author

Yes, there is a race condition. For some reason it isn't detectable in the current test, but when I write a new test that adds 2 identities and 2 bugs and tries to remove them concurrently, the Go race detector catches it. Here's the test:

func TestCacheConcurrent(t *testing.T) {
	repo := repository.CreateGoGitTestRepo(t, false)
	cache, err := NewRepoCacheNoEvents(repo)
	require.NoError(t, err)
	// Create, set and get user identity
	iden1, err := cache.Identities().New("Kaz", "kaz@ketterdam.grisha")
	require.NoError(t, err)
	err = cache.SetUserIdentity(iden1)
	require.NoError(t, err)
	iden2, err := cache.Identities().New("Inej", "inej@ketterdam.grisha")
	require.NoError(t, err)

	bug1, _, err := cache.Bugs().New("title", "message")
	require.NoError(t, err)
	bug2, _, err := cache.Bugs().New("title", "message")
	require.NoError(t, err)

	go func() {
		cache.Identities().Remove(iden1.Id().String()[:10])
		cache.Identities().Remove(iden2.Id().String()[:10])
		cache.Bugs().Remove(bug1.Id().String()[:10])
		cache.Bugs().Remove(bug2.Id().String()[:10])
	}()

	require.NoError(t, err)
}

This test will reliably fail and the race condition is now detectable:

kstaykov:git-bug/ (feat-cleanup-subcommand✗) $ go test -race -v -run TestCacheConcurrent -count=1 ./cache                                      [10:49:18]
=== RUN   TestCacheConcurrent
==================
WARNING: DATA RACE
Read at 0x00c000423d10 by goroutine 29:
  runtime.mapaccess1_faststr()
      /opt/homebrew/Cellar/go/1.19.3/libexec/src/runtime/map_faststr.go:13 +0x41c
  github.com/MichaelMure/git-bug/repository.(*GoGitRepo).GetIndex()
      /Users/kstaykov/proj/git-bug/repository/gogit.go:330 +0xac
  github.com/MichaelMure/git-bug/repository.(*replaceKeyring).GetIndex()
      <autogenerated>:1 +0x5c
  github.com/MichaelMure/git-bug/cache.(*SubCache[...]).Remove()
      /Users/kstaykov/proj/git-bug/cache/subcache.go:395 +0x2f8
  github.com/MichaelMure/git-bug/cache.TestCacheConcurrent.func1()
      /Users/kstaykov/proj/git-bug/cache/repo_cache_test.go:186 +0xe8

Previous write at 0x00c000423d10 by goroutine 11:
  runtime.mapassign_faststr()
      /opt/homebrew/Cellar/go/1.19.3/libexec/src/runtime/map_faststr.go:203 +0x42c
  github.com/MichaelMure/git-bug/repository.(*GoGitRepo).Close()
      /Users/kstaykov/proj/git-bug/repository/gogit.go:225 +0xa4
  github.com/MichaelMure/git-bug/repository.CreateGoGitTestRepo.func1()
      /Users/kstaykov/proj/git-bug/repository/gogit_testing.go:35 +0x2c
  testing.(*common).Cleanup.func1()
      /opt/homebrew/Cellar/go/1.19.3/libexec/src/testing/testing.go:1041 +0x13c
  testing.(*common).runCleanup()
      /opt/homebrew/Cellar/go/1.19.3/libexec/src/testing/testing.go:1210 +0xe4
  testing.tRunner.func2()
      /opt/homebrew/Cellar/go/1.19.3/libexec/src/testing/testing.go:1440 +0x48
  runtime.deferreturn()
      /opt/homebrew/Cellar/go/1.19.3/libexec/src/runtime/panic.go:476 +0x30
  testing.(*T).Run.func1()
      /opt/homebrew/Cellar/go/1.19.3/libexec/src/testing/testing.go:1493 +0x40

Goroutine 29 (running) created at:
  github.com/MichaelMure/git-bug/cache.TestCacheConcurrent()
      /Users/kstaykov/proj/git-bug/cache/repo_cache_test.go:185 +0x3d8
  testing.tRunner()
      /opt/homebrew/Cellar/go/1.19.3/libexec/src/testing/testing.go:1446 +0x188
  testing.(*T).Run.func1()
      /opt/homebrew/Cellar/go/1.19.3/libexec/src/testing/testing.go:1493 +0x40

Goroutine 11 (running) created at:
  testing.(*T).Run()
      /opt/homebrew/Cellar/go/1.19.3/libexec/src/testing/testing.go:1493 +0x55c
  testing.runTests.func1()
      /opt/homebrew/Cellar/go/1.19.3/libexec/src/testing/testing.go:1846 +0x90
  testing.tRunner()
      /opt/homebrew/Cellar/go/1.19.3/libexec/src/testing/testing.go:1446 +0x188
  testing.runTests()
      /opt/homebrew/Cellar/go/1.19.3/libexec/src/testing/testing.go:1844 +0x6c0
  testing.(*M).Run()
      /opt/homebrew/Cellar/go/1.19.3/libexec/src/testing/testing.go:1726 +0x870
  main.main()
      _testmain.go:59 +0x2fc
==================
    testing.go:1319: race detected during execution of test
--- FAIL: TestCacheConcurrent (0.16s)
=== CONT  
    testing.go:1319: race detected during execution of test
FAIL
FAIL    github.com/MichaelMure/git-bug/cache    0.762s
FAIL
kstaykov:git-bug/ (feat-cleanup-subcommand✗) $ 

@MichaelMure
Owner

@zinderic I looked at it, and the race you found happens because you don't wait for your goroutine in the test, so Remove() is being called at the same time as Close() on the cache (triggered by reaching the end of the test). Arguably Close() could lock the mutexes, but that wouldn't really help: the cache would be in a bad state after Close() anyway, and you are not supposed to do that.
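The missing synchronization can be sketched with a sync.WaitGroup; the closure below is a stand-in for the Remove() calls in the test, not the actual git-bug code:

```go
package main

import (
	"fmt"
	"sync"
)

// runAndWait launches fn in a goroutine, as the test above does, but
// blocks until it finishes so that test cleanup (which closes the
// cache) cannot race with it.
func runAndWait(fn func()) {
	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		fn()
	}()
	wg.Wait()
}

func main() {
	var steps []string
	runAndWait(func() {
		// stand-ins for the cache.Identities().Remove(...) and
		// cache.Bugs().Remove(...) calls in the test
		steps = append(steps, "identities removed", "bugs removed")
	})
	fmt.Println(steps) // safe to read: the goroutine has finished
}
```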

Point is, I don't think it's the problem that the CI found. It looks more like a race at the filesystem level. @smoyer64 speculated that it might be related to https://cs.opensource.google/go/go/+/refs/tags/go1.19.4:src/testing/testing.go;drc=38cfb3be9d486833456276777155980d1ec0823e;l=1229

@zinderic
Contributor Author

I checked the link, and the comments suggest something specific to Windows. But if it's specific to Windows, why does the test also fail on macOS when run enough times? Running 10 tests like this has about a 50-60% failure rate on my laptop:

go test -race -v -run TestCache -count=10 ./cache

In other words, 5 or 6 out of 10 runs fail with -count=10. Running them one at a time seems to produce fewer failures, but there is still a significant chance of failing. These observations were made on a MacBook Air M2, so I don't think this is specific to Windows.

@smoyer64
Collaborator

smoyer64 commented Dec 31, 2022

This might be related to #809 - running tests in parallel should work since the test environments are built individually and theoretically isolated (and use different temporary directories). If there's a concurrency issue with the code-under-test, that might explain a bit more. Intermittent failures are still occurring on the Windows build (one happened this morning) and this PR might pass if the check is rerun - I'll try that right now.

@MichaelMure MichaelMure force-pushed the feat-cleanup-subcommand branch 3 times, most recently from 7b8c9bc to 5e94d06 Compare January 4, 2023 14:11
@zinderic
Contributor Author

zinderic commented Jan 5, 2023

Looks like a different test fails now, TestGoGitRepo, but it fails in a similar way as before. This time I cannot reproduce the failure on macOS.

@smoyer64
Collaborator

smoyer64 commented Jan 5, 2023

I'll take a look at this after work today. If we track enough of these down, then t.Parallel() should eventually work (with the obvious exclusions; see t.Setenv(), etc.).

@smoyer64
Collaborator

smoyer64 commented Jan 7, 2023

Even the test that I thought I had fixed fails intermittently again. I wonder if there's something in the refactoring. At least the more recent failure gives a better reason (a locked file handle).

@MichaelMure
Owner

I wonder if there's something in the refactoring

The refactor and this PR add more file and folder deletions. I think that's why we see these issues more often now. Of course it's possible there is actually an issue in the code, but it's not obvious to me what it would be. It seems more likely to me that it's a variation of https://cs.opensource.google/go/go/+/refs/tags/go1.19.4:src/testing/testing.go;drc=38cfb3be9d486833456276777155980d1ec0823e;l=1229

@smoyer64
Collaborator

smoyer64 commented Jan 8, 2023

Once the test is fixed

Except we're not leaving files behind any more, and the error message is now about being denied access to a file. I guess I'm going to have to boot into my Windows partition and see what's actually happening.

@MichaelMure MichaelMure force-pushed the feat-cleanup-subcommand branch 2 times, most recently from 3907cfe to 658bc1e Compare January 11, 2023 13:54
@MichaelMure
Owner

Still happening:

    repo_cache_test.go:163: 
        	Error Trace:	/home/runner/work/git-bug/git-bug/cache/repo_cache_test.go:163
        	Error:      	Received unexpected error:
        	            	stat /tmp/TestCache162481194/001/.git/refs/bugs/fd954a37b9dc52742fbb2e2c60f62fc6ec0df85a2a2423b2b7ee5f1b8e4fad62: no such file or directory
        	Test:       	TestCache

@MichaelMure
Owner

That looks like an os.PathError or fs.PathError (they are the same type).

Theory:

  • TestCache creates a bunch of identities and bugs
  • TestCache calls Remove on one identity and one bug, which in turn calls repo.RemoveRef(ref) and GoGitRepo.RemoveRef <-- let's assume this deletion is not immediate
  • TestCache calls RemoveAll, which builds the list of refs to remove with ListLocalIds (which calls GoGitRepo.ListRefs), and later calls GoGitRepo.RemoveRef to remove them <-- this is where things fail

If go-git is somehow racing on the listing and removal of git refs, it's possible that RemoveAll gets a ref that is already being removed and later fails when trying to delete it. go-git is subject to races in other places, so that wouldn't be unheard of.

Another issue with go-git is that removing a ref should ideally be idempotent (i.e., not fail if the ref is already gone), but that doesn't seem to be the case.

@smoyer64
Collaborator

Interesting ... and that sounds a lot like what I was going to go spelunking for in go-git. I think something is racing, and I was going to put an xx ms delay before the call just to see if maybe the OS is still in the process of closing the file when the test tries to delete it.

@MichaelMure
Owner

I found the issue. Turns out it was not at all in the place I expected: go-git/go-git#659

@MichaelMure
Owner

go-git/go-git#659 might not be perfect, as it triggered the CI there, but it does solve our problem, a problem which isn't related to this PR, so I'll merge and we can follow up with go-git later.

@MichaelMure MichaelMure merged commit 98eb1c1 into MichaelMure:master Jan 14, 2023
@zinderic zinderic deleted the feat-cleanup-subcommand branch January 16, 2023 12:25