Checkout bricks a self-hosted runner and cannot recover #1148

Open
kvanbere opened this issue Jan 30, 2023 · 14 comments

@kvanbere

kvanbere commented Jan 30, 2023

Something went wrong, and all of our self-hosted runners either checked out bad .git folders or somehow corrupted them. It happened on about 13 of our runners at the same time. I think it was a one-off occurrence, because once I manually logged in and deleted the repository folder, everything was fine again.

Here are our logs:

2023-01-30T02:56:34.9249114Z Waiting for a runner to pick up this job...
2023-01-30T04:54:24.3969588Z Job is about to start running on the runner: XXXXXXXXXXXXXXXXXXXXXXXX (organization)
2023-01-30T04:54:29.3070556Z Current runner version: '2.301.1'
2023-01-30T04:54:29.3077744Z Runner name: 'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX'
2023-01-30T04:54:29.3078128Z Runner group name: 'Default'
2023-01-30T04:54:29.3078642Z Machine name: 'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX'
2023-01-30T04:54:29.3080746Z ##[group]GITHUB_TOKEN Permissions
2023-01-30T04:54:29.3081343Z Actions: write
2023-01-30T04:54:29.3081520Z Checks: write
2023-01-30T04:54:29.3081693Z Contents: write
2023-01-30T04:54:29.3081906Z Deployments: write
2023-01-30T04:54:29.3082186Z Discussions: write
2023-01-30T04:54:29.3082429Z Issues: write
2023-01-30T04:54:29.3082608Z Metadata: read
2023-01-30T04:54:29.3082779Z Packages: write
2023-01-30T04:54:29.3082958Z Pages: write
2023-01-30T04:54:29.3083147Z PullRequests: write
2023-01-30T04:54:29.3083476Z RepositoryProjects: write
2023-01-30T04:54:29.3083696Z SecurityEvents: write
2023-01-30T04:54:29.3083888Z Statuses: write
2023-01-30T04:54:29.3084056Z ##[endgroup]
2023-01-30T04:54:29.3087171Z Secret source: Actions
2023-01-30T04:54:29.3087569Z Prepare workflow directory
2023-01-30T04:54:29.4388409Z Prepare all required actions
2023-01-30T04:54:29.4550014Z Getting action download info
2023-01-30T04:54:29.8524043Z Download action repository 'actions/checkout@v3' (SHA:ac593985615ec2ede58e132d2e21d2b1cbd6127c)
2023-01-30T04:54:30.9083915Z Complete job name: XXXXXXXXXXXXXXXXXXXXXXXX
2023-01-30T04:54:31.0985565Z ##[group]Run actions/checkout@v3
2023-01-30T04:54:31.0985877Z with:
2023-01-30T04:54:31.0986059Z   repository: XXXXXXXX/XXXXXXXX
2023-01-30T04:54:31.0986462Z   token: ***
2023-01-30T04:54:31.0986609Z   ssh-strict: true
2023-01-30T04:54:31.0986786Z   persist-credentials: true
2023-01-30T04:54:31.0986951Z   clean: true
2023-01-30T04:54:31.0987092Z   fetch-depth: 1
2023-01-30T04:54:31.0987234Z   lfs: false
2023-01-30T04:54:31.0987377Z   submodules: false
2023-01-30T04:54:31.0987547Z   set-safe-directory: true
2023-01-30T04:54:31.0987702Z env:
2023-01-30T04:54:31.0987887Z   TMP: C:\runner\e595c9b9\_work\XXXXXXXX\XXXXXXXX/.temp
2023-01-30T04:54:31.0988151Z   TEMP: C:\runner\e595c9b9\_work\XXXXXXXX\XXXXXXXX/.temp
2023-01-30T04:54:31.0988398Z   TMPDIR: C:\runner\e595c9b9\_work\XXXXXXXX\XXXXXXXX/.temp
2023-01-30T04:54:31.0988665Z   MATLAB_PREFDIR: C:\runner\e595c9b9\_work\XXXXXXXX\XXXXXXXX/.preferences
2023-01-30T04:54:31.0988870Z ##[endgroup]
2023-01-30T04:54:34.6968863Z Syncing repository: XXXXXXXX/XXXXXXXX
2023-01-30T04:54:34.6970512Z ##[group]Getting Git version info
2023-01-30T04:54:34.6970936Z Working directory is 'C:\runner\e595c9b9\_work\XXXXXXXX\XXXXXXXX'
2023-01-30T04:54:34.6971402Z [command]"C:\Program Files\Git\cmd\git.exe" version
2023-01-30T04:54:34.7493487Z git version 2.36.1.windows.1
2023-01-30T04:54:34.7592122Z ##[endgroup]
2023-01-30T04:54:34.7607048Z Temporarily overriding HOME='C:\runner\e595c9b9\_work\_temp\bcafa367-f8cb-4d31-84b1-63d10aaaabed' before making global git config changes
2023-01-30T04:54:34.7607516Z Adding repository directory to the temporary git global config as a safe directory
2023-01-30T04:54:34.7608114Z [command]"C:\Program Files\Git\cmd\git.exe" config --global --add safe.directory C:\runner\e595c9b9\_work\XXXXXXXX\XXXXXXXX
2023-01-30T04:54:34.8483251Z [command]"C:\Program Files\Git\cmd\git.exe" config --local --get remote.origin.url
2023-01-30T04:54:34.8992096Z ##[error]fatal: --local can only be used inside a git repository
2023-01-30T04:54:34.9013542Z Deleting the contents of 'C:\runner\e595c9b9\_work\XXXXXXXX\XXXXXXXX'
2023-01-30T04:54:35.0573716Z ##[error]EPERM: operation not permitted, unlink 'C:\runner\e595c9b9\_work\XXXXXXXX\XXXXXXXX\.git'
2023-01-30T04:54:35.4710729Z Post job cleanup.
2023-01-30T04:54:38.8875206Z Cleaning up orphan processes

In this case, checkout seems to bail out fatally: after the error fatal: --local can only be used inside a git repository, the Actions run ends immediately with a failure and does not try to continue.

This effectively bricked the runner, because every job the bad runner picked up failed instantly. Worse, the bad runner took all the jobs in the queue and failed each of them almost immediately, which unfortunately made quite a mess of our job history.

Since the fix was simply to log in and delete the offending folder, it would be nice if checkout automatically deleted the folder and retried once.

It seems like it tried this:

2023-01-30T04:54:34.9013542Z Deleting the contents of 'C:\runner\e595c9b9\_work\XXXXXXXX\XXXXXXXX'
2023-01-30T04:54:35.0573716Z ##[error]EPERM: operation not permitted, unlink 'C:\runner\e595c9b9\_work\XXXXXXXX\XXXXXXXX\.git'

I am not sure why that didn't work, since I was able to log in as the same user and rm the folder without any problem. In any case, all 13 runners failed to delete the folder automatically.

To reproduce, I would suggest:

  • Install a self-hosted runner on Windows Server 2022, running as a service under a non-admin service user (e.g. Bob)
  • Set up an action that checks out the repository
  • Manually corrupt the .git folder by adding extra random files to it (?)
  • Ensure git config --local --get remote.origin.url fails
  • Observe that subsequent jobs acquired by this runner fail instantly and that the runner fails to recover
@kvanbere
Author

kvanbere commented Jan 30, 2023

Depending on how this is addressed, it could also fix other issues, e.g. #933, since that submodule corruption issue is also fixed by deleting the repo and letting the runner do a fresh clone ( #988 (comment) ).

For example, as a broad workaround, checkout could give up on reusing the existing git repository if any command fails, then delete it and check out the repository from scratch.
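
Until something like that exists in checkout itself, the same idea can be approximated in the workflow. The steps below are only a sketch under a few assumptions: the step name is made up, bash (e.g. from Git for Windows) is available on the runner, and the probe is the same git config command that fails in the log above. The probe may also fail for reasons other than corruption, in which case the only cost should be a fresh clone.

```yaml
steps:
  # Hypothetical pre-flight step, not part of actions/checkout.
  - name: Reset workspace if the existing clone is unusable
    shell: bash
    # Run from outside the workspace so the shell does not keep it open.
    working-directory: ${{ runner.temp }}
    run: |
      if [ -e "$GITHUB_WORKSPACE/.git" ]; then
        # Same probe that checkout runs in the log above.
        if ! git -C "$GITHUB_WORKSPACE" config --local --get remote.origin.url >/dev/null 2>&1; then
          echo "Existing clone looks broken; recreating $GITHUB_WORKSPACE"
          rm -rf "$GITHUB_WORKSPACE"
          mkdir -p "$GITHUB_WORKSPACE"
        fi
      fi

  - uses: actions/checkout@v3
```

If rm fails here for the same reason checkout's own delete failed (the EPERM above), this step fails too, so it does not replace a real fix in checkout.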

@olzhas

olzhas commented Mar 1, 2023

Some time ago a fix for this was introduced in #964, but it doesn't seem to solve the issue. I might be wrong.

@kvanbere
Author

kvanbere commented Mar 1, 2023

Some time ago a fix for this was introduced in #964, but it doesn't seem to solve the issue. I might be wrong.

We are using checkout v3 and this still seems to be an issue.

@tyteen4a03

Hi, also running into this issue.

@kvanbere
Author

kvanbere commented Apr 1, 2023

Does anyone have a workaround for this?

@jbaryy708

Hi.... how fix if runners please send me your txt....

@Ajaydip

Ajaydip commented Apr 1, 2023

I have been using the following workaround while waiting for the fix:

- name: checkout
  id: checkout
  uses: actions/checkout@v3
  with:
    ref: ${{ inputs.ref }}
    submodules: "recursive"
    token: ${{ secrets.token }}

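- name: cleanup runner workspace
  run: |
    # Wipe and recreate the workspace so the next job starts from a fresh clone.
    # Quoting keeps paths that contain spaces intact.
    echo "$GITHUB_WORKSPACE"
    rm -rf "$GITHUB_WORKSPACE"
    mkdir -p "$GITHUB_WORKSPACE"
  shell: bash
  if: ${{ failure() && steps.checkout.conclusion == 'failure' }}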

This at least prevents the runner from being bricked if checkout fails due to either a corrupted .git folder or bad submodules.

@kvanbere
Author

kvanbere commented Apr 3, 2023

Good workaround, thanks!

@kvanbere
Author

kvanbere commented Apr 3, 2023

I just wanted to add that I ran into this one today:

Warning: Unable to clean or reset the repository. The repository will be recreated instead.
Deleting the contents of 'C:\runner\31f270db\_work\aaaa\bbbb'
Error: File was unable to be removed Error: EBUSY: resource busy or locked, rmdir 'C:\runner\31f270db\_work\aaaa\bbbb\work'

It then went ahead and gobbled up all the remaining jobs in the entire queue and failed them all with the same error.

Edit: It seems the above is unrelated to the issue in the first post. This time a stray cc1plus process was holding a lock on a directory inside the git folder; it appeared to be stuck and was preventing git clean from running. I don't expect the checkout action to hunt down and kill processes, but I think I will fix this with a PowerShell script.
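
A rough sketch of what such a script could look like as a step ahead of checkout. The step name is made up, and the process name is simply the one from this incident. Note that it stops every cc1plus process on the machine, so it is probably only safe when the host runs a single runner.

```yaml
steps:
  # Hypothetical step: stop leftover compiler processes that may still
  # hold locks inside the workspace from an earlier job.
  - name: Stop stray cc1plus processes
    shell: powershell
    run: |
      # Both calls stay quiet when no matching process exists.
      Get-Process -Name cc1plus -ErrorAction SilentlyContinue |
        Stop-Process -Force -ErrorAction SilentlyContinue

  - uses: actions/checkout@v3
```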

@kvanbere
Author

Happened again in a big way today :(

@kvanbere
Author

kvanbere commented Aug 29, 2023

@Ajaydip I tried your workaround and it didn't work for me, it always skips the action?

  run-tests:
    name: xxxx
    runs-on: [self-hosted]
    timeout-minutes: 90
    strategy:
      fail-fast: false
      matrix:
        include: ${{fromJson(needs.scan-tests.outputs.matrix)}}
    steps:
      - uses: actions/checkout@v3
        id: checkout
        timeout-minutes: 10
        continue-on-error: true
      - name: Cleanup previously failed job
        run: |
          Remove-Item "${{env.GITHUB_WORKSPACE}}" -Force -Recurse -ErrorAction SilentlyContinue | Out-Null
          New-Item -ItemType Directory -Force -Path "${{env.GITHUB_WORKSPACE}}" | Out-Null
        if: ${{ steps.checkout.conclusion == 'failure' }}
      - uses: actions/checkout@v3
        if: ${{ steps.checkout.conclusion == 'failure' }}

I did modify it a little bit. I was hoping to be able to recover and run the rest of the pipeline unaffected, without having to put an if: ... on every step.

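Edit:
If you've done what I did above, you probably want to use outcome, not conclusion. With continue-on-error: true, a failed step still reports a conclusion of success, while outcome holds the result before continue-on-error is applied, so the conclusion check never matches and the step is skipped. See https://docs.github.com/en/actions/learn-github-actions/contexts#steps-context .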
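
For reference, here is a sketch of the steps above with that change applied. It makes two further adjustments that are assumptions on my part rather than something confirmed in this thread: the path is read from the GITHUB_WORKSPACE environment variable, since as far as I know the env context only holds variables defined in the workflow and so env.GITHUB_WORKSPACE would be empty, and the cleanup runs from runner.temp so the shell does not sit inside the folder it is deleting.

```yaml
steps:
  - uses: actions/checkout@v3
    id: checkout
    timeout-minutes: 10
    continue-on-error: true

  - name: Cleanup previously failed job
    # outcome is the result before continue-on-error is applied
    if: ${{ steps.checkout.outcome == 'failure' }}
    working-directory: ${{ runner.temp }}
    run: |
      Remove-Item -LiteralPath $env:GITHUB_WORKSPACE -Force -Recurse -ErrorAction SilentlyContinue
      New-Item -ItemType Directory -Force -Path $env:GITHUB_WORKSPACE | Out-Null

  # Retry the checkout only when the first attempt failed
  - uses: actions/checkout@v3
    if: ${{ steps.checkout.outcome == 'failure' }}
```

The run block has no shell: key, as in the original, so it relies on the runner's default shell being PowerShell on Windows.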

@bryanjtc

bryanjtc commented Nov 9, 2023

Any update on this? Does anyone have a working workaround?

@kvanbere
Author

kvanbere commented Nov 9, 2023

@bryanjtc the workaround above works OK, just note my Edit about using ‘outcome’ not ‘conclusion’ for testing whether to retry.

7 participants
@tyteen4a03 @olzhas @kvanbere @bryanjtc @jbaryy708 @Ajaydip and others