dvc exp run: fails when used with github actions and CML docker container #385

Open
tasdomas opened this issue Oct 10, 2022 · 22 comments
Labels
C: ref Content of /doc/*-reference documentation Markdown files

@tasdomas
Contributor

Description

dvc get fails in one of our example repos with the error response:

ERROR: unexpected error - Repository not found at /__w/stale-model-example/stale-model-example/.git/

None of the directories in that path are symlinks.

Running the dvc get command interactively (in an ssh session via action-tmate) succeeds: the action completes without error.

Running with the -vv flag gives the following output:

2022-09-28 18:08:49,331 TRACE: Namespace(all_pipelines=False, cd='.', checkpoint_resume=None, cmd='run', cprofile=False, cprofile_dump=None, downstream=False, dry=False, force=False, force_downstream=False, func=<class 'dvc.commands.experiments.run.CmdExperimentsRun'>, instrument=False, instrument_open=False, interactive=False, jobs=1, machine=None, metrics=False, name=None, parser=DvcParser(prog='dvc', usage=None, description='Data Version Control', formatter_class=<class 'argparse.RawTextHelpFormatter'>, conflict_handler='error', add_help=False), pdb=False, pipeline=False, pull=False, queue=False, quiet=0, recursive=False, reset=False, run_all=False, set_param=[], single_item=False, targets=[], tmp_dir=False, verbose=2, version=None, viztracer=False, viztracer_depth=None, yappi=False)
2022-09-28 18:08:49,709 ERROR: unexpected error - Repository not found at /__w/stale-model-example/stale-model-example/.git/
------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/dvc/cli/__init__.py", line 185, in main
    ret = cmd.do_run()
  File "/usr/local/lib/python3.8/dist-packages/dvc/cli/command.py", line 22, in do_run
    return self.run()
  File "/usr/local/lib/python3.8/dist-packages/dvc/commands/experiments/run.py", line 32, in run
    results = self.repo.experiments.run(
  File "/usr/local/lib/python3.8/dist-packages/dvc/repo/experiments/__init__.py", line 521, in run
    return run(self.repo, *args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/dvc/repo/__init__.py", line 49, in wrapper
    return f(repo, *args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/dvc/repo/experiments/run.py", line 60, in run
    return repo.experiments.reproduce_one(
  File "/usr/local/lib/python3.8/dist-packages/dvc/repo/experiments/__init__.py", line 119, in reproduce_one
    self.queue_one(exp_queue, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/dvc/repo/experiments/__init__.py", line 156, in queue_one
    return self.new(
  File "/usr/local/lib/python3.8/dist-packages/dvc/repo/experiments/utils.py", line 40, in wrapper
    return f(exp, *args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/dvc/repo/experiments/__init__.py", line 269, in new
    return queue.put(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/dvc/repo/experiments/queue/workspace.py", line 25, in put
    return self._stash_exp(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/dvc/repo/experiments/queue/base.py", line 315, in _stash_exp
    with self.scm.detach_head(client="dvc") as orig_head:
  File "/usr/lib/python3.8/contextlib.py", line 113, in __enter__
    return next(self.gen)
  File "/usr/local/lib/python3.8/dist-packages/scmrepo/git/__init__.py", line 404, in detach_head
    self.checkout(rev, detach=True, force=force)
  File "/usr/local/lib/python3.8/dist-packages/scmrepo/git/__init__.py", line 283, in _backend_func
    for key, backend in self.backends.items():
  File "/usr/lib/python3.8/_collections_abc.py", line 744, in __iter__
    yield (key, self._mapping[key])
  File "/usr/local/lib/python3.8/dist-packages/scmrepo/git/__init__.py", line 49, in __getitem__
    initialized = backend(*self.args, **self.kwargs)
  File "/usr/local/lib/python3.8/dist-packages/scmrepo/git/backend/pygit2.py", line 97, in __init__
    self.repo = pygit2.Repository(path)
  File "/usr/local/lib/python3.8/dist-packages/pygit2/repository.py", line 1620, in __init__
    path_backend = init_file_backend(path, flags)
_pygit2.GitError: Repository not found at /__w/stale-model-example/stale-model-example/.git/
------------------------------------------------------------
2022-09-28 18:08:50,051 DEBUG: Version info for developers:
DVC version: 2.28.0 (pip)
---------------------------------
Platform: Python 3.8.10 on Linux-5.15.0-1020-azure-x86_64-with-glibc2.29
Subprojects:
        dvc_data = 0.13.0
        dvc_objects = 0.5.0
        dvc_render = 0.0.11
        dvc_task = 0.1.2
        dvclive = 0.11.0
        scmrepo = 0.1.1
Supports:
        gs (gcsfs = 2022.8.2),
        http (aiohttp = 3.8.3, aiohttp-retry = 2.8.3),
        https (aiohttp = 3.8.3, aiohttp-retry = 2.8.3),
        s3 (s3fs = 2022.8.2, boto3 = 1.24.59)
Cache types: <https://error.dvc.org/no-dvc-cache>
Caches: local
Remotes: s3, gs
Workspace directory: ext4 on /dev/root
Repo: dvc, git

Having any troubles? Hit us up at https://dvc.org/support, we are always happy to help!
2022-09-28 18:08:50,053 DEBUG: Analytics is enabled.
2022-09-28 18:08:50,117 DEBUG: Trying to spawn '['daemon', '-q', 'analytics', '/tmp/tmp3gfjynwi']'
2022-09-28 18:08:50,119 DEBUG: Spawned '['daemon', '-q', 'analytics', '/tmp/tmp3gfjynwi']'

Reproduce

The only way of reproducing this that I've found is via github actions.

Expected

The dvc get command should download the requested data.

Environment information

Output of dvc doctor:

DVC version: 2.30.0 (pip)
---------------------------------
Platform: Python 3.8.10 on Linux-5.15.0-1020-azure-x86_64-with-glibc2.29
Subprojects:
	dvc_data = 0.17.1
	dvc_objects = 0.7.0
	dvc_render = 0.0.12
	dvc_task = 0.1.3
	dvclive = 0.12.0
	scmrepo = 0.1.1
Supports:
	gs (gcsfs = 2022.8.2),
	http (aiohttp = 3.8.3, aiohttp-retry = 2.8.3),
	https (aiohttp = 3.8.3, aiohttp-retry = 2.8.3),
	s3 (s3fs = 2022.8.2, boto3 = 1.24.59)
Cache types: <https://error.dvc.org/no-dvc-cache>
Caches: local
Remotes: s3, gs
Workspace directory: ext4 on /dev/root
Repo: dvc, git
@pmrowla

pmrowla commented Oct 10, 2022

It looks like exp run is what's failing, not dvc get

@pmrowla

pmrowla commented Oct 10, 2022

The issue is probably due to how actions/checkout works:

When Git 2.18 or higher is not in your PATH, falls back to the REST API to download the files.

(see https://github.com/actions/checkout#checkout-v3)

I'm guessing that the CML docker image doesn't include git 2.18 or newer. If so, using actions/checkout inside that docker container will just download the source using the github web API and will not create an actual git repo in the workspace. dvc exp run will then fail, since it needs an actual git repo.
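
The fallback condition quoted above can be checked from inside the container. This is a minimal Python sketch, assuming only the 2.18 threshold quoted from the actions/checkout README. The helper names `parse_git_version` and `git_on_path_is_new_enough` are made up for illustration and are not part of actions/checkout, DVC or CML.

```python
import re
import shutil
import subprocess
from typing import Optional, Tuple

MIN_GIT = (2, 18)  # threshold quoted from the actions/checkout README


def parse_git_version(output: str) -> Optional[Tuple[int, ...]]:
    """Extract the numeric version from `git --version` output."""
    match = re.search(r"git version (\d+)\.(\d+)(?:\.(\d+))?", output)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups() if part is not None)


def git_on_path_is_new_enough() -> bool:
    """Return True if a git binary on PATH reports a version >= MIN_GIT."""
    git = shutil.which("git")
    if git is None:
        return False  # no git on PATH at all
    result = subprocess.run(
        [git, "--version"], capture_output=True, text=True, check=False
    )
    version = parse_git_version(result.stdout)
    return version is not None and version[:2] >= MIN_GIT
```

Run as a workflow step rather than in a separate ssh session, this would show whether the PATH seen by the job contains a suitable git.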

@pmrowla pmrowla changed the title: dvc get: fails in github workflow with "Repository not found" → exp run: fails when used with github actions actions/checkout and CML docker container (Oct 10, 2022)
@tasdomas
Contributor Author

tasdomas commented Oct 10, 2022

The cml docker image includes git==2.38 and the git repository is created normally.

Including additional debug information:

git version

+ git -v
git version 2.38.0

Directory contents

+ ls -alh
total 608K
drwxr-xr-x 9 1001  121 4.0K Oct 10 12:56 .
drwxr-xr-x 3 1001  121 4.0K Oct 10 12:56 ..
drwxr-xr-x 2 root root 4.0K Oct 10 12:56 data
drwxr-xr-x 4 root root 4.0K Oct 10 12:58 .dvc
-rw-r--r-- 1 root root  138 Oct 10 12:56 .dvcignore
-rw-r--r-- 1 root root 1.2K Oct 10 12:56 dvc.lock
-rw-r--r-- 1 root root  782 Oct 10 12:56 dvc.yaml
drwxr-xr-x 8 root root 4.0K Oct 10 12:56 .git
drwxr-xr-x 3 root root 4.0K Oct 10 12:56 .github
-rw-r--r-- 1 root root   73 Oct 10 12:56 .gitignore
drwxr-xr-x 2 root root 4.0K Oct 10 12:56 notebooks
-rw-r--r-- 1 root root   52 Oct 10 12:56 params.yaml
-rw-r--r-- 1 root root 130K Oct 10 12:56 prc.json
-rw-r--r-- 1 root root  603 Oct 10 12:56 README.md
drwxr-xr-x 2 root root 4.0K Oct 10 12:56 reports
-rw-r--r-- 1 root root  100 Oct 10 12:56 requirements.txt
-rw-r--r-- 1 root root 408K Oct 10 12:56 roc.json
-rw-r--r-- 1 root root   73 Oct 10 12:56 scores.json
drwxr-xr-x 2 root root 4.0K Oct 10 12:56 src

@pmrowla

pmrowla commented Oct 10, 2022

@tasdomas did you check that by running the container yourself (or via ssh into it), or did you check it inside the actions workflow? It matters because the workflow PATH may not be the same.

@pmrowla

pmrowla commented Oct 10, 2022

This issue is specific to using the CML container

import os

import pygit2

pygit2.Repository(os.getcwd())

succeeds in regular GHA ubuntu-latest workflows but fails when using the CML container

see: https://github.com/pmrowla/gha-test/actions

@pmrowla

pmrowla commented Oct 10, 2022

The problem is that in the CML container the github workspace is created & owned by root but the actions are run as a different user. For git to work properly, you have to configure safe.directory (https://git-scm.com/docs/git-config/2.35.2#Documentation/git-config.txt-safedirectory) to include your workspace path; otherwise both regular git and libgit2 will refuse to open the repository. In actions/checkout, they explicitly set this config var using a temporary global config (which only exists for the duration of the actions/checkout script step).

If you add

/usr/bin/git config --global --add safe.directory "$GITHUB_WORKSPACE"

to your workflow run (anywhere before dvc exp run) it should work as expected.

This is really something that should probably be handled automatically in the CML container (perhaps via the --system level git config for the entire container).

see: https://github.com/pmrowla/gha-test/actions/runs/3219831064/jobs/5265728412
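
For workflows that drive DVC from a Python script, the same workaround could be applied from Python. This is a rough sketch, not how actions/checkout or CML implement it. The function names are hypothetical, it calls whichever `git` is first on PATH rather than `/usr/bin/git`, and the "already listed" check is a simplified string comparison rather than Git's own matching logic.

```python
import os
import subprocess
from typing import List


def safe_directory_command(workspace: str) -> List[str]:
    """Build the `git config` call suggested in the comment above."""
    return ["git", "config", "--global", "--add", "safe.directory", workspace]


def is_marked_safe(config_output: str, workspace: str) -> bool:
    """Check `git config --global --get-all safe.directory` output.

    Simplified: compares normalized path strings; a "*" entry means
    the ownership check is disabled for every directory.
    """
    entries = [line.strip() for line in config_output.splitlines() if line.strip()]
    target = os.path.normpath(workspace)
    return any(e == "*" or os.path.normpath(e) == target for e in entries)


def mark_workspace_safe(workspace: str) -> None:
    """Register the workspace as safe unless it is already listed."""
    listed = subprocess.run(
        ["git", "config", "--global", "--get-all", "safe.directory"],
        capture_output=True,
        text=True,
        check=False,  # git exits non-zero when the key is not set
    )
    if not is_marked_safe(listed.stdout, workspace):
        subprocess.run(safe_directory_command(workspace), check=True)
```

In a GitHub Actions job this would be called as `mark_workspace_safe(os.environ["GITHUB_WORKSPACE"])` before invoking dvc exp run.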

@pmrowla

pmrowla commented Oct 10, 2022

For reference, dulwich does not do the user permission/safe.directory check at all, which is why (almost) everything else in DVC works in the CML container other than exp run (which gets into some of the pygit2-only scmrepo git behavior).

@pmrowla

pmrowla commented Oct 10, 2022

Looking at it again I think exp run may actually fail unless you run it as root (with sudo) in this scenario (even if using safe.directory lets us open the repo in pygit). exp run needs to create new git commits and refs in .git/, but if it's owned by root and DVC is running as a different user in the workflow you will eventually hit a permissions error.

I haven't tested this, so I'm not sure what the behavior inside custom containers and GHA is for git write operations. But if this is the case (and we can't write commits) this is also something that would need to be addressed on the CML side. (Either users need to use sudo in GHA workflows or the workspace needs to be chown'd to the proper user in the container)
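
Since the write-permission question is left untested here, a quick probe could help settle it. This is a small Python sketch with a hypothetical helper name, `git_dir_writable`. Note that `os.access` checks against the real uid/gid, and a process running as root will normally pass regardless of file ownership, so it only approximates what git would run into when creating commits and refs.

```python
import os


def git_dir_writable(workspace: str) -> bool:
    """Return True if <workspace>/.git exists and the current user may
    create entries inside it (write + search permission on the directory)."""
    git_dir = os.path.join(workspace, ".git")
    return os.path.isdir(git_dir) and os.access(git_dir, os.W_OK | os.X_OK)
```

Printing `git_dir_writable(os.getcwd())` in a workflow step right before dvc exp run would show whether the permissions error described above is likely to occur.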

@tasdomas
Contributor Author

whoami indicates the action is run as root

@pmrowla

pmrowla commented Oct 10, 2022

@tasdomas it's an interaction with how the GHA workspace is mounted inside the docker container. Even though whoami reports root, the user permissions won't match from git's perspective unless you set safe.directory.

related: actions/runner#2033
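
The mismatch is visible in the directory listing posted earlier in this thread: `.` is owned by uid 1001, while whoami reports root. Below is a minimal Python sketch of that comparison, with a hypothetical helper name. It is a simplified approximation only; the actual ownership checks in Git and libgit2 cover more cases than a single uid comparison.

```python
import os


def ownership_report(path: str) -> dict:
    """Compare the owner of `path` with the uid of the current process."""
    owner_uid = os.stat(path).st_uid
    process_uid = os.geteuid()  # POSIX only
    return {
        "path": path,
        "owner_uid": owner_uid,
        "process_uid": process_uid,
        # True is the situation in which safe.directory becomes relevant
        "mismatch": owner_uid != process_uid,
    }
```

Calling `ownership_report(os.getcwd())` from a workflow step in the workspace root would be expected to report a mismatch in the setup described here.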

@pmrowla pmrowla changed the title: exp run: fails when used with github actions actions/checkout and CML docker container → exp run: fails when used with github actions and CML docker container (Oct 10, 2022)
@tasdomas
Contributor Author

Thanks @pmrowla - this looks like it's the cause.

@pmrowla pmrowla reopened this Oct 10, 2022
@pmrowla pmrowla transferred this issue from iterative/dvc Oct 10, 2022
@pmrowla

pmrowla commented Oct 10, 2022

reopening this and transferring to CML for visibility

@pmrowla pmrowla changed the title: exp run: fails when used with github actions and CML docker container → dvc exp run: fails when used with github actions and CML docker container (Oct 10, 2022)
@casperdcl casperdcl added the p0-critical Max priority (ASAP) label Oct 10, 2022
@dacbd
Contributor

dacbd commented Oct 12, 2022

I'm not sure whether this is a p0, or whether it is something fixable on our end.

It does not affect cml runner instances, because they already run as root, and that is where most of our examples (and the main purpose) for using the cml container lie.

We can't chown the directory for users either. I tried a few other things but came up dry...

@0x2b3bfa0
Member

0x2b3bfa0 commented Oct 12, 2022

Running any cml command (e.g. cml ci to fix the repository configuration) will also disable the safe.directory checks on Git, as per iterative/cml#970, iterative/cml#974 and iterative/cml#986.
https://github.com/iterative/cml/blob/c97e5481fb91932dc137da94de0a189d54bdc694/src/cml.js#L118-L120
https://github.com/iterative/cml/blob/c97e5481fb91932dc137da94de0a189d54bdc694/src/cml.js#L83-L116

@0x2b3bfa0
Member

Not to mention that the official checkout action does the same by default since actions/checkout#762 and actions/checkout#770:

  set-safe-directory:
    description: Add repository path as safe.directory for Git global config by running `git config --global --add safe.directory <path>`
    default: true

@0x2b3bfa0 0x2b3bfa0 removed the p0-critical Max priority (ASAP) label Oct 12, 2022
@tasdomas
Contributor Author

@0x2b3bfa0 the checkout action only applies this change during checkout and it does not persist

@dacbd
Contributor

dacbd commented Oct 12, 2022

It does seem the quickest solution is to add a cml ci --token ${{ github.token }} step.

@0x2b3bfa0
Member

> @0x2b3bfa0 the checkout action only applies this change during checkout and it does not persist

Thanks! 🤦🏼‍♂️

@pmrowla

pmrowla commented Oct 13, 2022

Does the setup-cml action run cml ci?

It seems like the only thing that needs to be done here is to document that using DVC in github actions with the CML docker image essentially requires running cml ci (or using the setup-cml action assuming it does the same thing). https://github.com/iterative/cml#using-cml-with-dvc

(and all of the iterative example repos should use cml ci/setup-cml as well)

@0x2b3bfa0
Member

> Does the setup-cml action run cml ci?

Err... no, the pull request was closed yesterday iterative/setup-cml#58 😅

@casperdcl
Contributor

so potentially an FAQ/known issue (missing cml ci)

@casperdcl casperdcl transferred this issue from iterative/cml Nov 18, 2022
@jorgeorpinel jorgeorpinel added documentation Markdown files C: ref Content of /doc/*-reference labels Nov 18, 2022