Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enable multiple revisions for project repos #4659

Draft
wants to merge 71 commits into
base: master
Choose a base branch
from

Conversation

marcodelapierre
Copy link
Member

@marcodelapierre marcodelapierre commented Jan 15, 2024

This PR adds support for handling local copies of multiple revisions of the same pipeline.

Key points:

  • Under NXF_ASSETS, each pipeline is now pulled as <org>/<repo>[:<revision>] ; :<revision> is only appended if the corresponding flag was used on CLI, -r/--revision
  • The key implementation change is adding the revision attribute to the AssetManager class, as both pipeline name and revision are now required to fully identify a pipeline
  • Support has been added and tested for all relevant CLI commands: run, pull, clone, drop, list, view, config, info, inspect, kuberun
  • Unit tests for AssetManagerTest have been updated

Caveats:

  • Jgit does not implement git worktree, so the original idea within Allow the concurrent run of multiple pipeline revisions #2870 could not be applied
  • depth = 1 (shallow clones) was investigated to reduce disk usage, but could not be implemented as it would not allow to checkout branches/tags/commits
  • In the current implementation, pulling without --revision and with --revision <default branch> create two duplicate pulls; this is not optimal, but with very limited known negative impact; [update] this only happens if the default branch is not declared in the manifest and differs from master
  • As a by-product, if a repo has a default branch different from master, pulling and running it is now possible without specifying the branch name explicitly

Closes #2870 .
Also indirectly addresses #3593

Signed-off-by: Dr Marco Claudio De La Pierre <marco.delapierre@gmail.com>
Copy link

netlify bot commented Jan 15, 2024

Deploy Preview for nextflow-docs-staging canceled.

Name Link
🔨 Latest commit 0cb331a
🔍 Latest deploy log https://app.netlify.com/sites/nextflow-docs-staging/deploys/66573878be6dc300083999d7

@marcodelapierre marcodelapierre self-assigned this Jan 15, 2024
@marcodelapierre marcodelapierre changed the title Multiple revisions 1st: AssetManager, Cmd Clone Pull Run Enable multiple revisions for SCMs Jan 15, 2024
@marcodelapierre marcodelapierre marked this pull request as draft January 15, 2024 12:39
@marcodelapierre marcodelapierre changed the title Enable multiple revisions for SCMs Enable multiple revisions for project repos Jan 15, 2024
@bentsherman
Copy link
Member

You could save disk space by cloning with depth = 0 when a revision is specified

@marcodelapierre
Copy link
Member Author

marcodelapierre commented Jan 16, 2024

You could save disk space by cloning with depth = 0 when a revision is specified

Thanks @bentsherman , great idea, will do, for nextflow pull but not for clone. Why not for any case, including no revision specified? - I just have to check that a subsequent nextflow pull still works (for the purpose of updating the local copy).

Signed-off-by: Dr Marco Claudio De La Pierre <marco.delapierre@gmail.com>
Signed-off-by: Dr Marco Claudio De La Pierre <marco.delapierre@gmail.com>
Signed-off-by: Dr Marco Claudio De La Pierre <marco.delapierre@gmail.com>
Signed-off-by: Dr Marco Claudio De La Pierre <marco.delapierre@gmail.com>
Signed-off-by: Dr Marco Claudio De La Pierre <marco.delapierre@gmail.com>
@bentsherman
Copy link
Member

yes I suppose we could just clone with depth = 0 in all cases, since no revision just defaults to master. the clone, pull, and run commands all provide a -deep option for this, so users can change it if they want

Signed-off-by: Dr Marco Claudio De La Pierre <marco.delapierre@gmail.com>
Signed-off-by: Dr Marco Claudio De La Pierre <marco.delapierre@gmail.com>
@marcodelapierre
Copy link
Member Author

Ah! unfortunately we cannot implement the deep functionality, as with a depth of 1 we cannot checkout to other branches/tags/commits.
This would be supported : git clone -b <branch/tag> --deep <deep> <repo> , but only works for branches and tags, not commits. Git still does not support a way to clone to a specific commit straight away.

Signed-off-by: Dr Marco Claudio De La Pierre <marco.delapierre@gmail.com>
Signed-off-by: Dr Marco Claudio De La Pierre <marco.delapierre@gmail.com>
Signed-off-by: Dr Marco Claudio De La Pierre <marco.delapierre@gmail.com>
Signed-off-by: Dr Marco Claudio De La Pierre <marco.delapierre@gmail.com>
Signed-off-by: Dr Marco Claudio De La Pierre <marco.delapierre@gmail.com>
Signed-off-by: Dr Marco Claudio De La Pierre <marco.delapierre@gmail.com>
… operation

Signed-off-by: Dr Marco Claudio De La Pierre <marco.delapierre@gmail.com>
…f "master"

Signed-off-by: Dr Marco Claudio De La Pierre <marco.delapierre@gmail.com>
@marcodelapierre marcodelapierre marked this pull request as ready for review January 18, 2024 08:11
Signed-off-by: Dr Marco Claudio De La Pierre <marco.delapierre@gmail.com>
Signed-off-by: Dr Marco Claudio De La Pierre <marco.delapierre@gmail.com>
Signed-off-by: Dr Marco Claudio De La Pierre <marco.delapierre@gmail.com>
@marcodelapierre
Copy link
Member Author

I have been studying and making experiments using a local bare repository, as a local intermediate between the remote repo and the various checked out revisions.

In my first round of attempts, I was trying to use the local bare repo as the source of truth for any local checkouts, in substitution for the remote repo. Essentially, the idea was:

  • clone the bare from the remote, only update when needed
  • otherwise, for most other operations use the local bare as the source of truth

This does not work at present, as the general assumption in AssetManager methods is that the "source of truth" has everything available, i.e. both repo information and file contents. However a bare repo lacks the latter.

Hence, I started an attempt of targeted usage of either bare or remote, depending on the required information. This not only is confusing, but currently fails in my tests, again because of assumptions that cross each other around the existence and contents of the various types of repos.

At the end of today:

To bring the effort to a more manageable and productive ground, I have decided to move forward with a radically simplified approach:

  • keep using the remote repo as much as possible
  • only use the local bare repository to map requested revisions into commit IDs

This approach, if workable, tackles the main goal of this round of PR revision: to locally use commits to store and run pipelines, to avoid clashes when updating repositories.

Signed-off-by: Dr Marco Claudio De La Pierre <marco.delapierre@gmail.com>
@marcodelapierre
Copy link
Member Author

marcodelapierre commented May 17, 2024

I have started to add localBarePath. Scope is:

  • to dereference branches/tags into commit IDs
  • to source the local git config file

Ok, I have fixed the chicken and egg at AssetManager instantiation, between defining localPath and devining the repository provider. Using localBarePath has been instrumental.

Next:

  • actual dereferentiation to commits
  • effective update of bare repo, on demand and transparent (i.e. only when download or run with latest)

To this end I have left some intermediate notes in the codebase, AssetManager.groovy.

@marcodelapierre
Copy link
Member Author

@pditommaso Any tips on how to get a commitID for a branch or tag, I am a taker.
Otherwise, I will go into it next week.

@marcodelapierre marcodelapierre marked this pull request as draft May 17, 2024 08:24
@pditommaso
Copy link
Member

Not sure how much I can help this week. @bentsherman you want to have look at this

Signed-off-by: Dr Marco Claudio De La Pierre <marco.delapierre@gmail.com>
@marcodelapierre
Copy link
Member Author

thanks Paolo. did good progress myself.

Signed-off-by: Dr Marco Claudio De La Pierre <marco.delapierre@gmail.com>
Signed-off-by: Dr Marco Claudio De La Pierre <marco.delapierre@gmail.com>
Signed-off-by: Dr Marco Claudio De La Pierre <marco.delapierre@gmail.com>
Signed-off-by: Dr Marco Claudio De La Pierre <marco.delapierre@gmail.com>
Signed-off-by: Dr Marco Claudio De La Pierre <marco.delapierre@gmail.com>
Signed-off-by: Dr Marco Claudio De La Pierre <marco.delapierre@gmail.com>
Signed-off-by: Dr Marco Claudio De La Pierre <marco.delapierre@gmail.com>
Signed-off-by: Dr Marco Claudio De La Pierre <marco.delapierre@gmail.com>
@marcodelapierre
Copy link
Member Author

Good progress today.

Notes on revision dereferencing (working)
- key attributes are localPath and commitId
- do it at instantiation
- if cmdpull/cmdrun, use updateLocalBareRepo() to update the attributes
- also use updateLocalBareRepo() if revision cannot be found locally

        Let's simplify and minimise usage of the local bare
        0. get local gitconfig
        1. dereference revisions (branches!/tags?) to commits
        2. use commits in localpath
        3. use commits in CLI Cmds
        4. use commits for download()
        5. use local bare path just to dereference locally without need for remote repo, and for local gitconfig
        6. also use local bare for download() of revisions/commits

@marcodelapierre
Copy link
Member Author

marcodelapierre commented May 27, 2024

Outstanding bits right now:

  1. In the current implementation, the AssetManager has been changed with an extra revision argument.

    • I know this is not ideal because it changes the class API e.g. in relation to Seqera Platform.
    • however, if we get rid of the extra argument, we will be forcibly in a situation where, at the time an Asset Manager is instantiated, we don't know the corresponding localPath of the pipeline, because the latter is defined by the commit ID, and hence the requested revision. This will still require some change of how Platform interacts with the AssetManager class.
  2. Updating of bare repo ideally would be specific to the requested revision, however Jgit by default sets it up to update ALL branches&tags

    • the rationale here is that we need a mechanism to update the local bare repo with respect to the remote. On one hand, this should be controlled by some user action, so as to maintain a revision <> commit correspondence locally (for increased reproducibility). On the other hand, ideally the update is as transparent as possible: I would avoid an extra option in the CLI. Hence, updating every time there is a pull or run -latest could be an option.
    • need to test something like this: git.fetch().setRefSpecs("+refs/heads/<branch>:refs/heads/<branch>")
    • what if a commit ID is requested though? which branch should be updated?
  3. The wrong revision/commit is printed at run time, as in "Launching ..."

    • also, no notice when remote branch is updated (probably related)
    • how is RevisionInfo used in the run algorithm?
    • shall we revise what information we print? I think the 2 key pieces of information are the user-requested revision and the corresponding commit ID. We are storing repos by commit ID, but this implies that the same pulled repo can belong to multiple branches/tags.
  4. Brittle algorithm at the moment:

    • instantiation: decode revision to commit ID (so as to define localPath)
    • if pull/run: update local bare -> can result in mismatch of revision vs commit
  5. Update mechanics of list/info commands (yet to be done)

Point 3. might actually have been already there with the original implementation - have to check. Issues arise when multiple branches/tags all have the same commits in their history. It is amplified in my tests, as I am using dummy pipeline repos.

Signed-off-by: Dr Marco Claudio De La Pierre <marco.delapierre@gmail.com>
Signed-off-by: Dr Marco Claudio De La Pierre <marco.delapierre@gmail.com>
Signed-off-by: Dr Marco Claudio De La Pierre <marco.delapierre@gmail.com>
@marcodelapierre
Copy link
Member Author

On point 2. above, wondering whether we should leverage user-provided information more, when it comes to printing out information on repository in use. I.e.:

  • commit ID as de-referenced from user request
  • (if not null) revision as requested by the user
  • (eventually, also infer branch)

Signed-off-by: Dr Marco Claudio De La Pierre <marco.delapierre@gmail.com>
Signed-off-by: Dr Marco Claudio De La Pierre <marco.delapierre@gmail.com>
@marcodelapierre
Copy link
Member Author

As dicussed in meeting, on point 1., let me re-implement with the original class signature for AssetManager - no revision attribute. Commit ID and hence local path will be set when the revision is set.

Note on available code so far:

  1. current branch for this PR is add/mult_revisions - impl with local bare repo, and commit id in local path
  2. branch add/mult_revisions_bkp15apr - first impl, uses revision in local path
  3. branch add/mult_revisions_bkp29may - backup of 1.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants