Add a dry-run option to dregsy #71

Open

wants to merge 1 commit into base: master
Conversation

daconstenla
Contributor

This code adds a new optional parameter, --dry-run, to dregsy's command line.
It allows easier debugging of tag matching for a given configuration, without syncing everything to the target Docker registry.

  • faster
  • simpler
  • cheaper
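A minimal sketch of such a flag, using Go's standard flag package. The flag-set name and helper function are assumptions for illustration, not dregsy's actual code:

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

// parseDryRun parses the command line and reports whether --dry-run was
// given. Hypothetical helper; dregsy's real flag handling may differ.
func parseDryRun(args []string) (bool, error) {
	fs := flag.NewFlagSet("dregsy", flag.ContinueOnError)
	dryRun := fs.Bool("dry-run", false,
		"only report which tags would be synced, without pushing anything")
	if err := fs.Parse(args); err != nil {
		return false, err
	}
	return *dryRun, nil
}

func main() {
	dry, err := parseDryRun(os.Args[1:])
	if err != nil {
		os.Exit(2)
	}
	if dry {
		fmt.Println("dry run: no changes will be made")
	}
}
```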

@xelalexv
Owner

xelalexv commented Jun 1, 2022

Great PR! Thanks! A dry-run is definitely a good idea.

Before going into review, could you please remove the commit for building on ARM64? This is a separate concern and should be addressed separately. I think it's best to first open an issue for that. I've been thinking about adding cross-compilation to the project recently, but would probably solve that by moving the build into hack/devenvutil, and just make parameterized calls to that from the Makefile.

Once the commit is removed, please squash the remaining two commits, I'll then review.

@daconstenla daconstenla force-pushed the add-dry-run-command branch 2 times, most recently from c47d16a to 4ed09b5 on June 1, 2022 17:05
@daconstenla
Contributor Author

Removed the M1-related changes and stuck to the dry-run feature.

While I was in the code, I included some additional information:

  • show tags from the target registry
  • show the difference (expanded tags) - (existing tags in target) ~ tags to be synced
  • show the difference (existing tags in target) - (expanded tags) ~ tags not synced with the current config

which might also help with finding problems and/or understanding the current state (and maybe with cleanup on the target).
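The two subtractions above boil down to a set difference over tag lists. A minimal sketch, with the helper name assumed:

```go
package main

import "fmt"

// tagDiff returns the tags in a that are not in b, preserving a's order.
func tagDiff(a, b []string) []string {
	seen := make(map[string]bool, len(b))
	for _, t := range b {
		seen[t] = true
	}
	var out []string
	for _, t := range a {
		if !seen[t] {
			out = append(out, t)
		}
	}
	return out
}

func main() {
	expanded := []string{"1.20", "1.21", "latest"}
	existing := []string{"1.20", "1.19"}
	fmt.Println(tagDiff(expanded, existing)) // tags that would be synced
	fmt.Println(tagDiff(existing, expanded)) // target tags not covered by the config
}
```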

@xelalexv
Owner

xelalexv commented Jun 1, 2022

I've just merged another PR, so you will have to rebase. Sorry about that! But there are not that many conflicts, so it shouldn't be too complicated.

Mea culpa

@daconstenla
Contributor Author

Saw it, so I just did it @xelalexv; the code is ready to be reviewed 😜

xelalexv added a commit that referenced this pull request Jun 2, 2022
@xelalexv
Owner

xelalexv commented Jun 2, 2022

I did a first review. Here are my comments:

  • I think we should push dry run handling into the relays. The code for expanding source tag lists is already present there, and auth & TLS verify work correctly. With the current implementation, getting tag lists may fail for certain registries, since certDir is not set.

  • Going into the relays would additionally exercise all the mechanics involved, just short circuit the actual sync. Have a look at the branch I pushed, it's a sketch of what I have in mind.

  • Target tag lists and result output could then be done via a helper object. This could actually be set on the relay instead of just a dry run flag.

  • When retrieving target tag lists, we need to consider the case where the target repo does not yet exist (we should not create it during a dry run). This is easy since all expanded source tags would then apply. We just need to handle it properly.

  • We could improve the output. The dry-run log output seems a bit hard to read, in particular when many tags are involved. Maybe we could create a JSON/YAML/diff file or whatever instead of writing to log. That could then also be processed with generic tools.

  • Make all interval tasks one-off. We may not want the dry-run to run indefinitely.
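For the missing-target-repo case, one way to handle it is to treat a not-yet-existing repo as an empty tag list, so every expanded source tag counts as "would be synced" and nothing gets created during the dry run. A sketch; the sentinel error and function names are assumptions:

```go
package main

import (
	"errors"
	"fmt"
)

// errRepoNotFound stands in for a registry answering 404 on the tag list;
// name and mechanism are assumptions for this sketch.
var errRepoNotFound = errors.New("repository not found")

// targetTagsForDryRun wraps a tag-list lookup so that a missing target
// repo simply yields an empty list: all expanded source tags then apply,
// and we never create the repo during the dry run.
func targetTagsForDryRun(list func() ([]string, error)) ([]string, error) {
	tags, err := list()
	if errors.Is(err, errRepoNotFound) {
		return nil, nil
	}
	return tags, err
}

func main() {
	// simulate a target repo that does not exist yet
	tags, err := targetTagsForDryRun(func() ([]string, error) {
		return nil, errRepoNotFound
	})
	fmt.Println(tags, err)
}
```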

@xelalexv
Owner

xelalexv commented Jun 3, 2022

I've just noticed that starting with Skopeo 1.8.0, the sync command which we use in dregsy also introduces a dry-run mode. I think we should take advantage of that once that Skopeo version is available via packages in Ubuntu and Alpine. This will have an impact on the design of this feature.

@xelalexv
Owner

xelalexv commented Jun 3, 2022

I've just switched to building our own Skopeo binary, and merged those changes. So we can now use latest Skopeo release and start experimenting with Skopeo's dry-run feature.

@daconstenla
Contributor Author

I've just switched to building our own Skopeo binary, and merged those changes. So we can now use latest Skopeo release and start experimenting with Skopeo's dry-run feature.

I'm going to rebase and give it a try, thanks @xelalexv 👌

@xelalexv
Owner

xelalexv commented Jun 3, 2022

I'll be away for the next two weeks, so no need to rush 😉

daconstenla pushed a commit to daconstenla/dregsy that referenced this pull request Jun 3, 2022
@daconstenla
Contributor Author

In my initial attempt to push --dry-run down to the relays:

On skopeo's relay, I tried skopeo's native --dry-run option, and the result is:

  • it requires using the sync subcommand instead of copy (it reminded me of Allow use of skopeo's sync command instead of copy #58)
  • it requires reformatting the parameters if we use sync
  • it is not as fast as the proposed solution, because it actually launches skopeo once per tag to be synced

On docker's relay:

  • there is no native --dry-run option available, since the relay uses docker push directly; I guess we could just print the command if we wanted to implement some kind of dry-run execution.

Conclusion: maybe we should delegate tag retrieval to the relays, but not the actual dry-run execution.

I'm going to try it out and will share more details; for now I'm pushing the current state, which uses the native dry-run option only in skopeo's relay.

@xelalexv
Owner

xelalexv commented Jun 3, 2022

Ah, sorry about that! I was jumping to conclusions here. I guess I was too thrilled to see a dry run option in the Skopeo release notes 🤣 But we use skopeo copy, not sync, so that doesn't help for now. So I agree: we just push dry run into the relays, short circuit copy for Skopeo and pull & push for Docker, and make the output available in some form.

daconstenla pushed a commit to daconstenla/dregsy that referenced this pull request Jun 7, 2022
@daconstenla
Contributor Author

Interestingly, while going over the relay code I found that the Docker relay depends on the Skopeo relay, because docker does not support listing the tags of a repository. That makes me wonder whether the relays are the right place for the dry run to happen. The biggest advantage is that it keeps things more flexible and adjustable at relay level, but the source of information is always skopeo's implementation 😌.

I wonder if we should implement (maybe outside of this PR) a custom relay function that lists images without the skopeo dependency, using plain HTTP requests to the registry:

# list repositories
curl 'http://localhost:5000/v2/_catalog'
{"repositories":["mirror/nginx","nginx"]}

# list tags for nginx
curl 'http://localhost:5000/v2/nginx/tags/list'
{"name":"nginx","tags":["latest"]}

# list tags for mirror/nginx
curl 'http://localhost:5000/v2/mirror/nginx/tags/list'
{"name":"mirror/nginx","tags":["1.21.1-perl","1.20-perl"]}
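For illustration, the same tags/list call could be made from Go without the skopeo dependency. This is only a sketch of hitting the Registry V2 endpoint shown above: it ignores auth, TLS, and Link-header pagination, and the function names are assumptions:

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"net/http"
)

type tagList struct {
	Name string   `json:"name"`
	Tags []string `json:"tags"`
}

// parseTagList decodes a /v2/<repo>/tags/list response body.
func parseTagList(data []byte) ([]string, error) {
	var tl tagList
	if err := json.Unmarshal(data, &tl); err != nil {
		return nil, err
	}
	return tl.Tags, nil
}

// listTags fetches the tag list for a repo from a V2 registry, e.g.
// listTags("http://localhost:5000", "mirror/nginx"). A real relay would
// also need auth, TLS verification, and paging via the Link header.
func listTags(registry, repo string) ([]string, error) {
	resp, err := http.Get(fmt.Sprintf("%s/v2/%s/tags/list", registry, repo))
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("registry returned %s", resp.Status)
	}
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return nil, err
	}
	return parseTagList(body)
}

func main() {
	if tags, err := listTags("http://localhost:5000", "nginx"); err == nil {
		fmt.Println(tags)
	}
}
```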

@daconstenla
Contributor Author

daconstenla commented Jun 7, 2022

⚠️ I just realised the dry-run output generated a single file with only the latest task, a newbie mistake 🤦
Adjusted: now one file per task * mapping

daconstenla pushed a commit to daconstenla/dregsy that referenced this pull request Jun 16, 2022
@daconstenla
Contributor Author

@xelalexv I've just rebased the code, could we take a look at the current state? Remarks? Changes?

@daconstenla
Contributor Author

Just removed all the changes not related to the task at hand and squashed them into a single commit 👌

@daconstenla daconstenla force-pushed the add-dry-run-command branch 2 times, most recently from 40fdc27 to 49a2cd8 on June 20, 2022 12:18
@daconstenla
Contributor Author

Just rebased it again 👌

@xelalexv
Owner

I'm preparing a release just now, so you'll unfortunately have to rebase once more, but it should be trivial. With the release out of the way, we can continue the review of this PR.

and obtain the final list of labels that'll be synced as well as tags that
are in the target repo and are not updated based on the current configuration
@daconstenla
Contributor Author

Rebased once more 👌

@xelalexv
Owner

First my apologies for the really long delay! Right after my vacation, I was totally swamped with work.

Now for the PR. It's great progress! Here are a few comments on things we could improve further:

  1. Since I recently changed the relay.Sync(...) interface to take a SyncOptions struct, it's maybe cleaner now to not persist the dry run flag in relay instances, but add it to the SyncOptions when calling relay.Sync().

  2. I'm not quite sure about dumping the dry run result into separate output files. I'd prefer a single output. Also, the instantiation of the map you're passing to util.DumpMapAsJson is the same in both relays. Since it's quite large, and the keys contained in the map may change as we develop dry run further, the code duplication is a concern. I think it's justifiable to create a dedicated type for dry run result collection, which could be passed in the sync options, and could also function as the dry run flag (see above).

  3. We should also maintain a hierarchy in the collected data, e.g. instead of saying:

    {
        "task name": "task 1",
        "task index": "0",
        ...
    }

    we could say:

    {
        "task": {
            "name": "task 1",
            "index": "0"
        },
        ...
    }

    I think this could make post-processing of dry run results a lot easier.

  4. It should be possible to choose between YAML & JSON for output.

  5. Have you tried this with source/target registries that require authentication?

@daconstenla
Contributor Author

Thanks for the improvement suggestions @xelalexv.

  1. Since I recently changed the relay.Sync(...) interface to take a SyncOptions struct, it's maybe cleaner now to not persist the dry run flag in relay instances, but add it to the SyncOptions when calling relay.Sync().

Does this mean we expect dry-run to be configured on task level? I mean SyncOptions are on task level, right? 🤔

  2. I'm not quite sure about dumping the dry run result into separate output files. I'd prefer a single output. Also, the instantiation of the map you're passing to util.DumpMapAsJson is the same in both relays. Since it's quite large, and the keys contained in the map may change as we develop dry run further, the code duplication is a concern. I think it's justifiable to create a dedicated type for dry run result collection, which could be passed in the sync options, and could also function as the dry run flag (see above).
  3. We should also maintain a hierarchy in the collected data, e.g. instead of saying:

Got it, let me try to do so.

  4. It should be possible to choose between YAML & JSON for output.

Could it be a further iteration?

  5. Have you tried this with source/target registries that require authentication?

Yes, I did, and it has worked so far. In fact, I've been using this development version for dry-run validations for the last few weeks.

@xelalexv
Owner

  1. Since I recently changed the relay.Sync(...) interface to take a SyncOptions struct, it's maybe cleaner now to not persist the dry run flag in relay instances, but add it to the SyncOptions when calling relay.Sync().

Does this mean we expect dry-run to be configured on task level? I mean SyncOptions are on task level, right?

Hadn't thought about that. Right now it's just about persisting the dry-run flag cleanly in only one place. Passing this then each time via the SyncOptions would open up a way to do per task dry-runs, amongst other things. Whether per task dry-run is meaningful is a different story.

  4. It should be possible to choose between YAML & JSON for output.

could it be a further iteration?

Of course. We can start with JSON, and then extend later. We just need to keep this in mind during initial implementation.

  5. Have you tried this with source/target registries that require authentication?

Yes, I did, and it has worked so far. In fact, I've been using this development version for dry-run validations for the last few weeks.

Great. I'll try myself later on.

@daconstenla
Contributor Author

I'm not quite sure about dumping the dry run result into separate output files. I'd prefer a single output. Also, the instantiation of the map you're passing to util.DumpMapAsJson is the same in both relays. Since it's quite large, and the keys contained in the map may change as we develop dry run further, the code duplication is a concern. I think it's justifiable to create a dedicated type for dry run result collection, which could be passed in the sync options, and could also function as the dry run flag (see above).
We should also maintain a hierarchy in the collected data, e.g. instead of saying:

I'm trying to think about how to do this, and now I remember why I generated one file per task. Right now we call s.relay.Sync(...) once per task, which means that inside the function we only know about the task currently being executed. If we wanted to generate a map with all tasks' results, we'd need to return the --dry-run results back to ./internal/pkg/sync/sync.go and then write all the data at once.

Ideas? comments?

@xelalexv
Owner

... I think it's justifiable to create a dedicated type for dry run result collection, which could be passed in the sync options, and could also function as the dry run flag (see above)...

I'm trying to think about how to do this, and now I remember why I generated one file per task. Right now we call s.relay.Sync(...) once per task, which means that inside the function we only know about the task currently being executed. If we wanted to generate a map with all tasks' results, we'd need to return the --dry-run results back to ./internal/pkg/sync/sync.go and then write all the data at once.

Ideas? comments?

In the simplest form the collector could for example just be a map that we create at the start of a dry-run, and which we set in SyncOptions each time we sync a task. Each task then adds its dry-run output under its key. When all tasks are done, we can post-process the collected outputs. We could take this further by making this map a type and attach the post-processing functionality to it.
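That minimal map-based collector could look roughly like this; the SyncOptions field and function names are assumptions about dregsy's interface, not its actual code:

```go
package main

import "fmt"

// SyncOptions sketch: the collector map travels with every Sync call.
type SyncOptions struct {
	DryRun map[string][]string // nil means: this is not a dry run
}

// syncTask short-circuits the real sync when a collector is present and
// records the tags that would have been synced under the task's key.
func syncTask(name string, wouldSync []string, opts *SyncOptions) {
	if opts.DryRun != nil {
		opts.DryRun[name] = wouldSync
		return
	}
	// ... the real sync would run here ...
}

func main() {
	// the map is created once, at the start of the dry run
	opts := &SyncOptions{DryRun: map[string][]string{}}
	syncTask("task 1", []string{"1.21", "latest"}, opts)
	syncTask("task 2", []string{"1.19"}, opts)
	// all tasks are done: post-process the collected outputs in one place
	fmt.Println(len(opts.DryRun), "tasks collected")
}
```

Promoting the map to a named type, as suggested, would let the post-processing (JSON dump, later YAML) live as methods on the collector itself.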
