
Regression using ubuntu linux/amd64 host with linux/386 container #7695

Closed

molinav opened this issue Jun 9, 2023 · 11 comments

molinav commented Jun 9, 2023

Description

I observed that one of the project workflows I maintain can no longer build 32-bit packages on 64-bit GNU/Linux hosts, and the only thing that has changed is the GitHub runner image version:

  • The last time I ran the workflow (2023-05-04), a GNU/Linux runner (x64) with ubuntu-latest could use a container based on a pulled image whose only available arch was linux/386, which provided an isolated 32-bit environment for building 32-bit libraries on a 64-bit host. In other words, a linux/amd64 host could handle a linux/386 image when the Docker registry did not offer a linux/amd64 one:
Current runner version: '2.304.0'
Operating System
  Ubuntu
  22.04.2
  LTS
Runner Image
  Image: ubuntu-22.04
  Version: 20230426.1
  Included Software: https://github.com/actions/runner-images/blob/ubuntu22/20230426.1/images/linux/Ubuntu2204-Readme.md
  Image Release: https://github.com/actions/runner-images/releases/tag/ubuntu22%2F20230426.1
Runner Image Provisioner
  2.0.168.1
GITHUB_TOKEN Permissions
Secret source: Actions
Prepare workflow directory
Prepare all required actions
Getting action download info
Download action repository 'actions/download-artifact@v1' (SHA:18f0f591fbc635562c815484d73b6e8e3980482e)
Download action repository 'actions/upload-artifact@v1' (SHA:3446296876d12d4e3a0f3145a3c87e67bf0a16b5)
Complete job name: build-geos (x86)
  • Today (2023-06-09), the same workflow fails without any change to the workflow file. The reason is that the linux/amd64 host now only pulls linux/amd64 images, so when a user provides a repository with only linux/386 images, the runner complains because it cannot find any linux/amd64 image to pull (and it does not fall back to the linux/386 image):
Current runner version: '2.304.0'
Operating System
  Ubuntu
  22.04.2
  LTS
Runner Image
  Image: ubuntu-22.04
  Version: 20230517.1
  Included Software: https://github.com/actions/runner-images/blob/ubuntu22/20230517.1/images/linux/Ubuntu2204-Readme.md
  Image Release: https://github.com/actions/runner-images/releases/tag/ubuntu22%2F20230517.1
Runner Image Provisioner
  2.0.171.1
GITHUB_TOKEN Permissions
Secret source: Actions
Prepare workflow directory
Prepare all required actions
Getting action download info
Download action repository 'actions/download-artifact@v1' (SHA:18f0f591fbc635562c815484d73b6e8e3980482e)
Download action repository 'actions/upload-artifact@v1' (SHA:3446296876d12d4e3a0f3145a3c87e67bf0a16b5)
Complete job name: build-geos (x86)

Passing the --platform option in the container setup is not a solution, because this option and its argument are not forwarded to the docker pull call during container preparation; an issue pointing out this problem was closed long ago (actions/runner#648).
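
For illustration, this is the kind of configuration that does not help (a sketch, assuming the pylegacy/x86-python:3.6-debian-4 image used by the failing job); the options string only reaches docker create, never the docker pull that runs first:

  container:
    image: "pylegacy/x86-python:3.6-debian-4"
    # Applied to `docker create` only; `docker pull` still resolves the
    # manifest against the host platform (linux/amd64) and fails.
    options: "--platform linux/386"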

Platforms affected

  • Azure DevOps
  • GitHub Actions - Standard Runners
  • GitHub Actions - Larger Runners

Runner images affected

  • Ubuntu 20.04
  • Ubuntu 22.04
  • macOS 11
  • macOS 12
  • macOS 13
  • Windows Server 2019
  • Windows Server 2022

Image version and build link

Before (working, 20230426.1): https://github.com/matplotlib/basemap/actions/runs/4884953600/jobs/8718596379
Now (failing, 20230517.1): https://github.com/matplotlib/basemap/actions/runs/5218138554/jobs/9418704735

Is it regression?

Yes, because with runner image version 20230426.1 it was working.

Expected behavior

The ubuntu-latest 64-bit runners should be able to run linux/386 containers as before.

Actual behavior

The ubuntu-latest 64-bit runners are failing because they do not identify linux/386 as a valid architecture.

Repro steps

The workflow below reproduces the bug:
https://github.com/matplotlib/basemap/blob/v1.3.7/.github/workflows/basemap-for-manylinux.yml

In particular, the following job is enough; it does not even start because the container cannot be created:
https://github.com/matplotlib/basemap/blob/v1.3.7/.github/workflows/basemap-for-manylinux.yml#LL78-L125
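
A stripped-down version of that job (a minimal sketch based on the linked workflow and the pylegacy/x86-python image it uses; the step content is only illustrative):

  name: repro-386-container
  on: workflow_dispatch
  jobs:
    build-geos:
      runs-on: ubuntu-latest
      container:
        image: "pylegacy/x86-python:3.6-debian-4"  # repository publishes linux/386 only
      steps:
        - run: uname -m  # never reached; the job aborts at "Starting job container"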

vpolikarpov-akvelon (Contributor) commented:

Hey @molinav. Thank you for reporting. We will investigate it.

vpolikarpov-akvelon (Contributor) commented Jun 13, 2023

Hey @molinav. We updated some underlying infrastructure that may relate to this issue. Could you try running your workflow again?

molinav (Author) commented Jun 13, 2023

Hi @vpolikarpov-akvelon. Unfortunately the problem is still triggered (Runner Image Provisioner is now 2.0.226.1), see below:
https://github.com/matplotlib/basemap/actions/runs/5256661680/jobs/9498570331

Current runner version: 2.304.0
Operating System
  Ubuntu
  22.04.2
  LTS
Runner Image
  Image: ubuntu-22.04
  Version: 20230517.1
  Included Software: https://github.com/actions/runner-images/blob/ubuntu22/20230517.1/images/linux/Ubuntu2204-Readme.md
  Image Release: https://github.com/actions/runner-images/releases/tag/ubuntu22%2F20230517.1
Runner Image Provisioner
  2.0.226.1
GITHUB_TOKEN Permissions
  Actions: write
  Checks: write
  Contents: write
  Deployments: write
  Discussions: write
  Issues: write
  Metadata: read
  Packages: write
  Pages: write
  PullRequests: write
  RepositoryProjects: write
  SecurityEvents: write
  Statuses: write
Secret source: Actions
Prepare workflow directory
Prepare all required actions
Getting action download info
Download action repository 'actions/download-artifact@v1' (SHA:18f0f591fbc635562c815484d73b6e8e3980482e)
Download action repository 'actions/upload-artifact@v1' (SHA:3446296876d12d4e3a0f3145a3c87e67bf0a16b5)
Complete job name: build-geos (x86)
Checking docker version
  /usr/bin/docker version --format '{{.Server.APIVersion}}'
  '1.41'
  Docker daemon API version: '1.41'
  /usr/bin/docker version --format '{{.Client.APIVersion}}'
  '1.41'
  Docker client API version: '1.41'
Clean up resources from previous jobs
  /usr/bin/docker ps --all --quiet --no-trunc --filter "label=ed866e"
  /usr/bin/docker network prune --force --filter "label=ed866e"
Create local container network
  /usr/bin/docker network create --label ed866e github_network_9547124535194f69a2c677db1907e35a
6925b09582f071c74d6c21b1ab7f99ce765195ba00475d3eceacef9aceb785de
Starting job container
  /usr/bin/docker pull pylegacy/x86-python:3.6-debian-4
  no matching manifest for linux/amd64 in the manifest list entries
  3.6-debian-4: Pulling from pylegacy/x86-python
  Warning: Docker pull failed with exit code 1, back off 7.413 seconds before retry.
  /usr/bin/docker pull pylegacy/x86-python:3.6-debian-4
  3.6-debian-4: Pulling from pylegacy/x86-python
  no matching manifest for linux/amd64 in the manifest list entries
  Warning: Docker pull failed with exit code 1, back off 7.159 seconds before retry.
  /usr/bin/docker pull pylegacy/x86-python:3.6-debian-4
  3.6-debian-4: Pulling from pylegacy/x86-python
  no matching manifest for linux/amd64 in the manifest list entries
  Error: Docker pull failed with exit code 1

molinav (Author) commented Jun 13, 2023

I also tested on my personal computer (Windows 10 Pro x64, WSL with Debian 11) to ensure that the linux/386 image can actually be run from my 64-bit machine.

On Windows with Docker Desktop + Linux containers:

[vic@onyx] C:\Users\vic> docker run pylegacy/x86-python:3.6-debian-4 sh -c 'echo "Hello, world!"'
Unable to find image 'pylegacy/x86-python:3.6-debian-4' locally
3.6-debian-4: Pulling from pylegacy/x86-python
138bac1fe8c9: Pull complete
8ea2a5bcb8cc: Pull complete
44710418a973: Pull complete
9a878bf3e276: Pull complete
8c2d7412451a: Pull complete
Digest: sha256:1ec7445d6482d32da785550a660a014124e97eceb63d8bbb6edbd663fa5abe28
Status: Downloaded newer image for pylegacy/x86-python:3.6-debian-4
Hello, world!
[vic@onyx] C:\Users\vic>

On WSL with Docker CLI:

vic@onyx:~$ docker run pylegacy/x86-python:3.6-debian-4 sh -c 'echo "Hello, world!"'
Unable to find image 'pylegacy/x86-python:3.6-debian-4' locally
3.6-debian-4: Pulling from pylegacy/x86-python
138bac1fe8c9: Already exists
8ea2a5bcb8cc: Already exists
44710418a973: Already exists
9a878bf3e276: Already exists
8c2d7412451a: Already exists
Digest: sha256:1ec7445d6482d32da785550a660a014124e97eceb63d8bbb6edbd663fa5abe28
Status: Downloaded newer image for pylegacy/x86-python:3.6-debian-4
Hello, world!
vic@onyx:~$

molinav (Author) commented Jun 27, 2023

To keep this alive: I have been re-running the same workflows whenever new runner images became available, and the exact same problem persists in all of them (the last runner image version tested was 20230619.1.0).

vpolikarpov-akvelon (Contributor) commented:

Hey, @molinav. I have carefully investigated the information you provided once more.

I noticed that there were only three successful builds, on May 4 and May 5. Two weeks later, on May 18, the Docker image pylegacy/x86-python was updated. The workflow failures started on June 9. Since we haven't made any significant changes on our end, I suspect that the problem might be caused by the image update. Unfortunately, I couldn't access the version of the image from before May 18, but if it was mistakenly built for amd64 instead of 386, that would explain why you had successful builds before. I suggest checking whether this was the case.

In any case, the runner pulls the image using a plain docker pull: link to source code. There doesn't seem to be any logic to manually specify the platform, and it appears that there never was. I also couldn't find any Docker daemon options that configure fallback behavior for the container platform.

Regarding your local PC, the reason you can pull the image without explicitly specifying the platform may be due to the Docker version. The Docker version on GitHub-hosted runners is currently 20.10.25+azure-2 while the current latest version is 24.0.2. If you are using Docker Desktop, the behavior may differ even more.

Considering all this information, I don't believe it is related to the runner image update. If there is something I overlooked, please let us know in the comments.

molinav (Author) commented Jun 28, 2023

Thanks for the feedback, @vpolikarpov-akvelon!

The Docker image update on May 18 corresponds to a rebuild of the same Dockerfile with the latest Python versions built from source (very likely only the Python patch versions changed for the still-supported Python versions).

As you indicated, the plain docker pull call has been there for a while, without any special platform argument or environment variable. However, the target architecture of my images looks correct to me, based on the earlier GitHub Actions runs. Let me explain:

  • On May 5, docker pull was still able to pull i386 images on amd64 hosts. I infer this because the docker pull call did not fail, and the subsequent docker create call produced the following warning (still present in the logs):
WARNING: The requested image's platform (linux/386) does not match the detected host platform (linux/amd64) and no specific platform was requested

so docker pull actually retrieved an i386 image, and the warning was emitted afterwards because an i386 image was used to create a container on an amd64 host without "--platform linux/386" being given explicitly in the docker create call. The reason for getting an i386 image is that pylegacy/x86-python only provides i386 images, so no amd64 images can be pulled (this is intended behaviour).

  • On June 9, docker pull is no longer able to pull i386 images on amd64 hosts. Pulling from pylegacy/x86-python, which only provides i386 images, fails on an amd64 host because Docker does not understand that the i386 images can also be pulled by the amd64 host:
no matching manifest for linux/amd64 in the manifest list entries

so the container initialisation is aborted because no amd64 images are found on pylegacy/x86-python, and docker pull no longer seems to consider the i386 images that are present. The fact that pylegacy/x86-python:3.6-debian-4 is an i386-only image (independently of my update on May 18) can be seen on Docker Hub below:
https://hub.docker.com/layers/pylegacy/x86-python/3.6-debian-4/images/sha256-91bc1c1b2e60948144cc32d5009827f2bf5331c51d43ff4c4ebfe43b0b9e7843
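
One way to double-check this from any machine (a sketch; docker manifest inspect needs a reasonably recent Docker client) is to look at which platforms the manifest list advertises:

  docker manifest inspect pylegacy/x86-python:3.6-debian-4
  # The "platform" entries only list "os": "linux" with "architecture": "386";
  # there is no linux/amd64 entry, which is what the runner's plain pull asks for.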

I hope this clarifies the behaviour I was seeing at the beginning of May compared with the behaviour I have seen since June. Could it be that the Docker version changed, and the newer Docker version in the runner images behaves differently in these multi-platform cases?

My tests on Windows and WSL2 were done with Docker Desktop, which currently provides Docker 24.0.2.

It seems that it is possible to override the default Docker platform used when pulling through the DOCKER_DEFAULT_PLATFORM environment variable, so if I could set this on the host that initialises the job container in the GitHub Action, my workflows would probably work again:

export DOCKER_DEFAULT_PLATFORM=linux/386

but I could not figure out how to do this (if it is even possible), because the environment variables I export only take effect after the job container has already been created.
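
For what it's worth, the variable should work when you control the docker call yourself, for example by running the image manually from a plain (non-container) job (a hypothetical sketch; it does not help for the job container itself, whose pull happens before any step runs):

  jobs:
    build-geos:
      runs-on: ubuntu-latest
      steps:
        - name: Run the i386 image manually instead of as a job container
          env:
            DOCKER_DEFAULT_PLATFORM: linux/386
          run: docker run --rm pylegacy/x86-python:3.6-debian-4 uname -m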

vpolikarpov-akvelon (Contributor) commented:

Well, I tried reverting the moby-engine upgrade that took place here on a VM created from the runner image, and it did indeed help. It looks like, until version 20.10.25, moby-engine ignored the arch completely during pull and could download a non-matching arch even when the platform was specified explicitly.

I didn't find any related changes in the moby-engine changelog, but I think it may be caused by the update of the dependency opencontainers/image-spec from v1.0.3 to v1.1.0-rc2. The new version of image-spec introduces additional annotations for arch. So the behavior we are talking about seems to be a new feature rather than a bug.

We can't pin the version of the moby-engine package, so the only way for you to restore the functionality you have lost is to request it in the runner repo. I think this feature may be re-requested, taking into account the new information and recent spec updates.

As a workaround, I can suggest pinning the images by digest, like this:

  build-geos:
    strategy:
      matrix:
        image:
          - "pylegacy/x64-python:3.6-debian-4@sha256:41f8377e5294575bae233cc2370c6d4168c4fa0b24e83af4245b64b4940d572d"
          - "pylegacy/x86-python:3.6-debian-4@sha256:91bc1c1b2e60948144cc32d5009827f2bf5331c51d43ff4c4ebfe43b0b9e7843"

It's quite dumb, I know, but I don't see any other options for now.
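
In case it helps, the digests for that pinning can be read from Docker Hub or queried locally (a sketch; it assumes a Docker client with the buildx plugin):

  docker buildx imagetools inspect pylegacy/x86-python:3.6-debian-4
  # Prints the manifest list and the per-platform manifests; the linux/386
  # manifest digest is the value to append after '@' in the image reference.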

molinav (Author) commented Jun 29, 2023

Thanks for your detailed analysis, @vpolikarpov-akvelon. I am currently inspecting other possible sources of the issue, since my naive rebuild of the Docker images (which you pointed out yesterday) could also have had an impact that I was not aware of.

It seems that, since BuildKit v0.10.0, buildx generates multi-platform manifests even if only a single architecture is built. The workflows that build my Docker images use a BuildKit setup action that fetches the latest BuildKit available (no version pinning). So it could be that my Docker images from before May were built with an older BuildKit (0.9.1?) and used the old manifest format, while the ones after May are built with this multi-platform manifest format for single-platform images.

I could find similar issues and pull requests from the last few weeks (docker/buildx#1533, open-policy-agent/opa#6052, freelawproject/courtlistener#2830 (comment)). I am currently rebuilding my Docker images with the --provenance false switch. Once they are ready, I will re-run my failing Python workflows and see whether this is a viable workaround.
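
For reference, the rebuild command is roughly of this form (a sketch; the Dockerfile location and tag are illustrative, and docker/build-push-action exposes an equivalent provenance: false input):

  docker buildx build \
    --platform linux/386 \
    --provenance=false \
    --tag pylegacy/x86-python:3.6-debian-4 \
    --push .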

molinav (Author) commented Jul 2, 2023

@vpolikarpov-akvelon I think I can confirm the source of the issue: it is not related to any runner-images update but, as you said, to an (unexpected) change in my Docker image rebuilds caused by BuildKit. In summary:

  • Old BuildKit (<0.10.0) to build + Old Docker (20.10.x) => ✅ (GitHub Actions situation before I updated the Docker images)
  • New BuildKit (0.11.x) to build + New Docker (24.0.x) => ✅ (my PC local configuration that is working)
  • New BuildKit (0.11.x) to build + Old Docker (20.10.x) => ❌ (GitHub Actions situation after I updated the Docker images)

With the old Docker versions, amd64 hosts can run i386 images if:

  1. The image repository is single-platform (i386-only) and its architecture is compatible with the host architecture. A warning is printed to the console.
  2. The image repository is multi-platform, the target architecture (i386) is compatible with the host architecture (amd64), and the switch --platform linux/386 is given explicitly. Otherwise, Docker only looks for amd64 in the multi-platform manifest and raises an error if it is not found.

The old BuildKit generates single-platform images with buildx when only one platform is passed to the build call. The new BuildKit generates multi-platform manifests even if only one platform is passed to the build call. So last month I was in situation 1 and everything worked; now I am in situation 2 and everything fails because the switch --platform linux/386 is not given explicitly. Newer Docker versions (24.0.x) seem to understand that, if the amd64 platform is not available in the manifest but the i386 platform is, then the i386 image is the one pulled, and in that case no warning is raised as before.
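
For completeness, the explicit pull from situation 2 looks like this when you drive docker yourself (a sketch; this is exactly the option the runner does not pass when pulling job containers):

  # On old Docker (20.10.x) this succeeds against a multi-platform manifest,
  # because the desired platform is requested explicitly:
  docker pull --platform linux/386 pylegacy/x86-python:3.6-debian-4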

The current workaround that solves my problem is to force --provenance false when using newer BuildKit versions, since this switch forces the generation of single-platform images as with older BuildKit versions, which puts me back in situation 1.

I rebuilt my Docker images with the --provenance false switch and now the workflows are working again with the latest runner images pulled by GitHub Actions, which confirms that the problem was not on the runner-images side:
https://github.com/matplotlib/basemap/actions/runs/5415246195

Thanks for your effort and time, @vpolikarpov-akvelon!

vpolikarpov-akvelon (Contributor) commented:

@molinav, thank you for the solution and the detailed explanation. As the problem seems to be resolved now, I'm closing this thread. Feel free to reach out again if you have other problems or questions.

fjtrujy added a commit to irixxxx/toolchains that referenced this issue Sep 28, 2023