Hey folks! As many of you have experienced, deploys from GitHub Actions using the default Docker Buildx version have been failing for about two weeks. The good news is that this is now fixed! That said, I want to share some background on what actually went wrong.
Part 1: Buildx v0.10
On January 9th Buildx v0.10 was released. The release notes included the following warning:
Buildx v0.10 enables support for a minimal SLSA Provenance attestation, which requires support for OCI-compliant multi-platform images. This may introduce issues with registry and runtime support (e.g. Google Cloud Run and Lambda). You can optionally disable the default provenance attestation functionality using --provenance=false.
As you might imagine, the issue described extended beyond Google Cloud Run and Lambda to Fly.io as well. Despite this mildly breaking change, few people noticed, since most people and platforms remained on Buildx v0.9.
Part 2: The GitHub upgrade
As best we can tell, on January 19th GitHub upgraded the default version of Buildx used in their Docker build actions from v0.9 to v0.10 (as they should). This meant that images built on GitHub Actions in their default configuration started failing to run on Google Cloud Run, Lambda, and Fly.io. We started to take notice as people described similar errors from their deploys. They looked like this:
Searching for image 'registry.fly.io/example:tag' remotely...
Error failed to fetch an image or build from source: Could not find image "registry.fly.io/example:tag"
We began investigating (and for me at least, tearing out my hair) trying to understand that error. Could not find image? We could see that the images existed in our registry, and yet for some reason our API was failing to find them. What was going on?!
Part 3: The workarounds
We eventually narrowed down the cause to the Buildx upgrade and recommended that people downgrade. Shortly after, we found more context about the changes and were able to recommend that people disable the provenance feature of Buildx.
Part 4: The real fix
For those who don’t know, our API (what flyctl interacts with) is a monolithic Rails application. To interact with our Docker registry, we use the excellent docker_registry2 gem. In a minimal test example, I was able to reproduce the gem failing to locate an image manifest that I knew for a fact existed.
I decided to manually replicate the HTTP requests the gem was making. Lo and behold, I got a 404 fetching manifests built with the latest Buildx, but that's not all I saw. There was an error message along for the ride: OCI index found, but accept header does not support OCI indexes. When I saw that, everything clicked: Docker Buildx was now generating “multi-platform” OCI images by default, not “docker” images.
Docker registries require clients to pass an Accept header specifying which types of image resources they can handle. For example, Docker image manifests have the media type application/vnd.docker.distribution.manifest.v2+json, whereas OCI image manifests have the media type application/vnd.oci.image.manifest.v1+json.
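To make the manual request concrete, here is a minimal Ruby sketch that uses only the standard library's Net::HTTP. It is not the docker_registry2 gem's actual code. The helper names (manifest_request, fetch_manifest) and the DOCKER_ONLY_ACCEPT constant are made up for illustration, and authentication is left out, so a real registry would most likely need a token added before this returns anything useful.

```ruby
require "net/http"
require "uri"

# Media types a Docker-only client advertises: image manifests and manifest lists.
DOCKER_ONLY_ACCEPT = [
  "application/vnd.docker.distribution.manifest.v2+json",
  "application/vnd.docker.distribution.manifest.list.v2+json"
].freeze

# Build (but do not send) a GET request for an image manifest.
# registry, repository, and reference are placeholder arguments.
def manifest_request(registry, repository, reference, accept_types)
  uri = URI("https://#{registry}/v2/#{repository}/manifests/#{reference}")
  request = Net::HTTP::Get.new(uri)
  request["Accept"] = accept_types.join(", ")
  request
end

# Send the request and return the status code and body, so that any error
# message the registry includes is visible alongside the status.
# Assumption: no auth header is set here; add one for a private registry.
def fetch_manifest(request)
  uri = request.uri
  Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == "https") do |http|
    response = http.request(request)
    [response.code, response.body]
  end
end
```

Passing a request built with DOCKER_ONLY_ACCEPT to fetch_manifest, for an image built with the default Buildx v0.10, is the kind of call that came back as a 404 carrying the OCI index message.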
The distinction is a subtle one, but when I checked the code, docker_registry2 was only setting Accept headers for Docker image manifests and manifest lists, not for OCI manifests and indexes (the OCI counterpart of manifest lists).
So by adding the new OCI Accept headers to docker_registry2, we are once again able to process images built with the default Buildx.
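A toy model of the negotiation shows why the extra Accept entries matter. This is a simplified sketch under stated assumptions: negotiate is a hypothetical stand-in for the registry rather than its real logic, the constant names are mine, and the actual change to docker_registry2 may well look different.

```ruby
# Media types the client advertised before the fix.
DOCKER_TYPES = [
  "application/vnd.docker.distribution.manifest.v2+json",
  "application/vnd.docker.distribution.manifest.list.v2+json"
].freeze

# OCI media types added by the fix: image manifests and image indexes.
OCI_TYPES = [
  "application/vnd.oci.image.manifest.v1+json",
  "application/vnd.oci.image.index.v1+json"
].freeze

OLD_ACCEPT = DOCKER_TYPES.join(", ")
NEW_ACCEPT = (DOCKER_TYPES + OCI_TYPES).join(", ")

# Toy stand-in for a registry: serve the stored manifest only when its
# media type appears in the client's Accept header, otherwise answer 404.
# Real registries are more involved; this only models the mismatch.
def negotiate(stored_media_type, accept_header)
  accepted = accept_header.split(",").map(&:strip)
  return [200, stored_media_type] if accepted.include?(stored_media_type)

  [404, "stored manifest type is not covered by the Accept header"]
end
```

In this model, an image stored as application/vnd.oci.image.index.v1+json yields a 404 with OLD_ACCEPT and a 200 with NEW_ACCEPT, while plain Docker manifests are served either way.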
This post was kind of a wild ride, but I’m just pleased this bug is fixed. It's been plaguing me. Let me know if any of y’all catch any remaining issues.