Remote builds consistently failing with gRPC error

Today I can’t get remote builds to work. Every attempt results in this error:

#3 resolve image config for
#3 sha256:401713457b113a88eb75a6554117f00c1e53f1a15beec44e932157069ae9a9a3
#3 ERROR: rpc error: code = Canceled desc = grpc: the client connection is closing
 > resolve image config for
Error error building: failed to solve with frontend dockerfile.v0: failed to solve with frontend gateway.v0: rpc error: code = Canceled desc = grpc: the client connection is closing

I also just tried a local deploy. The build works, but deploy gets stuck on the Pulling container image step of the release task run.

UPDATE: The local deploy finally finished after about 30 minutes.

This is abnormal.

We’ve been working on the remote builders extensively these past few days and have made improvements. We’ll be releasing them very soon.

The slow or failing Pulling container step of the release task is probably a bug. I’m assuming you’re using the unreleased [deploy] config? We need to do some additional testing there :slight_smile:

Would you mind telling us your app name? Either here or via DM. It should help us figure out what happened.

Yup I’m on the edge! My app is still in ‘getting first customer to login’ phase :laughing:

App name is ensayo.

One other comment: the CLI UX is a bit confusing when it looks like the release command is running, but it’s actually still going through the motions of pulling, etc., before running the actual command. This may be because the line is fixed to the terminal while actions are displayed above. I’d recommend just making this a linear flow, or at least having the steps show up below the fixed line.

I think this particular instance of slowness was our backing store (S3) for the registry being slow as hell from AMS (where your app is deployed).

We have a few ideas to make this better (multiple S3 regions, for example). Ultimately, we need to get our act together and offer distributed storage on our platform and use it with our registry.

Sounds reasonable! This is the architecture I was trying to get on Digital Ocean, but their platform is pretty black box still and the registry is only in one region for all (AMS :laughing: ).

S3-compatible storage would be killer, but obviously a huge challenge. It appears to be the thing that goes down the most at DO, or that runs out of capacity. I think an interesting way to solve that problem would be to only offer distributed storage to customers using the VM platform.

Meanwhile, to avoid dependency on S3, it might be helpful to allow deploying from a registry hosted inside Fly on a regular VM with block storage. I saw you were hesitant to allow deployment from 3rd party registries, which is understandable. But allowing us to run a private registry and push to it might be a medium-term solution.

I’m also seeing something similar, the build fails after 5 minutes. After re-running once or twice it works.

Error error connecting to docker: Could not ping remote builder within 5 minutes, aborting.

Could you try updating flyctl? We made a change yesterday evening to use wireguard for the connection which should be faster and more reliable.

I did and now am seeing a new error:

	 Running: `bundle exec rake db:migrate` as root
	 2021/05/07 17:30:54 listening on [fdaa:0:22b7:a7b:aa3:466e:9f67:2]:22 (DNS: [fdaa::3]:53)
	 bundler: failed to load command: rake (/app/vendor/bundle/ruby/3.0.0/bin/rake)
	 /usr/lib/fullstaq-ruby/versions/3.0.0-jemalloc/lib/ruby/3.0.0/bundler/spec_set.rb:87:in `block in materialize': Could not find ast-2.4.2 in any of the sources (Bundler::GemNotFound)

This suggests something may have changed with the BUNDLE_PATH env var, which gets set in the Dockerfile. Might something have changed in the environment of the deploy command?

# syntax = docker/dockerfile:experimental

ARG RUBY_VERSION=3.0.0-jemalloc
FROM${RUBY_VERSION}-slim as build

ARG RAILS_ENV=production
ENV BUNDLE_PATH vendor/bundle

# Reinstall runtime dependencies that need to be installed as packages

RUN --mount=type=cache,id=apt-cache,sharing=locked,target=/var/cache/apt \
    --mount=type=cache,id=apt-lib,sharing=locked,target=/var/lib/apt \
    apt-get update -qq && \
    apt-get install --no-install-recommends -y \
    postgresql-client file rsync git build-essential libpq-dev wget vim curl gzip xz-utils \
    && rm -rf /var/lib/apt/lists /var/cache/apt/archives

RUN gem install -N bundler -v 2.2.16

RUN mkdir /app

# Install rubygems
COPY Gemfile* ./

COPY bin/rsync-cache bin/rsync-cache

ENV BUNDLE_WITHOUT development:test

RUN --mount=type=cache,target=/cache,id=bundle \
    bin/rsync-cache /cache vendor/bundle "bundle install"

ENV PATH $PATH:/usr/local/bin

RUN curl -sO && cd /usr/local && tar --strip-components 1 -xvf /app/node*xz && rm /app/node*xz && cd ~
RUN npm install -g yarn

COPY package.json yarn.lock ./

RUN --mount=type=cache,target=/cache,id=node \
    bin/rsync-cache /cache node_modules "yarn"

COPY . .

ENV NODE_ENV production

RUN bin/esbuild
RUN yarn run typed-content-hash --dir public/packs

RUN rm -rf node_modules vendor/bundle/ruby/*/cache


ARG RAILS_ENV=production

RUN --mount=type=cache,id=apt-cache,sharing=locked,target=/var/cache/apt \
    --mount=type=cache,id=apt-lib,sharing=locked,target=/var/lib/apt \
    apt-get update -qq && \
    apt-get install --no-install-recommends -y \
    postgresql-client file git wget vim curl gzip \
    && rm -rf /var/lib/apt/lists /var/cache/apt/archives

ENV BUNDLE_PATH vendor/bundle

COPY --from=build /usr/local/bin/ffmpeg /usr/local/bin/ffmpeg
COPY --from=build /usr/local/bin/ffprobe /usr/local/bin/ffprobe
COPY --from=build /app /app



CMD ["bundle", "exec", "puma", "-C", "config/puma.rb"]

I think that’s unrelated. Yesterday’s fix was for the EOFs and timeouts with remote builders. flyctl is now connecting to builders over userland WireGuard tunnels instead of anycast to an auth proxy in the builder. That shouldn’t impact the build environment.

We did change one other thing: we’ve upgraded the kernels to 5.12.2. Not sure if that could affect this.

OK, thanks. I believe the issue is related to the Docker build cache (not layer cache). I was able to build locally after busting this cache. The undocumented no-cache option didn’t work with the remote builder, so I tried removing the remote builder volume. I recognize this may not have been smart, as now the remote builder won’t boot :smiley: Any suggestions?

UPDATE: After writing this, I tried deleting the remote builder app, leading to a new one being born and building with a fresh cache!

Hah, yes they’re disposable and will be automatically created if none exist. Deleting the whole app was the right call!

I think I’m seeing this too: the push runs until complete, then just goes back and retries, but only for a few of my images. I’m not using the builder, just pushing. I tried restarting apps/builder, with no luck yet.

flyctl v0.0.216 linux/amd64 Commit: 539d4cf BuildDate: 2021-05-06T21:26:41Z

This was using the github action: superfly/flyctl-actions@1.1

TCP 80/443 ⇢ 5000
Waiting for remote builder fly-builder-red-glitter-####...
Creating WireGuard peer "interactive-d63c146edb27-myemail-gmail-com-688" in region "iad" for organization ludicrous
Error error connecting to docker: Could not ping remote builder within 5 minutes, aborting.

Is this still happening?

We changed our setup to add a caching frontend to S3 (our registry’s storage). I wonder if it’s related.

This is working for me now.

I should mention that the gem install issue mentioned here was fixed by repeating this line in the final build stage:

ENV BUNDLE_WITHOUT development:test


That makes sense. The Heroku ruby buildpack also bundles without dev and test groups.

Yeah. The problem here is that my Dockerfile uses an environment variable to set the bundle groups. bundle exec also needs this variable to be set, apparently. That variable is not inherited by the final stage, so must be repeated.

I’m not sure if this is new behavior in a recent Bundler version, but good to know about!
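
For anyone hitting the same thing, here is a minimal sketch of the pattern, not the actual Dockerfile from this thread. The `ruby:3.0-slim` base image is an assumption used as a stand-in, and the build steps are trimmed down (gems with native extensions would also need build tools installed). The point it illustrates is that each `FROM` starts from that base image's own environment, so `ENV` values set in an earlier stage have to be declared again.

```dockerfile
# Sketch only: ruby:3.0-slim is a stand-in base image (assumption),
# not the image used by the app in this thread.
FROM ruby:3.0-slim AS build
WORKDIR /app

# These settings apply to the build stage only.
ENV BUNDLE_PATH=vendor/bundle
ENV BUNDLE_WITHOUT=development:test

COPY Gemfile Gemfile.lock ./
RUN bundle install
COPY . .

# A new FROM begins with the base image's environment; nothing set via
# ENV in the build stage carries over to this stage.
FROM ruby:3.0-slim
WORKDIR /app

# Repeat both variables so `bundle exec` resolves the same gem path and
# skips the same groups that `bundle install` skipped.
ENV BUNDLE_PATH=vendor/bundle
ENV BUNDLE_WITHOUT=development:test

COPY --from=build /app /app

CMD ["bundle", "exec", "puma", "-C", "config/puma.rb"]
```

Without the second `ENV BUNDLE_WITHOUT`, Bundler at runtime would likely try to load the development and test groups too, and fail on the first gem that was never installed, which matches the `Could not find ast-2.4.2` error above.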

Oh interesting, good to know. Thanks for sharing!