Remote builds consistently failing with gRPC error

jsierles · May 6, 2021, 11:29am

Today I can’t get remote builds to work. Every attempt results in this error:

#3 resolve image config for docker.io/docker/dockerfile:experimental
#3 sha256:401713457b113a88eb75a6554117f00c1e53f1a15beec44e932157069ae9a9a3
#3 ERROR: rpc error: code = Canceled desc = grpc: the client connection is closing
------
 > resolve image config for docker.io/docker/dockerfile:experimental:
------
Error error building: failed to solve with frontend dockerfile.v0: failed to solve with frontend gateway.v0: rpc error: code = Canceled desc = grpc: the client connection is closing

I also just tried a local deploy. The build works, but deploy gets stuck on the Pulling container image step of the release task run.

UPDATE: The local deploy finally finished after about 30 minutes.

jerome · May 6, 2021, 2:57pm

This is abnormal.

We’ve been working on the remote builders extensively these past few days and have made improvements. We’ll be releasing them very soon.

The slow or failing Pulling container step of the release task is probably a bug. I’m assuming you’re using the unreleased [deploy] config? We need to do some additional testing there.

Would you mind telling us your app name? Either here or via DM. It should help us figure out what happened.

jsierles · May 6, 2021, 3:17pm

Yup, I’m on the edge! My app is still in the ‘getting first customer to log in’ phase.

App name is ensayo.

One other comment: the CLI UX is a bit confusing when it looks like the release command is running, but it’s actually still going through the motions of pulling, etc., before running the actual command. This may be because the line is fixed to the terminal while actions are displayed above. I’d recommend just making this a linear flow, or at least having the steps show up below the fixed line.

jerome · May 6, 2021, 3:51pm

I think this particular instance of slowness was our backing store (S3) for the registry being slow as hell from AMS (where your app is deployed).

We have a few ideas to make this better (multiple S3 regions, for example). Ultimately, we need to get our act together and offer distributed storage on our platform and use it with our registry.

jsierles · May 6, 2021, 4:10pm

Sounds reasonable! This is the architecture I was trying to get on Digital Ocean, but their platform is still pretty much a black box and the registry is only in one region for everyone (AMS).

S3-compatible storage would be killer, but obviously a huge challenge. It appears to be what goes down the most at DO, or runs out of capacity. I think an interesting way to solve that problem would be to only offer distributed storage to customers using the VM platform.

Meanwhile, to avoid dependency on S3, it might be helpful to allow deploying from a registry hosted inside Fly on a regular VM with block storage. I saw you were hesitant to allow deployment from 3rd party registries, which is understandable. But allowing us to run a private registry and push to it might be a medium-term solution.

nahtnam · May 7, 2021, 6:32am

I’m also seeing something similar: the build fails after 5 minutes. After re-running once or twice it works.

Error error connecting to docker: Could not ping remote builder within 5 minutes, aborting.

michael · May 7, 2021, 4:09pm

Could you try updating flyctl? We made a change yesterday evening to use WireGuard for the connection, which should be faster and more reliable.

jsierles · May 7, 2021, 5:41pm

I did, and now I’m seeing a new error:

	 Running: `bundle exec rake db:migrate` as root
	 2021/05/07 17:30:54 listening on [fdaa:0:22b7:a7b:aa3:466e:9f67:2]:22 (DNS: [fdaa::3]:53)
	 bundler: failed to load command: rake (/app/vendor/bundle/ruby/3.0.0/bin/rake)
	 /usr/lib/fullstaq-ruby/versions/3.0.0-jemalloc/lib/ruby/3.0.0/bundler/spec_set.rb:87:in `block in materialize': Could not find ast-2.4.2 in any of the sources (Bundler::GemNotFound)

This suggests something may have changed with the BUNDLE_PATH env var, which is set in the Dockerfile. Might something have changed with the environment of the deploy command?

# syntax = docker/dockerfile:experimental

ARG RUBY_VERSION=3.0.0-jemalloc
FROM quay.io/evl.ms/fullstaq-ruby:${RUBY_VERSION}-slim as build

ARG RAILS_ENV=production
ARG RAILS_MASTER_KEY
ENV RAILS_ENV=${RAILS_ENV}
ENV BUNDLE_PATH vendor/bundle
ENV RAILS_MASTER_KEY=${RAILS_MASTER_KEY}

# Reinstall runtime dependencies that need to be installed as packages

RUN --mount=type=cache,id=apt-cache,sharing=locked,target=/var/cache/apt \
    --mount=type=cache,id=apt-lib,sharing=locked,target=/var/lib/apt \
    apt-get update -qq && \
    apt-get install --no-install-recommends -y \
    postgresql-client file rsync git build-essential libpq-dev wget vim curl gzip xz-utils \
    && rm -rf /var/lib/apt/lists /var/cache/apt/archives

RUN gem install -N bundler -v 2.2.16

RUN mkdir /app
WORKDIR /app

# Install rubygems
COPY Gemfile* ./

COPY bin/rsync-cache bin/rsync-cache

ENV BUNDLE_WITHOUT development:test

RUN --mount=type=cache,target=/cache,id=bundle \
    bin/rsync-cache /cache vendor/bundle "bundle install"

ENV PATH $PATH:/usr/local/bin

RUN curl -sO https://nodejs.org/dist/v16.0.0/node-v16.0.0-linux-x64.tar.xz && cd /usr/local && tar --strip-components 1 -xvf /app/node*xz && rm /app/node*xz && cd ~
RUN npm install -g yarn

COPY package.json yarn.lock ./

RUN --mount=type=cache,target=/cache,id=node \
    bin/rsync-cache /cache node_modules "yarn"

COPY . .

ENV NODE_ENV production

RUN bin/esbuild
RUN yarn run typed-content-hash --dir public/packs

RUN rm -rf node_modules vendor/bundle/ruby/*/cache

FROM quay.io/evl.ms/fullstaq-ruby:${RUBY_VERSION}-slim

ARG RAILS_ENV=production

RUN --mount=type=cache,id=apt-cache,sharing=locked,target=/var/cache/apt \
    --mount=type=cache,id=apt-lib,sharing=locked,target=/var/lib/apt \
    apt-get update -qq && \
    apt-get install --no-install-recommends -y \
    postgresql-client file git wget vim curl gzip \
    && rm -rf /var/lib/apt/lists /var/cache/apt/archives

ENV RAILS_ENV=${RAILS_ENV}
ENV RAILS_SERVE_STATIC_FILES true
ENV BUNDLE_PATH vendor/bundle
ENV RAILS_MASTER_KEY=${RAILS_MASTER_KEY}

COPY --from=build /usr/local/bin/ffmpeg /usr/local/bin/ffmpeg
COPY --from=build /usr/local/bin/ffprobe /usr/local/bin/ffprobe
COPY --from=build /app /app

WORKDIR /app

EXPOSE 8080

CMD ["bundle", "exec", "puma", "-C", "config/puma.rb"]

michael · May 7, 2021, 5:54pm

I think that’s unrelated. Yesterday’s fix was for the EOFs and timeouts with remote builders. flyctl is now connecting to builders over userland WireGuard tunnels instead of anycast to an auth proxy in the builder. It shouldn’t impact the build environment.

jerome · May 7, 2021, 5:57pm

We did change another thing: we’ve upgraded the kernels to 5.12.2. Not sure if that could affect this.

jsierles · May 8, 2021, 4:49am

OK, thanks. I believe the issue is related to the Docker build cache (not the layer cache). I was able to build locally after busting this cache. The undocumented no-cache option didn’t work with the remote builder, so I tried removing the remote builder volume. I recognize this may not have been smart, as now the remote builder won’t boot. Any suggestions?

UPDATE: After writing this, I tried deleting the remote builder app, leading to a new one being born and building with a fresh cache!

jerome · May 8, 2021, 11:26am

Hah, yes they’re disposable and will be automatically created if none exist. Deleting the whole app was the right call!
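
For anyone who would rather not delete the builder, here is a minimal sketch of another way to start from a fresh cache. It assumes the stale gems lived in the BuildKit cache mount used by the Dockerfile above, and it has not been tested against the remote builders discussed in this thread. Cache mounts are keyed by their id, so a new id (the name bundle-v2 is hypothetical) gives the step an empty cache directory. The image tag is the one the Dockerfile above resolves to.

```dockerfile
# syntax = docker/dockerfile:experimental

FROM quay.io/evl.ms/fullstaq-ruby:3.0.0-jemalloc-slim

WORKDIR /app
COPY Gemfile* ./
# bin/rsync-cache is the helper script from the Dockerfile above
COPY bin/rsync-cache bin/rsync-cache

ENV BUNDLE_PATH vendor/bundle
ENV BUNDLE_WITHOUT development:test

# Assumption: the mount previously used id=bundle. A different id selects a
# separate cache mount, which is empty the first time it is used.
RUN --mount=type=cache,target=/cache,id=bundle-v2 \
    bin/rsync-cache /cache vendor/bundle "bundle install"
```

The old cache is not removed by this; it simply stops being mounted, so it still takes up space on the builder until the builder itself is replaced.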

nate · May 8, 2021, 1:21pm

I think I’m seeing this too: the push runs until complete, then just goes back and retries, but only for a few of my images. I’m not using the builder, just pushing, but I tried restarting apps/builder and no luck yet.

flyctl v0.0.216 linux/amd64 Commit: 539d4cf BuildDate: 2021-05-06T21:26:41Z

nahtnam · May 8, 2021, 7:51pm

This was using the GitHub Action: superfly/flyctl-actions@1.1

TCP 80/443 ⇢ 5000
Waiting for remote builder fly-builder-red-glitter-####...
Creating WireGuard peer "interactive-d63c146edb27-myemail-gmail-com-688" in region "iad" for organization ludicrous
Error error connecting to docker: Could not ping remote builder within 5 minutes, aborting.

jerome · May 9, 2021, 1:13am

Is this still happening?

We changed our setup to add a caching frontend to S3 (our registry’s storage). I wonder if it’s related.

jsierles · May 9, 2021, 9:32pm

This is working for me now.

jsierles · May 9, 2021, 9:33pm

I should just mention that the gem install issue mentioned here was fixed by repeating this line in the final build stage:

ENV BUNDLE_WITHOUT development:test

michael · May 10, 2021, 3:45pm

That makes sense. The Heroku ruby buildpack also bundles without dev and test groups.

jsierles · May 10, 2021, 4:12pm

Yeah. The problem here is that my Dockerfile uses an environment variable to set the bundle groups. bundle exec also needs this variable to be set, apparently. That variable is not inherited by the final stage, so it must be repeated there.
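
A stripped-down sketch of the pattern, assuming the same base image as the Dockerfile earlier in the thread and leaving out everything unrelated to Bundler. Each FROM starts from the base image’s environment, so ENV lines from the build stage are not visible in the final stage unless they are written again.

```dockerfile
FROM quay.io/evl.ms/fullstaq-ruby:3.0.0-jemalloc-slim as build

ENV BUNDLE_PATH vendor/bundle
ENV BUNDLE_WITHOUT development:test

WORKDIR /app
COPY Gemfile* ./
# Installs only the groups not listed in BUNDLE_WITHOUT
RUN bundle install
COPY . .

# New stage: none of the ENV values set above carry over
FROM quay.io/evl.ms/fullstaq-ruby:3.0.0-jemalloc-slim

ENV BUNDLE_PATH vendor/bundle
# Repeated so bundle exec does not look for development and test gems
# that were never installed into vendor/bundle
ENV BUNDLE_WITHOUT development:test

COPY --from=build /app /app
WORKDIR /app

CMD ["bundle", "exec", "puma", "-C", "config/puma.rb"]
```

Without the second BUNDLE_WITHOUT line, Bundler in the final image tries to resolve every group in the Gemfile, which matches the GemNotFound error for ast-2.4.2 reported above.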

I’m not sure if this is new behavior in a recent Bundler version, but good to know about!

michael · May 10, 2021, 4:13pm

Oh interesting, good to know. Thanks for sharing!
