release_command works with a two-stage build, but private networking fails with a single-stage build

I’m deploying a KeystoneJS app. It has deployed successfully before, using a multi-stage build I copied from somewhere; simplified, it looks like this:

FROM node:16-alpine3.14 AS build
WORKDIR /app
COPY . .
RUN npm run build

FROM node:16-alpine3.14
WORKDIR /app
# everything arrives in the runtime stage as a single layer here
COPY --from=build /app /app
EXPOSE 3000
CMD ["npm", "run", "start"]

And this works fine: with release_command = "npx keystone prisma migrate deploy" in fly.toml, it runs migrations as part of the deploy just great.
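For reference, the relevant bit of fly.toml looks like this (a minimal sketch, everything else omitted):

[deploy]
  release_command = "npx keystone prisma migrate deploy"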

However, that two-stage build is wasteful: the final COPY --from=build /app /app flattens everything into a single layer, which prevents layer reuse and forces a 1 GB network transfer on the smallest change.
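What I’m ultimately after is a cache-friendly layout along these lines (a sketch, assuming a standard npm ci setup; not what I actually deployed):

FROM node:16-alpine3.14
WORKDIR /app
# dependency layer: only rebuilt when the package files change
COPY package.json package-lock.json ./
RUN npm ci
# app layer: rebuilt on source changes, while the layers above are reused
COPY . .
RUN npm run build
EXPOSE 3000
CMD ["npm", "run", "start"]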

Switching the container to either

FROM node:16-alpine3.14
WORKDIR /app
COPY . .
RUN npm run build
EXPOSE 3000
CMD ["npm", "run", "start"]

or (trying to stay closer to the original)

FROM node:16-alpine3.14 AS build
WORKDIR /app
COPY . .
RUN npm run build

FROM build
WORKDIR /app
EXPOSE 3000
CMD ["npm", "run", "start"]

both make private networking fail at deploy time:

	 Configuring firecracker
	 Starting virtual machine
	 Starting init (commit: 252b7bd)...
	 Preparing to run: `docker-entrypoint.sh npx keystone prisma migrate deploy` as node
	 2022/04/19 18:49:44 listening on [fdaa:0:57f1:a7b:8aeb:c46d:2b74:2]:22 (DNS: [fdaa::3]:53)
	 Prisma schema loaded from schema.prisma
	 Datasource "postgresql": PostgreSQL database "postgres", schema "cms" at "foo-postgres.internal:5432"
	 Error: P1001: Can't reach database server at `foo-postgres.internal`:`5432`
	 Please make sure your database server is running at `foo-postgres.internal`:`5432`.
	 Main child exited normally with code: 1
	 Starting clean up.
Error release command failed, deployment aborted

I don’t understand how my changes to the container could break private networking like that. Going back to the two-stage build with COPY --from=build /app /app makes the deploy work, without fail so far. What on earth is going on here?

I think something is wrong with the internal DNS.

To test that, I added a ping of the foo-postgres.internal hostname to the release_command.
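The script is just a ping wrapper, along these lines (a minimal sketch):

#!/bin/sh
# /ping.sh: one ICMP ping against the Postgres app's private hostname
ping -c 1 foo-postgres.internal

with fly.toml pointing at it before the migration:

[deploy]
  release_command = "sh -c /ping.sh && npx keystone prisma migrate deploy"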

First run: the container is built as

FROM node:16-alpine3.14 AS build
...
FROM build
...

Result:

	 Preparing to run: `docker-entrypoint.sh sh -c /ping.sh && npx keystone prisma migrate deploy` as node
	 2022/04/19 20:09:37 listening on [fdaa:0:57f1:a7b:8aeb:6f22:7ac4:2]:22 (DNS: [fdaa::3]:53)
	 ping: bad address 'foo-postgres.internal'

Second run:

FROM node:16-alpine3.14 AS build
...
FROM node:16-alpine3.14
COPY --from=build /app /app
...

Result:

	 Preparing to run: `docker-entrypoint.sh sh -c /ping.sh && npx keystone prisma migrate deploy` as node
	 2022/04/19 20:19:49 listening on [fdaa:0:57f1:a7b:8aeb:b14:58d9:2]:22 (DNS: [fdaa::3]:53)
	 PING foo-postgres.internal (fdaa:0:57f1:a7b:21e0:0:bbb4:2): 56 data bytes
	 ping: permission denied (are you root?)
	 Main child exited normally with code: 1
	 Starting clean up.

So yeah, the ping failed because BusyBox’s ping is the old-school raw-socket kind (it predates unprivileged IPPROTO_ICMP sockets, so it needs root), but the DNS lookup worked. And that’s the only change made to the Containerfile between the two runs!
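A root-free variant of the check, for anyone following along: resolving the name is enough, and BusyBox ships an nslookup applet (a sketch):

#!/bin/sh
# root-free alternative to /ping.sh: just resolve the hostname,
# no raw ICMP socket needed
nslookup foo-postgres.internal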

Private networking itself should always be available, so it’s likely something else is going on here, with DNS resolution specifically. Alpine has been known to have problems with DNS queries (its musl resolver behaves differently from glibc’s), so it might be useful to try another distro.
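For example, swapping to a Debian-based Node image while keeping everything else the same (a sketch, untested against your app):

FROM node:16-bullseye-slim
WORKDIR /app
COPY . .
RUN npm run build
EXPOSE 3000
CMD ["npm", "run", "start"]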

That said, to debug this further, you can remove the release command and run fly ssh console to log in to the running VM. There you might want to try a DNS query with dig (after apk add bind-tools), like dig foo-postgres.internal, and see what you get.
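Something along these lines; since .internal names resolve to IPv6 addresses, the AAAA record is the interesting one, and fdaa::3 (from your logs) is the DNS server to query directly:

fly ssh console

# inside the VM:
apk add bind-tools
dig AAAA foo-postgres.internal
dig AAAA foo-postgres.internal @fdaa::3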