i’ve had a rails app and a “fly postgres” app deployed on fly for over two years now. last week i decided to scale down my fly postgres app’s memory, since it seemed over provisioned and i’m trying to save a bit on the hosting cost. i ran fly machine update xxxxxx --vm-memory 2048 on it and it worked - the postgres app rebooted at the lower VM memory value and the current deploy of the rails app is connected to it.
however, subsequent deploys of the rails app now fail during the release phase when it tries to run rails db:migrate. here are the logs i see:
Error release_command failed running on machine 18570e5b704018 with exit code 1.
Checking logs: fetching the last 100 lines below:
2025-02-16T14:42:39Z 2025-02-16T14:42:39.252110039 [01JM7KBXJDW4XERC6079TDBJ12:main] Running Firecracker v1.7.0
2025-02-16T14:42:40Z INFO Starting init (commit: 67f51b8b)...
2025-02-16T14:42:40Z INFO Preparing to run: `./bin/rails db:migrate` as 1000
2025-02-16T14:42:40Z INFO [fly api proxy] listening at /.fly/api
2025-02-16T14:42:40Z Machine started in 1.017s
2025-02-16T14:42:40Z 2025/02/16 14:42:40 INFO SSH listening listen_address=[fdaa:0:f737:a7b:15f:97f1:47c3:2]:22
2025-02-16T14:42:46Z connection to server at "xxxxxx-pg.internal" (fdaa:0:f737:a7b:94:145b:e642:2), port 5432 failed: timeout expired
my first guess was that something subtle changed, such as the PG user’s password rotated, but when i check $OPERATOR_PASSWORD in the shell of the fly postgres app, it matches the password in $DATABASE_URL in my rails app. maybe some network configuration thing has changed?
thanks, yeah i saw reports of the major outage. i first encountered this friday night though - i’m not sure if the incident has been going on longer than is reported.
I’m not sure if the incident has been going on longer than is reported.
I’d guess not; I was playing with the platform all of yesterday, with a set of machines being created, cloned, and destroyed, probably around fifty times over a number of hours. It was very solid.
AFAIK fly checks list looks good on the postgres app:
NAME | STATUS | MACHINE | LAST UPDATED | OUTPUT
-------*---------*----------------*-------------------*--------------------------------------------------------------------------
pg | passing | 73d8d27ea12989 | 2h24m ago | [✓] transactions: read/write (645.18µs)
| | | | [✓] connections: 12 used, 3 reserved, 300 max (4.94ms)
-------*---------*----------------*-------------------*--------------------------------------------------------------------------
role | passing | 73d8d27ea12989 | Feb 15 2025 15:07 | leader
-------*---------*----------------*-------------------*--------------------------------------------------------------------------
vm | passing | 73d8d27ea12989 | 8h4m ago | [✓] checkDisk: 9.06 GB (46.3%) free space on /data/ (39.24µs)
| | | | [✓] checkLoad: load averages: 0.15 0.32 0.69 (62.35µs)
| | | | [✓] memory: system spent 0s of the last 60s waiting on memory (35.23µs)
| | | | [✓] cpu: system spent 798ms of the last 60s waiting on cpu (25.63µs)
| | | | [✓] io: system spent 5.66s of the last 60s waiting on io (23.89µs)
everything passes when i run fly doctor.
i can see that the IPv6 address it fails to connect to is the correct one: it matches my single primary/leader fly postgres machine.
i’m able to connect to the machine using fly postgres connect and i see the IPv6 address:
% fly postgres connect -a xxxxx-pg
Connecting to fdaa:0:f737:a7b:94:145b:e642:2... complete
psql (14.4 (Debian 14.4-1.pgdg110+1))
Type "help" for help.
postgres=#
if i SSH console into the rails app, i can connect to the database with psql $DATABASE_URL.
that machine (73d8d27ea12989) is the same one i created over 2 years ago, and i updated it 8 days ago.
i see the 6PN address listed for fly-local-6pn in the fly postgres machine’s /etc/hosts file.
sorry to bump, but would anyone at fly have a suggestion of something to try? i haven’t touched my fly.toml config in 6 months and i’m now stuck in a state where i can’t deploy any updates.
It’s pretty hard for fellow users to help, mostly because getting into the mind-space of a problem tends to require a console in front of us! See if there are Fly support options available to you, such as sending an email or creating a ticket.
That said, I have noticed something that may be worth remarking on. I see that you’re running a command to do your migrations, and this looks like it has started up a separate VM. Could you create a console on your running Rails app and run your migration from there?
Put in a ten second sleep then do your migrations in case the network is set up after the machine is ready (i.e. the network may appear eventually)
Put in a 4-hour sleep, then try execing into the machine, so you can play with networking, then kill the machine (i.e. the network will never appear and you need the smallest possible reproducible case)
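If a fixed sleep turns out to help, a retry loop is a slightly more robust variant of the same idea: probe the database a few times and only migrate once a probe succeeds. Here is a minimal sketch in POSIX sh. The function name `retry` and the attempt and delay values are made up for illustration, and it assumes `psql` is available in the image that runs the release command.

```shell
#!/bin/sh
# retry MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Hypothetical helper: runs COMMAND until it exits 0 or MAX_ATTEMPTS is reached.
retry() {
  max_attempts=$1
  delay=$2
  shift 2
  attempt=1
  while :; do
    # Run the probe command; stop as soon as it succeeds.
    "$@" && return 0
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after $attempt attempts: $*" >&2
      return 1
    fi
    echo "attempt $attempt failed, retrying in ${delay}s" >&2
    sleep "$delay"
    attempt=$((attempt + 1))
  done
}

# Example call (an assumption, not taken from the thread's fly.toml):
#   retry 5 3 psql "$DATABASE_URL" -c 'SELECT 1' && ./bin/rails db:migrate
```

The probe is kept separate from the migration on purpose, so the migration itself only ever runs once.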
That said, I have noticed something that may be worth remarking on. I see that you’re running a command to do your migrations, and this looks like it has started up a separate VM. Could you create a console on your running Rails app and run your migration from there?
yes, i can console into the running app and run that command. the rails app is actively connected to postgres and functions properly.
Put in a ten second sleep then do your migrations in case the network is set up after the machine is ready (i.e. the network may appear eventually)
this actually worked! i was surprised though, as i haven’t touched my fly.toml in quite some time. it was:
and it ran after a brief sleep (it’s on alpine). i’m still confused why this just started happening after i scaled down my postgres vm’s memory, but perhaps that was just a red herring?
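One way to check whether this is really a startup race, and not something tied to the memory change, would be to measure how long a fresh machine takes before the database answers. This is a rough sketch only. The function name `time_until_ready` is hypothetical, and it relies on `date +%s`, which busybox and GNU date both support.

```shell
#!/bin/sh
# time_until_ready TIMEOUT_SECONDS COMMAND [ARGS...]
# Hypothetical helper: runs COMMAND about once per second until it succeeds,
# then prints the elapsed whole seconds on stdout.
time_until_ready() {
  timeout=$1
  shift
  start=$(date +%s)
  while :; do
    # Discard the probe's own output so only the elapsed time is printed.
    if "$@" >/dev/null 2>&1; then
      echo $(( $(date +%s) - start ))
      return 0
    fi
    now=$(date +%s)
    if [ $(( now - start )) -ge "$timeout" ]; then
      echo "not ready after ${timeout}s" >&2
      return 1
    fi
    sleep 1
  done
}

# Example call (an assumption): time_until_ready 60 psql "$DATABASE_URL" -c 'SELECT 1'
```

If the printed number is consistently a few seconds on the release machine and near zero on the long-running app machine, that would point at network setup timing instead of the postgres resize.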
Yeah, it could be. I think Fly is ace, but I’d guess that with the pace of product development, subtle changes in platform behaviour will still be a thing for a while.