Volume failure preventing apps from connecting to internal service

We have VMs that are failing to connect to other internal services on the fly network today!

Nothing has changed; these apps simply started failing to connect.

These apps are not running on Node Alpine images.

What is going on?

I now have it working by running fly scale count 1, but something is 100% wrong with some regions or deployments; not sure what is up.

Cancel that, still getting connection issues, ugh.

I am guessing there has to be an issue internally, because the connections are only partially failing: the majority of them fail, but not all. It also seems to be failing based on end-user location?

This is also happening on multiple apps, not just a single app.

Fly status page says 100% operational.

We are troubleshooting a disk array in Chicago. Are you using disks with a single VM in Chicago by chance?

We fixed them about 30 min ago, they were periodically inaccessible for some amount of time before that. If the internal services you’re connecting to don’t have two nodes running, they probably went down.

Please try running fly status -a <internal-service> if you’re still having issues and see if the internal services are healthy or not. And if you are running single node apps for production with a disk, you’ll need to add a second node to prevent issues like this.

Yes, both PG and Redis (the internal services) were running in ORD with volumes.

How do we prevent this from happening again? Is the problem 100% resolved?

Status shows new VMs as of 30 minutes ago for both redis apps that were down.

Status shows new replica VM for PG that went down 20 minutes ago.

Were you running in ORD with a single volume, or with redundancy? You can’t prevent a single volume from going down when hardware fails. You should run multiple VMs + volumes for redundancy (like we do with Postgres by default).

For Redis, you’ll want to consider KeyDB with multiple VMs. If you don’t need the data in Redis to persist (if you’re using it as a cache, for example), you can remove the volume and our system will reschedule it in the event of a hardware failure. Apps relying on a single-node Redis will experience outages when hardware fails.
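On the app side, a reconnecting client also helps things recover on their own once a Redis VM comes back or gets rescheduled. Here’s a rough sketch, assuming BullMQ on top of ioredis; the queue name and REDIS_URL are placeholders, not something from your setup:

```ts
// Hypothetical sketch: an ioredis connection for BullMQ that keeps retrying
// while a single-node Redis is unreachable, instead of crashing the worker.
import { Queue, Worker } from "bullmq";
import IORedis from "ioredis";

const connection = new IORedis(process.env.REDIS_URL ?? "redis://my-redis.internal:6379", {
  // BullMQ needs this so blocking commands wait out the outage rather than erroring
  maxRetriesPerRequest: null,
  // back off between reconnect attempts, capped at 5 seconds
  retryStrategy: (attempt) => Math.min(attempt * 500, 5000),
});

// producers add jobs with queue.add(...); both sides share the retrying connection
const queue = new Queue("scheduled-jobs", { connection });

const worker = new Worker(
  "scheduled-jobs",
  async (job) => {
    // jobs queued before the outage resume once Redis is reachable again
    console.log("processing", job.name);
  },
  { connection }
);
```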

Our ORD Postgres app went down too; if it has redundant volumes by default, why did that happen?

We are using Redis for scheduled jobs (BullMQ).

Are we 100% back?

I don’t know, run fly status on your apps and check. You need to do some debugging before posting in the forums or it’s going to take longer to get help when things have issues. The disks you’re (probably) using have been back for ~45 min now.

If you run fly status -a <postgres-app> we can interpret the output for you as well. It’s unlikely both VMs went down. If one of the postgres VMs did go down, it’s possible your apps continued trying to connect to it when they should have been using other IPs. There are probably nodejs tricks to make this work next time.
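For the “nodejs tricks” part, one hypothetical approach is to resolve the .internal name yourself and try each instance until one accepts a connection. A minimal sketch with the node-postgres (pg) client, assuming <app>.internal returns one AAAA record per VM (the env var names are placeholders):

```ts
// Hypothetical helper: resolve every instance behind <app>.internal and try
// each address until one accepts a Postgres connection.
import { promises as dns } from "dns";
import { Client } from "pg";

async function connectToAnyInstance(appName: string): Promise<Client> {
  const addresses = await dns.resolve6(`${appName}.internal`); // one AAAA per VM
  let lastError: unknown;

  for (const address of addresses) {
    const client = new Client({
      host: address,
      port: 5432,
      user: process.env.PGUSER,
      password: process.env.PGPASSWORD,
      database: process.env.PGDATABASE,
      connectionTimeoutMillis: 2000, // fail fast so the next VM gets tried
    });
    try {
      await client.connect();
      return client; // first reachable instance wins
    } catch (err) {
      lastError = err;
      await client.end().catch(() => {});
    }
  }
  throw lastError ?? new Error(`no reachable instances for ${appName}.internal`);
}
```

Note this only finds a reachable instance; you’d still need to handle landing on a read-only replica instead of the leader.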

Given everything in the past with internal Fly DNS, we were looking in a lot of places in the apps that connect to these services. I am not entirely sure how we would have arrived at this finding any faster by avoiding a post in the forum.

This was a hair-on-fire issue, and with multiple services not responding and nowhere else to go when issues arise, we posted here. We were also checking the Fly status page, which did not show any issues with a disk array in Chicago, so it did not help with diagnosing the issue either.

It does look like one of the two Postgres VMs is much older than the other, so my guess is only one of the two went down, but a new leader was not immediately elected, or it took some time for the internal DNS to point to the replica/redundant VM.

As for our Node app servers, we simply connect to the app-name.internal address, so I’m not entirely sure how we would connect to the replica automatically in the event the leader goes down. Any ideas here? (We’re using Prisma as our client.)
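In the meantime, the best we’ve come up with is a retry wrapper that throws away the Prisma client and reconnects (fresh connections, fresh DNS lookup of the .internal name) when a query fails. Rough, untested sketch of what we’re considering:

```ts
// Rough sketch (hypothetical): retry a Prisma query a few times, recreating
// the client between attempts so new connections are made after a failover.
import { PrismaClient } from "@prisma/client";

let prisma = new PrismaClient();

async function withRetry<T>(run: (db: PrismaClient) => Promise<T>, attempts = 3): Promise<T> {
  let lastError: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await run(prisma);
    } catch (err) {
      lastError = err;
      await prisma.$disconnect().catch(() => {});
      prisma = new PrismaClient(); // new client, new connections on the next query
      await new Promise((resolve) => setTimeout(resolve, 1000 * (i + 1))); // simple backoff
    }
  }
  throw lastError;
}

// usage (model name is just an example): const rows = await withRetry((db) => db.user.findMany());
```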
