Instances
ID PROCESS VERSION REGION DESIRED STATUS HEALTH CHECKS RESTARTS CREATED
ecdf4896 app 5 ams run running (leader) 3 total, 2 passing, 1 critical 0 2022-05-21T09:44:26Z
92c064e6 app 5 ams run running (replica) 3 total, 2 passing, 1 critical 0 2022-05-21T09:43:31Z
Instance
ID = ecdf4896
Process =
Version = 5
Region = ams
Desired = run
Status = running (leader)
Health Checks = 3 total, 2 passing, 1 critical
Restarts = 0
Created = 2022-05-21T09:44:26Z
Recent Events
TIMESTAMP TYPE MESSAGE
2022-05-21T09:44:22Z Received Task received by client
2022-05-21T09:44:39Z Task Setup Building Task Directory
2022-05-21T09:44:47Z Started Task started by client
Checks
ID SERVICE STATE OUTPUT
vm app passing HTTP GET http://172.19.4.106:5500/flycheck/vm: 200 OK Output: "[✓] checkDisk: 8.78 GB (59.7%) free space on /data/ (33.64µs)\n[✓] checkLoad: load averages: 0.00 0.02 0.01 (119.65µs)\n[✓] memory: system spent 0s of the last 60s waiting on memory (31.42µs)\n[✓] cpu: system spent 354ms of the last 60s waiting on cpu (15.95µs)\n[✓] io: system spent 0s of the last 60s waiting on io (15.72µs)"
role app passing leader
pg app critical HTTP GET http://172.19.4.106:5500/flycheck/pg: 500 Internal Server Error Output: "failed to connect to proxy: context deadline exceeded
Deployment Status
ID = 838345b7-ecca-73d9-e489-e7a0a4b65b8b
Version = v6
Status = failed
Description = Failed due to unhealthy allocations
Instances = 2 desired, 1 placed, 0 healthy, 1 unhealthy
Instances
ID PROCESS VERSION REGION DESIRED STATUS HEALTH CHECKS RESTARTS CREATED
428592c5 app 6 ams run running (replica) 3 total, 3 passing 0 1m37s ago
03c079da app 6 ams run running (replica) 3 total, 2 passing, 1 critical 0 10m49s ago
fly restart rebooted only the replica. I used fly vm restart to reboot the master.
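Roughly what I ran, for reference (app name omitted; the leader instance ID is the one shown in the status output above):
➜ fly restart <app-name>                              # this only cycled the replica
➜ fly vm restart <leader-instance-id> -a <app-name>   # restart the leader VM by its instance ID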
Postgres started working again, though fly status still shows "Failed due to unhealthy allocations":
Deployment Status
ID = 838345b7-ecca-73d9-e489-e7a0a4b65b8b
Version = v6
Status = failed
Description = Failed due to unhealthy allocations
Instances = 2 desired, 1 placed, 0 healthy, 1 unhealthy
Instances
ID PROCESS VERSION REGION DESIRED STATUS HEALTH CHECKS RESTARTS CREATED
428592c5 app 6 ams run running (replica) 3 total, 3 passing 0 9m27s ago
03c079da app 6 ams run running (leader) 3 total, 3 passing 0 18m39s ago
What was it, and how can I ensure it doesn’t happen again? There was no alarm and no clear steps to resolve it.
This is super concerning, considering it’s a production app I recently migrated from AWS.
Hey there! I’m sorry you ran into this! That error, pg app critical HTTP GET http://172.19.4.106:5500/flycheck/pg: 500 Internal Server Error Output: "failed to connect to proxy: context deadline exceeded", means that Postgres wasn’t connecting to the proxy correctly. Fortunately, your restart seems to have resolved that; unfortunately, we won’t be able to troubleshoot further why it happened, since the instance was restarted.
You can set up fly checks handlers to alert you when a health check fails, so that you can catch these things earlier and hopefully get them resolved faster. You can read more about that here!
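If it helps, the rough flow looks like this (the create step walks you through picking an organization, a handler type such as Slack or PagerDuty, and a webhook URL; the exact prompts and flags may differ from this sketch):
➜ fly checks handlers create        # set up a handler that gets notified when a health check changes state
➜ fly checks list -a <app-name>     # see the current state of each health check for the app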
Unfortunately, I couldn’t keep it around for a proper dissection; it’s a production app, and the downtime was already about 30 minutes.
Both Postgres instances failed to connect to the proxy simultaneously.
Maybe the issue was outside these instances?
Why did fly restart only restart the replica? Why wasn’t the master restarted?
Why did the new replica also have the same issue?
Why didn’t an automatic instance rebuild kick in after the health checks failed?
Would it have helped to have a third replica in another region?
It’s good I was near the console when it happened. It makes me nervous about what might happen next time. I know it’s not a managed DB service, but it feels like the problem was in the infrastructure, not Postgres itself.
I’d really like to know what can be done to avoid it in the future.
I completely understand not being able to leave that app down. Again, my apologies that you ran into this!
Unfortunately we don’t know exactly why your DB ran into this issue, but we are looking into it. A third replica probably wouldn’t have helped in this case, but you can configure your app to bypass our proxy and connect to port 5433. This will point to either a replica or the leader and therefore doesn’t guarantee writes will work, but can be a more reliable setup because it’s one less moving part.
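Concretely, that’s just a port change in whatever connection string your app uses. Assuming your app reads DATABASE_URL, it would look something like the following (all names and credentials here are placeholders):
➜ fly secrets set DATABASE_URL="postgres://<user>:<password>@<pg-app-name>.internal:5433/<db-name>" -a <app-name>
Port 5432 goes through the proxy to the current leader; 5433 connects directly to whichever Postgres instance the .internal DNS answer resolves to, which is why writes aren’t guaranteed.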
Another issue with Postgres happened today. Am I lucky?
The replica disappeared today. There were a few DBConnection.ConnectionError errors. Later I noticed that the Metrics page showed our DB had halved in size (I assume DB size here is the sum across all instances), and Monitoring showed only one instance.
fly scale show returns count: 2, but fly status lists only one instance.
Why didn’t the replica get restored if it failed? The platform clearly knows there should be two instances and that only one is running. There are still two volumes, and perhaps a potential data loss waiting to happen if the remaining instance reattaches to the wrong volume.
➜ fly scale show
VM Resources for ...
VM Size: shared-cpu-1x
VM Memory: 2 GB
Count: 2
Max Per Region: Not set
➜ fly status
App
Name = ...
Owner = ...
Version = 6
Status = running
Hostname = ....fly.dev
Instances
ID PROCESS VERSION REGION DESIRED STATUS HEALTH CHECKS RESTARTS CREATED
03c079da app 6 ams run running (leader) 3 total, 3 passing 0 2022-06-01T12:59:43Z
I ran fly scale count 2 to force it back to 2 instances. The instance was down for 5 hours until I intervened.
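For the record, roughly what I ran to recover and then sanity-check things (app name omitted):
➜ fly scale count 2 -a <app-name>   # force the instance count back to 2
➜ fly status -a <app-name>          # confirm both instances are listed again
➜ fly volumes list -a <app-name>    # confirm both volumes still exist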
Yesterday we did experience an incident, which is why your replica went down. It looks like it was rescheduled a few times but kept hanging; it would have eventually come back up. This is an issue with Nomad’s behavior: it isn’t really able to distinguish between scheduling that fails due to an issue with the infrastructure and scheduling that fails because the app/VM itself is broken.
I’m sorry you were affected by yesterday’s incident. Rest assured that your instance would have been restored, in time, once the incident was cleared.