[FAILURE] Postgres stopped working: failed to connect to proxy: context deadline exceeded

Connections to the Postgres cluster dropped.

Instances
ID      	PROCESS	VERSION	REGION	DESIRED	STATUS           	HEALTH CHECKS                 	RESTARTS	CREATED
ecdf4896	app    	5      	ams   	run    	running (leader) 	3 total, 2 passing, 1 critical	0       	2022-05-21T09:44:26Z
92c064e6	app    	5      	ams   	run    	running (replica)	3 total, 2 passing, 1 critical	0       	2022-05-21T09:43:31Z
Instance
  ID            = ecdf4896
  Process       =
  Version       = 5
  Region        = ams
  Desired       = run
  Status        = running (leader)
  Health Checks = 3 total, 2 passing, 1 critical
  Restarts      = 0
  Created       = 2022-05-21T09:44:26Z

Recent Events
TIMESTAMP            TYPE       MESSAGE
2022-05-21T09:44:22Z Received   Task received by client
2022-05-21T09:44:39Z Task Setup Building Task Directory
2022-05-21T09:44:47Z Started    Task started by client

Checks
ID   SERVICE STATE    OUTPUT
vm   app     passing  HTTP GET http://172.19.4.106:5500/flycheck/vm: 200 OK Output: "[✓] checkDisk: 8.78 GB (59.7%) free space on /data/ (33.64µs)\n[✓] checkLoad: load averages: 0.00 0.02 0.01 (119.65µs)\n[✓] memory: system spent 0s of the last 60s waiting on memory (31.42µs)\n[✓] cpu: system spent 354ms of the last 60s waiting on cpu (15.95µs)\n[✓] io: system spent 0s of the last 60s waiting on io (15.72µs)"

role app     passing  leader
pg   app     critical HTTP GET http://172.19.4.106:5500/flycheck/pg: 500 Internal Server Error Output: "failed to connect to proxy: context deadline exceeded

After restart

Deployment Status
  ID          = 838345b7-ecca-73d9-e489-e7a0a4b65b8b
  Version     = v6
  Status      = failed
  Description = Failed due to unhealthy allocations
  Instances   = 2 desired, 1 placed, 0 healthy, 1 unhealthy

Instances
ID      	PROCESS	VERSION	REGION	DESIRED	STATUS           	HEALTH CHECKS                 	RESTARTS	CREATED
428592c5	app    	6      	ams   	run    	running (replica)	3 total, 3 passing            	0       	1m37s ago
03c079da	app    	6      	ams   	run    	running (replica)	3 total, 2 passing, 1 critical	0       	10m49s ago

fly restart rebooted only the replica. I used fly vm restart to reboot the master.
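
Roughly what I ran (the app name is a placeholder; the instance ID is the one fly status lists for the leader):

fly vm restart <leader-instance-id> -a <pg-app-name>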

Postgres started working again, though fly status shows “Failed due to unhealthy allocations”.

Deployment Status
  ID          = 838345b7-ecca-73d9-e489-e7a0a4b65b8b
  Version     = v6
  Status      = failed
  Description = Failed due to unhealthy allocations
  Instances   = 2 desired, 1 placed, 0 healthy, 1 unhealthy

Instances
ID      	PROCESS	VERSION	REGION	DESIRED	STATUS           	HEALTH CHECKS     	RESTARTS	CREATED
428592c5	app    	6      	ams   	run    	running (replica)	3 total, 3 passing	0       	9m27s ago
03c079da	app    	6      	ams   	run    	running (leader) 	3 total, 3 passing	0       	18m39s ago

What was it, and how do I ensure it doesn’t happen again? There was no alarm and no clear steps to resolve it.

This is super concerning, considering it’s a production app I recently migrated from AWS.

Hey there! I’m sorry you ran into this! That error (pg app critical HTTP GET http://172.19.4.106:5500/flycheck/pg: 500 Internal Server Error Output: "failed to connect to proxy: context deadline exceeded") means that Postgres wasn’t connecting to the proxy correctly. Fortunately, your restart seems to have resolved that; unfortunately, we won’t be able to troubleshoot further why it happened, since the instance was restarted.

You can set up alerts with fly checks handlers to notify you when a health check fails, so you can catch these things earlier and hopefully get them resolved faster. You can read more about that here!
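
For example, a rough sketch of creating a Slack handler for your organization (the flag names and values here are assumptions; check fly checks handlers create --help for the exact options, and swap in your own org name and webhook URL):

fly checks handlers create --type slack --organization <org-name> --webhook-url <slack-webhook-url>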

Hey zee

Unfortunately I couldn’t keep it around for a proper dissection; it’s a production app, and the downtime was already about 30 minutes.

Both postgres instances simultaneously failed to connect to proxy.

Maybe the issue was outside these instances?
Why did fly restart only restart the replica? Why wasn’t the master restarted?
Why did the new replica have the same issue?
Why didn’t an automatic instance rebuild kick in after the health check failed?
Would it have helped to have a third replica in another region?

It’s good I was near the console when it happened, but it makes me nervous about what might happen next time. I know it’s not a managed DB service, but it feels like the problem was in the infrastructure, not in Postgres itself.

I’d really like to know what can be done to avoid it in the future.

I completely understand not being able to leave that app down. Again, my apologies that you ran into this!

Unfortunately we don’t know exactly why your DB ran into this issue, but we are looking into it. A third replica probably wouldn’t have helped in this case, but you can configure your app to bypass our proxy and connect to port 5433. This will point to either a replica or the leader and therefore doesn’t guarantee writes will work, but can be a more reliable setup because it’s one less moving part.
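
For example, if your app reads its connection string from a DATABASE_URL secret, a sketch of pointing it at port 5433 (the user, password, database, and app names are placeholders for your own values):

fly secrets set DATABASE_URL="postgres://<user>:<password>@<pg-app-name>.internal:5433/<db-name>" -a <app-name>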

I hope this helps you a bit!

Another issue with Postgres happened today. Am I lucky?

The replica disappeared today. There were a few DBConnection.ConnectionError exceptions. Later I noticed that the Metrics page showed our DB had halved in size (I assume the DB size there is a sum across all instances), and Monitoring showed only one instance.

fly scale show returns count: 2
fly status lists one instance.

Why didn’t the replica get restored when it failed? The platform clearly knows there should be two instances and that only one is running. There are still two volumes, and perhaps potential data loss waiting to happen if the remaining instance reattaches to the wrong volume.


➜  fly scale show
VM Resources for ...
        VM Size: shared-cpu-1x
      VM Memory: 2 GB
          Count: 2
 Max Per Region: Not set

➜  fly status
App
  Name     = ...
  Owner    = ...
  Version  = 6
  Status   = running
  Hostname = ....fly.dev

Instances
ID      	PROCESS	VERSION	REGION	DESIRED	STATUS          	HEALTH CHECKS     	RESTARTS	CREATED
03c079da	app    	6      	ams   	run    	running (leader)	3 total, 3 passing	0       	2022-06-01T12:59:43Z

I ran fly scale count 2 to force it back to 2 instances. The instance was down for 5 hours until I intervened.


Yesterday we did experience an incident, which is why your replica went down. It looks like it was rescheduled a few times but kept hanging; it would have eventually come back up. This is an issue with Nomad’s behavior: it isn’t really able to distinguish between scheduling failing due to an issue in the infrastructure and scheduling failing because the app/VM itself is broken.

I’m sorry you were affected by the incident yesterday. Rest assured that your instance would have been restored, in time, once the incident was cleared.


Thanks! Good to know