Recurring error since latest service interruption incident

Hello

There was a service incident in the past 2-3 days, which has since been resolved. However, my Rails app keeps raising this ActiveRecord::ConnectionNotEstablished error in at_exit.

According to HoneyBadger, there have been 10,242 occurrences in total. Here is the description:

ActiveRecord::ConnectionNotEstablished: connection to server at "MAC Address", port 5432 failed: server closed the connection unexpectedly
	This probably means the server terminated abnormally
	before or while processing the request.

What’s wrong? And how can I fix it?

Thank you!

Hi… It looks like several people’s Postgres servers may have been thrown for a loop recently, although it’s not clear which platform incidents caused which exact app-level problems 🩹…

What does fly m list -a db-app-name show at the moment?
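In that listing, the things to look for are any machine whose STATE is not started, whose CHECKS are not all passing, or whose ROLE is something other than primary or replica. Here is a rough Ruby sketch of that check for a pasted listing. It assumes tab-separated columns in the order ID, NAME, STATE, CHECKS, REGION, ROLE; the suspicious_machines helper is a made-up name for illustration, not part of flyctl.

```ruby
# Hypothetical helper: flag rows of a pasted `fly m list` table that look unhealthy.
# Assumes tab-separated columns in the order ID, NAME, STATE, CHECKS, REGION, ROLE.
HEALTHY_ROLES = %w[primary replica].freeze

def suspicious_machines(listing)
  listing.each_line.filter_map do |line|
    cols = line.split("\t").map(&:strip)
    # Skip the header, banner lines, and wrapped continuation lines with no ID.
    next if cols.length < 6 || cols[0].empty? || cols[0] == "ID"

    name, state, checks, role = cols[1], cols[2], cols[3], cols[5]
    passing, total = checks.split("/").map(&:to_i)
    healthy = state == "started" && passing == total && HEALTHY_ROLES.include?(role)

    # Returning nil for healthy rows makes filter_map drop them.
    { name: name, state: state, checks: checks, role: role } unless healthy
  end
end
```

Calling suspicious_machines on the saved output returns one hash per machine that needs attention, and an empty array if everything looks fine.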

Also, is Rails attempting its connection via Flycast?
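One quick way to tell is the hostname in the connection string: as far as I know, a host ending in .flycast goes through Fly's proxy, while one ending in .internal resolves straight to the machines' private addresses. A minimal Ruby sketch, assuming the app reads its connection string from DATABASE_URL (db_route is a hypothetical helper name):

```ruby
require "uri"

# Hypothetical helper: classify how a connection string reaches the database,
# based only on the hostname suffix.
def db_route(database_url)
  host = URI.parse(database_url).host.to_s
  if host.end_with?(".flycast")
    :flycast
  elsif host.end_with?(".internal")
    :internal
  else
    :other
  end
end
```

From a Rails console on the app, db_route(ENV["DATABASE_URL"]) would then show which route is in use. Note that URI.parse can raise on passwords containing unescaped special characters.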

This is what I get:

3 machines have been retrieved from app my-app-db.
View them in the UI here (https://fly.io/apps/my-app-db/machines/)

my-app-db
ID            	NAME             	STATE  	CHECKS	REGION	ROLE   	IMAGE                             	IP ADDRESS                     	VOLUME              	CREATED             	LAST UPDATED        	PROCESS GROUP	SIZE                 
99999999999999	empty-grass-9999 	started	3/3   	sea   	replica	flyio/postgres-flex:16.4 (v0.0.62)	fdaa:9:5a3f:a7b:1a9:cdfc:19f3:2	vol_vppn6z3kpp9n7x2v	2024-09-30T16:48:03Z	2024-09-30T19:52:32Z	             	shared-cpu-2x:2048MB	
99999999999999	young-meadow-9999	started	3/3   	sea   	primary	flyio/postgres-flex:16.4 (v0.0.62)	fdaa:9:5a3f:a7b:1af:4acd:e052:2	vol_vg70p3ewz3gm180v	2024-09-30T16:46:40Z	2024-09-30T19:52:21Z	             	shared-cpu-2x:2048MB	
99999999999999	lively-rain-9999 	started	2/3   	sea   	zombie 	flyio/postgres-flex:16.4 (v0.0.62)	fdaa:9:5a3f:a7b:a5:4a13:41e2:2 	vol_4505n3573g950zxr	2024-09-30T16:49:24Z	2024-09-30T19:53:18Z	             	shared-cpu-2x:2048MB	

Honestly, though, I don’t understand most of your answer. Sorry.

Is there a way I can fix this?

I attempted to restart the “zombie” machine and this is now what I see:

ID            	NAME             	STATE  	CHECKS	REGION	ROLE                                                                                                                                                                                              	IMAGE                             	IP ADDRESS                     	VOLUME              	CREATED             	LAST UPDATED        	PROCESS GROUP	SIZE                 
99999999999999	empty-grass-9999 	started	3/3   	sea   	replica                                                                                                                                                                                           	flyio/postgres-flex:16.4 (v0.0.62)	fdaa:9:5a3f:a7b:1a9:cdfc:19f3:2	vol_vppn6z3kpp9n7x2v	2024-09-30T16:48:03Z	2024-09-30T19:52:32Z	             	shared-cpu-2x:2048MB	
99999999999999	young-meadow-9999	started	3/3   	sea   	primary                                                                                                                                                                                           	flyio/postgres-flex:16.4 (v0.0.62)	fdaa:9:5a3f:a7b:1af:4acd:e052:2	vol_vg70p3ewz3gm180v	2024-09-30T16:46:40Z	2024-09-30T19:52:21Z	             	shared-cpu-2x:2048MB	
99999999999999	lively-rain-9999 	started	1/3   	sea   	500 Internal Server Error                                                                                                                                                                         	flyio/postgres-flex:16.4 (v0.0.62)	fdaa:9:5a3f:a7b:a5:4a13:41e2:2 	vol_4505n3573g950zxr	2024-09-30T16:49:24Z	2024-11-28T02:26:34Z	             	shared-cpu-2x:2048MB	
              	                 	       	      	      	failed to connect to local node: failed to connect to `host=fdaa:9:5a3f:a7b:a5:4a13:41e2:2 user=repmgr database=repmgr`: server error (FATAL: the database system is starting up (SQLSTATE 57P03))	

Regards

Overall, the HA clusters sometimes require a lot of manual intervention and knowledge of the gritty mechanics…

(I think you’ve already seen @uncvrd’s classic post, for example!)

Basically, you can either try to do steps along those lines to fix things from the inside, or take the simpler but less elegant forking approach:

https://community.fly.io/t/urgency-problems-with-postgres-the-database-is-not-responding/19926/2

(Ideally, there would be a fully managed alternative—which Fly apparently is still working on. That may end up costing ~$80/month, though. It’s rather unclear at this point…)

Thank you for your answer. I suspected I would need to use @uncvrd’s classic post, as I’d used it before. I was hoping there would be a more “automated” solution.

Thank you nonetheless.
