Having issues connecting to my database all of a sudden. The region is lax. Everything was fine, then suddenly nothing can connect to it. I’ve had similar issues with my other apps in production. Do others experience this? Can we rely on Fly for hosting a database? It seems like connection issues happen way too frequently for anything stable.
On the PG instance I’m getting this:
Error opening connection to database (postgresql://flypgadmin:PASSWORD_REMOVED@[address]:5432/postgres?sslmode=disable): dial tcp [address]:5432: connect: connection refused source="postgres_exporter.go:1658"
@Mark I believe our project is set up for IPv6. We have been running this server for a few months now. It only started happening, all of a sudden, after I made some UI changes to my Phoenix app.
The Postgres VM was alive, it just wouldn’t allow any connections. I have a Postgres client (Postico) that I use to regularly check the database, and I couldn’t connect from there either.
I had to scale down the Postgres VM, update the Postgres image, and scale it back up. Very alarming; the error I was getting was the one I posted above.
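For reference, this is roughly the sequence of commands I ran (the app name here is just a placeholder for our Postgres app, adjust as needed):

  fly scale count 0 -a my-pg-app
  fly image update -a my-pg-app
  fly scale count 1 -a my-pg-app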
I checked with some people on the Platform team and Support.
I remember seeing issues with Elixir/Phoenix apps a while back. I’m not 100% sure this is the issue, but there were problems with connections not being closed properly during deploys. The end result was either an OOM or running out of connections.
I’ve only seen it happen with Phoenix apps.
(OOM = Out Of Memory)
I don’t know if that’s related, but it makes me wonder whether older versions of ecto, ecto_sql, or postgrex might be a problem.
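If you want to rule out the connection-limit theory, one way is to open a psql session against the database and compare the active connection count with the configured maximum. A rough sketch, with the app name as a placeholder:

  fly postgres connect -a my-pg-app
  # then, inside the psql session:
  SELECT count(*) FROM pg_stat_activity;
  SHOW max_connections;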
Yeah, very weird. Our Phoenix app was fine during that period. We also haven’t deployed in a few weeks, so it shouldn’t be a deploy issue. Those blips in the Grafana logs correspond to the times we noticed the app was down due to Postgres connection issues. We also had very, very low traffic, so it shouldn’t be a memory/connections issue.
@Mark @kurt This is what I get when running the status command:
Instances
ID       PROCESS VERSION REGION DESIRED STATUS           HEALTH CHECKS       RESTARTS CREATED
fae7f0d0 app     5       lax    run     running (leader) 3 total, 3 passing  0        2022-10-09T20:06:17Z
e9ce1fb2 app     5       lax    stop    failed                               0        2022-10-09T19:51:14Z
542badff app     5       lax    stop    failed                               0        2022-10-09T19:49:16Z
Then running vm status on instance e9ce1fb2:
Events
TIMESTAMP            TYPE            MESSAGE
2022-10-09T19:51:32Z Received        Task received by client
2022-10-09T19:51:32Z Task Setup      Building Task Directory
2022-10-09T20:05:33Z Driver Failure  rpc error: code = Unknown desc = unable to create microvm: could not find device for volume with name pg_data
2022-10-09T20:05:33Z Not Restarting  Error was unrecoverable
2022-10-09T20:05:35Z Killing         Sent interrupt. Waiting 5m0s before force killing
Checks
ID SERVICE STATE OUTPUT
Recent Logs
Then on instance 542badff:
Events
TIMESTAMP            TYPE            MESSAGE
2022-10-09T19:48:56Z Received        Task received by client
2022-10-09T19:48:56Z Task Setup      Building Task Directory
2022-10-09T19:50:25Z Driver Failure  rpc error: code = Unknown desc = unable to make tap and generate ip addresses: kill zombies: found colliding a88263-2f9a9d3d: interface is up, cannot reap
2022-10-09T19:50:25Z Not Restarting  Error was unrecoverable
2022-10-09T19:50:53Z Killing         Sent interrupt. Waiting 5m0s before force killing
2022-10-09T19:50:53Z Killing         Sent interrupt. Waiting 5m0s before force killing
2022-10-09T20:05:09Z Received        Task received by client
2022-10-09T20:07:21Z Killing         Sent interrupt. Waiting 5m0s before force killing
Checks
ID SERVICE STATE OUTPUT
Recent Logs
That looks like it went into a crash loop and then took a bit for us to recover. Those driver failures are us cleaning up the env after a previous crash.
If there are older failed VMs, just check the status on each of them. You’ll eventually find one with the original error.
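Something like this should work (the app name and instance ID are placeholders; --all should also include instances that are no longer running):

  fly status --all -a my-pg-app
  fly vm status <instance-id> -a my-pg-app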
Also, scaling up to two nodes will help your app handle this state. These crashes are almost always out-of-memory errors, so adding more memory is a good bet.
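For example, something along these lines would bump the Postgres VM to 1 GB of memory (app name is a placeholder):

  fly scale memory 1024 -a my-pg-app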
Memory usage seemed fine? We were using ~330 of 512 MB, with very little to no traffic. When I listed the status, only 2 old instances came up. Is there a way to narrow this down? It seems a little random, although this has happened in the past (a month or two ago).
Ok I found the original alloc. It exited with no information, then took a couple of minutes to come back.
I think your best bet is to add a second node and see if it happens again. Two nodes will give you HA, so if there are issues with one, your app won’t have connectivity problems.
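If you go that route, the usual procedure for these Postgres apps is to create a second volume with the same name and then scale the count, roughly (region, size, and app name are placeholders):

  fly volumes create pg_data --region lax --size 10 -a my-pg-app
  fly scale count 2 -a my-pg-app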