Having issues connecting to my database all of a sudden. The region is lax. Everything was fine, then suddenly nothing can connect to it. I’ve had similar issues with my other apps in production. Do others experience this? Can we rely on Fly for hosting a database? It seems like connection issues happen way too frequently for anything stable.
On the PG instance I’m getting this:
Error opening connection to database (postgresql://flypgadmin:PASSWORD_REMOVED@[address]:5432/postgres?sslmode=disable): dial tcp [address]:5432: connect: connection refused source="postgres_exporter.go:1658"
@Mark I believe our project is set up for IPv6. We have been running this server for a few months now. It only started happening, all of a sudden, after I made some UI changes to my Phoenix app.
The Postgres VM was alive, it just wouldn’t allow any connections. I have a Postgres client (Postico) that I use to regularly check the database, and I couldn’t connect from there either.
I had to scale down the Postgres VM, update the Postgres image, and scale it back up. Very alarming; the error I was getting was the one I posted above.
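For reference, this is roughly the sequence of commands I ran (the app name here is just a placeholder for our Postgres app, adjust as needed):

  fly scale count 0 -a my-pg-app
  fly image update -a my-pg-app
  fly scale count 1 -a my-pg-app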
I checked with some people on the Platform team and Support.
I remember seeing issues with Elixir/Phoenix apps a while back. I’m not 100% sure this is the issue, but there were problems with connections not being closed properly during deploys. The end result was either an OOM or running out of connections.
I’ve only seen it happen with Phoenix apps.
(OOM = Out Of Memory)
I don’t know if that’s related, but it makes me wonder whether older versions of ecto, ecto_sql, or postgrex might be a problem.
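If you want to rule out the connection-limit theory, one way is to open a psql session against the database and compare the active connection count with the configured maximum. A rough sketch, with the app name as a placeholder:

  fly postgres connect -a my-pg-app
  # then, inside the psql session:
  SELECT count(*) FROM pg_stat_activity;
  SHOW max_connections;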
Yeah, very weird. Our Phoenix app was fine during that period. We also haven’t deployed in a few weeks, so it shouldn’t be a deploy issue. Those blips in the Grafana logs correspond to the times we noticed the app was down due to Postgres connection issues. We also had very, very low traffic, so it shouldn’t be a memory/connections issue.
@Mark @kurt This is what I get when running the status command:
Instances
ID       PROCESS VERSION REGION DESIRED STATUS           HEALTH CHECKS       RESTARTS CREATED
fae7f0d0 app     5       lax    run     running (leader) 3 total, 3 passing  0        2022-10-09T20:06:17Z
e9ce1fb2 app     5       lax    stop    failed                               0        2022-10-09T19:51:14Z
542badff app     5       lax    stop    failed                               0        2022-10-09T19:49:16Z
Then running vm status on instance e9ce1fb2:
Events
TIMESTAMP            TYPE            MESSAGE
2022-10-09T19:51:32Z Received        Task received by client
2022-10-09T19:51:32Z Task Setup      Building Task Directory
2022-10-09T20:05:33Z Driver Failure  rpc error: code = Unknown desc = unable to create microvm: could not find device for volume with name pg_data
2022-10-09T20:05:33Z Not Restarting  Error was unrecoverable
2022-10-09T20:05:35Z Killing         Sent interrupt. Waiting 5m0s before force killing
Checks
ID SERVICE STATE OUTPUT
Recent Logs
Then on instance 542badff:
Events
TIMESTAMP            TYPE            MESSAGE
2022-10-09T19:48:56Z Received        Task received by client
2022-10-09T19:48:56Z Task Setup      Building Task Directory
2022-10-09T19:50:25Z Driver Failure  rpc error: code = Unknown desc = unable to make tap and generate ip addresses: kill zombies: found colliding a88263-2f9a9d3d: interface is up, cannot reap
2022-10-09T19:50:25Z Not Restarting  Error was unrecoverable
2022-10-09T19:50:53Z Killing         Sent interrupt. Waiting 5m0s before force killing
2022-10-09T19:50:53Z Killing         Sent interrupt. Waiting 5m0s before force killing
2022-10-09T20:05:09Z Received        Task received by client
2022-10-09T20:07:21Z Killing         Sent interrupt. Waiting 5m0s before force killing
Checks
ID SERVICE STATE OUTPUT
Recent Logs
That looks like it went into a crash loop and then took a bit for us to recover. Those driver failures are us cleaning up the env after a previous crash.
If there are older failed VMs, just check the status on each of them. You’ll eventually find one with the original error.
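Something like this should work (the app name and instance ID are placeholders; --all should also include instances that are no longer running):

  fly status --all -a my-pg-app
  fly vm status <instance-id> -a my-pg-app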
Also, scaling up to two nodes will help your app handle this state. These crashes are almost always out-of-memory errors, so adding more memory is a good bet.
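For example, something along these lines would bump the Postgres VM to 1 GB of memory (app name is a placeholder):

  fly scale memory 1024 -a my-pg-app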
Memory usage seemed fine? We were using ~330 of 512 MB, with very little to no traffic. When I listed the status, only 2 old instances came up. Is there a way to narrow this down? It seems a little random, although this has happened in the past (a month or two ago).
Ok I found the original alloc. It exited with no information, then took a couple of minutes to come back.
I think your best bet is to add a second node and see if it happens again. Two nodes will give you HA, so if there are issues with one, your app won’t have connectivity problems.
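If you go that route, the usual procedure for these Postgres apps is to create a second volume with the same name and then scale the count, roughly (region, size, and app name are placeholders):

  fly volumes create pg_data --region lax --size 10 -a my-pg-app
  fly scale count 2 -a my-pg-app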