Postgres dropping connection after 30s when using LISTEN/NOTIFY

I’m trying out fly.io with a Rails 7 app using Turbo with ActionCable and the good_job gem.

I’ve configured both ActionCable and good_job to use the Postgres NOTIFY/LISTEN feature for asynchronous notifications. This worked fine running on Heroku, but after switching to fly.io I’m seeing the DB connection reset every 60 seconds or so.
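For context, the relevant configuration looks roughly like this (simplified; the exact values are illustrative):

    # config/cable.yml
    production:
      adapter: postgresql

    # config/environments/production.rb
    # GoodJob uses LISTEN/NOTIFY by default when it's enabled; shown explicitly for clarity.
    config.good_job.enable_listen_notify = true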

This might be related to this topic, but I’m seeing timeouts of 60 seconds rather than 30 minutes.

For ActionCable errors I’m getting:

 `wait_for_notify': PQconsumeInput() server closed the connection unexpectedly (PG::ConnectionBad)

For my good_job process I get:

[GoodJob] Notifier unsubscribed with UNLISTEN
[GoodJob] Notifier errored: ActiveRecord::StatementInvalid: PG::ConnectionBad: PQsocket() can't get socket descriptor

I’ve tried setting the tcp_keepalives_idle value to 30s on the Postgres instance, but it didn’t have any effect.
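For the record, the setting itself can be applied with the standard Postgres commands, something like:

    ALTER SYSTEM SET tcp_keepalives_idle = 30;
    SELECT pg_reload_conf();
    SHOW tcp_keepalives_idle;  -- confirm the new value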

Unless someone has a fix, I think my only option is to spin up a Redis instance, use it for ActionCable, and disable enable_listen_notify for good_job.

Any advice is appreciated. Thanks.


Hi @jasonyork! Since it sounds like you created your Postgres DB recently, your app has probably been configured to connect to it through Flycast (docs + recent forum posts). That is, your connections to the DB get proxied through fly-proxy, which has a few benefits that are described in the first of the linked forum posts. However, the proxy currently has a 60-second timeout for idle TCP connections, so I think it’s pretty likely that that’s the cause. Unfortunately TCP keepalives also won’t help, since those happen at the kernel level and aren’t exposed to the proxy.

To verify whether you’re using Flycast, you can check the hostname you’re using to connect to the database (probably your DATABASE_URL secret). A .flycast domain name or an fdaa... IPv6 address matching one from fly ips list means it’s going through the proxy.
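For example, a proxied DATABASE_URL would have a hostname like this (app name and credentials here are placeholders):

    postgres://myuser:mypassword@my-db-app.flycast:5432/my_db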

I know there’s been some work going on to try to remove the 60-second timeout while still protecting fly-proxy from resource exhaustion. For now, if this turns out to be the issue, you can try connecting to your database instances directly instead. You can use the domain name top2.nearest.of.<your database app's name>.internal, which will resolve to the two nearest instances of your database. (That’s what fly postgres attach would configure before we changed over to using Flycast by default.)
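If you want to try that, it’s just a matter of updating the DATABASE_URL secret on your app, something along these lines (placeholder names and credentials):

    fly secrets set -a my-rails-app \
      DATABASE_URL="postgres://myuser:mypassword@top2.nearest.of.my-db-app.internal:5432/my_db"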


Thanks @MatthewIngwersen! I can confirm that the host was a .flycast domain. I was able to change my DATABASE_URL as you described above, and I no longer see the connection resets, so that does seem to have resolved the problem. I appreciate the clarity and the quick response. Thanks!

@MatthewIngwersen - It looks like I still have a problem. With the change above, I now see the connection consistently reset every 30 minutes. Do you know what might be causing that and if there is a potential fix?

Hi Jason, sorry for the delay in getting back to you. I think the 30-minute timeout might be related to the topic you shared in your original post: Fly Postgres has a built-in HAProxy that forwards connections to the primary instance, and it’s configured with a 30-minute timeout.

Connecting to port 5433 (as opposed to 5432) goes directly to the Postgres instance you connect to, bypassing the HAProxy. However, if you’re running multiple instances, the one you reach might be a read replica. (I’m not a PG expert, so I’m not sure whether that’s acceptable for NOTIFY/LISTEN.)
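In other words, only the port in the connection string changes, e.g. (placeholders again):

    postgres://myuser:mypassword@top2.nearest.of.my-db-app.internal:5433/my_db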

Hi @MatthewIngwersen, thanks. I may try moving to port 5433 in the future. For now, I decided to add a basic Redis instance, use it for ActionCable, and switch good_job to polling mode. It does seem more stable, and I no longer see any dropped PostgreSQL connections.
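Roughly, the changes were along these lines (the Redis URL and poll interval shown are illustrative):

    # config/cable.yml
    production:
      adapter: redis
      url: <%= ENV.fetch("REDIS_URL") %>

    # config/environments/production.rb
    # Fall back to polling instead of LISTEN/NOTIFY for GoodJob
    config.good_job.enable_listen_notify = false
    config.good_job.poll_interval = 10  # seconds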

Unfortunately, I now see periodic Redis connection drops, which cause the app to crash and restart. I haven’t seen a consistent pattern, but it’s on the order of a couple of times a day. I can probably live with that for now since the app isn’t super critical at this point. Here’s what the crash looks like:

iad [info]#<Thread:0x00007f3251870750 /rails/vendor/bundle/ruby/3.1.0/gems/actioncable-7.0.4.3/lib/action_cable/subscription_adapter/redis.rb:150 run> terminated with exception (report_on_exception is true):
iad [info]/rails/vendor/bundle/ruby/3.1.0/gems/redis-4.8.1/lib/redis/client.rb:306:in `rescue in io': Connection lost (ECONNRESET) (Redis::ConnectionError)
