App No Longer Communicating with Database after today's Outage

Hi,

I seem to now be able to run fly deploy again after the outage, but my app does NOT seem to be able to communicate with the database anymore. Any suggestions on troubleshooting? I can connect successfully to the database on my own.

This is the error I get from Phoenix when deploying:

         14:45:56.432 [error] Could not create schema migrations table. This error usually happens due to the following:
           * The database does not exist
           * The "schema_migrations" table, which Ecto uses for managing
             migrations, was defined by another library
           * There is a deadlock while migrating (such as using concurrent
             indexes with a migration_lock)

Also unable to connect to PG:

2022-10-28T14:49:47.708 app[ee979a89] ewr [info] Can’t reach database server at top2.nearest.of.lift-db.internal:5432

2022-10-28T14:49:47.708 app[ee979a89] ewr [info] Please make sure your database server is running at top2.nearest.of.lift-db.internal:5432.

How did y’all even get fly to redeploy your apps in EWR? Running fly restart leaves my app in a pending state.

Yeah, never mind about the deployments working. My deployments are still getting stuck at pending. This is brutal: I have had over 18 hours of downtime this month due to hosting issues.


I’ve been running into a similar issue for a couple of days now when running fly deploy with primary region in bos (edited for region typo, sorry).

Not sure if it has to do with the flycast network, since the db is listed at <db-name>.flycast:5432. Here is the output:

[    0.152832] PCI: Fatal: No config space access function found
INFO Starting init (commit: 15238e9)...
INFO Preparing to run: `/app/bin/migrate` as nobody
INFO [fly api proxy] listening at /.fly/api
2023/10/23 17:02:51 listening on [fdaa:0:85f4:a7b:1ed:7be1:e30c:2]:22 (DNS: [fdaa::3]:53)
17:02:53.323 [error] Postgrex.Protocol (#PID<0.167.0>) failed to connect: ** (DBConnection.ConnectionError) tcp connect (<db-name>.flycast:5432): non-existing domain - :nxdomain

...

** (DBConnection.ConnectionError) connection not available and request was dropped from queue after 2987ms. This means requests are coming in and your connection pool cannot serve them fast enough. You can address this by:
    1. Ensuring your database is available and that you can connect to it
    2. Tracking down slow queries and making sure they are running fast enough
    3. Increasing the pool_size (although this increases resource consumption)
    4. Allowing requests to wait longer by increasing :queue_target and :queue_interval

Hi @mmark,

Could you double-check your primary region? If it turns out to be bos, we do have a single host with DNS issues there. If a release_command ephemeral machine lands on that host, it will have this kind of trouble. So, if the error you’re seeing is related to a release_command, you can try forcing the ephemeral machine to be created in another region (ewr is a good choice as it’s close to bos):

PRIMARY_REGION=ewr fly deploy

Let me know if that works - it should be a temporary workaround while we get that pesky bos host’s DNS working.

  • Daniel
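
Since the failure reported above is an :nxdomain on the database hostname, one quick check is whether that name resolves from inside a machine at all. The following is only a minimal sketch: check_dns is a hypothetical helper name (not part of flyctl), and it assumes the image ships getent, which is common on Debian-based images but not guaranteed elsewhere.

```shell
# check_dns is a hypothetical helper, not a flyctl command.
# Assumption: getent is installed in the image.
check_dns() {
  host="$1"
  # getent hosts looks the name up through the system resolver (NSS)
  # and exits non-zero when no address is found.
  if getent hosts "$host" >/dev/null 2>&1; then
    echo "resolves: $host"
  else
    echo "does not resolve: $host"
    return 1
  fi
}
```

From a shell on an app machine (for example via fly ssh console), running check_dns <db-name>.flycast would show whether the lookup fails there too. A release_command runs on a separate ephemeral machine, so a successful lookup on one machine does not rule out a bad host elsewhere.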

Hi Daniel,

Thanks! I just realized I noted the wrong region; I am using bos as primary. I will try your suggestion right now and update.

Update
fly deploy is working again now for the bos region, thanks!