Can't reach database server in FRA

Hi,

I have a user that reported my website was down. When I checked the main page, I had indeed the following message:

2022-07-21T09:07:01Z app[4bebc390] fra [info]Error: Can't reach database server at `fra.top2.nearest.of.h***********-prod-db.internal`:`5432`
2022-07-21T09:07:01Z app[4bebc390] fra [info]Please make sure your database server is running at `fra.top2.nearest.of.h***********-prod-db.internal`:`5432`.
2022-07-21T09:07:01Z app[4bebc390] fra [info]    at RequestHandler.handleRequestError (/myapp/node_modules/@prisma/client/runtime/index.js:49670:13)
2022-07-21T09:07:01Z app[4bebc390] fra [info]    at RequestHandler.request (/myapp/node_modules/@prisma/client/runtime/index.js:49652:12)
2022-07-21T09:07:01Z app[4bebc390] fra [info]    at PrismaClient._request (/myapp/node_modules/@prisma/client/runtime/index.js:50572:18)
2022-07-21T09:07:01Z app[4bebc390] fra [info]    at loader95 (/myapp/build/index.js:18771:18)
2022-07-21T09:07:01Z app[4bebc390] fra [info]    at Object.callRouteLoader (/myapp/node_modules/@remix-run/server-runtime/data.js:77:14)
2022-07-21T09:07:01Z app[4bebc390] fra [info]HEAD / 500 - - 19.916 ms
2022-07-21T09:07:01Z app[4bebc390] fra [info]healthcheck ❌ {
2022-07-21T09:07:01Z app[4bebc390] fra [info]  error: Response {
2022-07-21T09:07:01Z app[4bebc390] fra [info]    size: 0,
2022-07-21T09:07:01Z app[4bebc390] fra [info]    [Symbol(Body internals)]: {
2022-07-21T09:07:01Z app[4bebc390] fra [info]      body: [ReadableStream],
2022-07-21T09:07:01Z app[4bebc390] fra [info]      type: null,
2022-07-21T09:07:01Z app[4bebc390] fra [info]      size: null,
2022-07-21T09:07:01Z app[4bebc390] fra [info]      boundary: null,
2022-07-21T09:07:01Z app[4bebc390] fra [info]      disturbed: false,
2022-07-21T09:07:01Z app[4bebc390] fra [info]      error: null
2022-07-21T09:07:01Z app[4bebc390] fra [info]    },
2022-07-21T09:07:01Z app[4bebc390] fra [info]    [Symbol(Response internals)]: {
2022-07-21T09:07:01Z app[4bebc390] fra [info]      url: 'http://172.19.4.74:8080/',
2022-07-21T09:07:01Z app[4bebc390] fra [info]      status: 500,
2022-07-21T09:07:01Z app[4bebc390] fra [info]      statusText: 'Internal Server Error',
2022-07-21T09:07:01Z app[4bebc390] fra [info]      headers: [Object],
2022-07-21T09:07:01Z app[4bebc390] fra [info]      counter: 0,
2022-07-21T09:07:01Z app[4bebc390] fra [info]      highWaterMark: 16384
2022-07-21T09:07:01Z app[4bebc390] fra [info]    }
2022-07-21T09:07:01Z app[4bebc390] fra [info]  }
2022-07-21T09:07:01Z app[4bebc390] fra [info]}

Strange things is that in parallel to that error message, everything was green in the dashboard, and I was able to query the database using fly proxy.

I had to push a dummy change to redeploy and now it is working. Do you have more information so I understand what was happening? I can’t seem to find any logs that help me understand what happened :confused:

It looks like your app’s node is having trouble connecting to the postgres app. If you were able to query it via fly proxy, I wonder if these HTTP 500 errors were being sent from the replica.

If you run into this again, you might want to start with fly checks list -a app name to make sure that all of the health checks are passing for all members in the cluster. For a failed app instance, you can isolate its logs with fly logs -i instance-id -a app-name

Hi @eli, thank you for the feedback.

At the moment, all checks are passing except this one:

pg   | critical | eb0f85f4   | fra    | HTTP | 1h52m ago    | HTTP GET http://172.19.3.146:5500/flycheck/pg: 500 Internal Server Error Output: "failed to connect to proxy: context deadline exceeded"[✓]

But it doesn’t seem to affect the web app in production.

“I wonder if these HTTP 500 errors were being sent from the replica”

Why would this happen? How can I check and control that?

Additionnally, when the problem occurred, using https://<app name>.fly.dev was working fine :thinking: Like if it was domain related (but it does not make sense…)

Is there a way to capture these events so I can act on them quickly?

FYI eb0f85f4 is the replica allocation, the failed check would make sense in that case I suppose.

Dear fly.io Team,

First of all: Thanks a lot for all the services you provide here. For the last few months, everything was running smoothely. However today, I wanted to run a new deployment of my app and in the migration step of the app, I got the same problem as described above.
1: My running application is not affected:
2. I cannot reach the Database from the migration step:

 2022-07-21T12:20:29.423301Z TRACE mio::poll: deregistering event source from poller
         Error: `could not connect to database`, `error connecting to server: failed to lookup address information: Temporary failure in name resolution`
         Caused by:
             0: error connecting to server: failed to lookup address information: Temporary failure in name resolution
             1: failed to lookup address information: Temporary failure in name resolution
         Main child exited normally with code: 1
         Starting clean up.
Error release command failed, deployment aborted

I retried the deployment multiple times, using two different db FQDNs, but both with the same errors:

  • top2.nearest.of.cptnhook-pg.internal
  • cptnhook-pg.internal (edited after @jerome found a typo. was cpthnhook-pg.internal before and indeed not used)

My “machines” also live in FRA.

Let me know what other information you could need.

Best regards,
Tim

PS: I did deplyments yesterday, which worked like a charm…

Happy to help! Since you mentioned you had no problem interacting with the database via fly proxy at the time, I just was making an informed guess.

It’s hard to say exactly what this 500 means based off that health check alone. Taking a look at the app logs, it looks like they’re reaching it over the .internal tld, so if the problem was with that hostname, then it’s possible that appname.fly.dev’s behavior is unrelated.

Your replica could be returning 500s for a few different reasons, which all seem roughly as likely at the moment. Just within the scope of the app, there could have been an issue specific to the failed instace, like an OOM, or the connection pool running low. Then there’s the postgres app itself, where HAproxy could be sending traffic to the wrong instance, and/or flapping.

To surface any health check failures moving forward, you can set up fly checks handlers to alert for you:

Hey @elbartus, you might have since seen this, but the behavior you’re describing could be driven by our partial API outage. You can subscribe to its status page here:

@eli thanks for your quick reply. According to your posted link to the status page, the API outage should be resolved, right? However I am still experiencing the same issue as described above. Is the outage really resolved or am I misinterpreting the status page?

Best regards,
Tim

That looks like a different issue, unrelated to the our API partial outage.

There’s a typo in the second hostname you provided, but I assume you’re using an environment variable from your release command.

I’ll try to find where your release command was run to figure out if this is an issue with a specific server.

1 Like

FYI: I just tested again right now and the deployment worked succesfully this time.

Hi @eli,

I’m starting to have deployment issue now :confused:

It is still related to the database:

2022-07-21T14:52:39Z runner[4c9298de] fra [info]Starting instance
2022-07-21T14:52:39Z runner[4c9298de] fra [info]Configuring virtual machine
2022-07-21T14:52:39Z runner[4c9298de] fra [info]Pulling container image
2022-07-21T14:52:39Z runner[4c9298de] fra [info]Unpacking image
2022-07-21T14:52:39Z runner[4c9298de] fra [info]Preparing kernel init
2022-07-21T14:52:39Z runner[4c9298de] fra [info]Configuring firecracker
2022-07-21T14:52:39Z runner[4c9298de] fra [info]Starting virtual machine
2022-07-21T14:52:40Z app[4c9298de] fra [info]Starting init (commit: c86b3dc)...
2022-07-21T14:52:40Z app[4c9298de] fra [info]Setting up swapspace version 1, size = 512 MiB (536866816 bytes)
2022-07-21T14:52:40Z app[4c9298de] fra [info]no label, UUID=6014f363-8400-**********
2022-07-21T14:52:40Z app[4c9298de] fra [info]Preparing to run: `docker-entrypoint.sh npx prisma migrate deploy` as root
2022-07-21T14:52:40Z app[4c9298de] fra [info]2022/07/21 14:52:40 listening on [fdaa:0:7502:a7b:a992:4c92:98de:2]:22 (DNS: [fdaa::3]:53)
2022-07-21T14:52:41Z app[4c9298de] fra [info]Prisma schema loaded from prisma/schema.prisma
2022-07-21T14:52:41Z app[4c9298de] fra [info]Datasource "db": PostgreSQL database "h**********_prod", schema "public" at "top2.nearest.of.h**********-prod-db.internal:5432"
2022-07-21T14:52:42Z app[4c9298de] fra [info]Error: P1001: Can't reach database server at `top2.nearest.of.h**********-prod-db.internal`:`5432`
2022-07-21T14:52:42Z app[4c9298de] fra [info]Please make sure your database server is running at `top2.nearest.of.h**********-prod-db.internal`:`5432`.
2022-07-21T14:52:43Z app[4c9298de] fra [info]Main child exited normally with code: 1
2022-07-21T14:52:43Z app[4c9298de] fra [info]Starting clean up.

The current deployed app is working properly which should prove the database is running.

Is there something wrong with my current database setup? Or is it related to the other issue mentioned by @elbartus?

hi again-- we’ve been digging into this on our end and I think we were able to get things back to normal with your pg app. We made sure that every instance there had a working .internal hostname. Taken together with the logs you’ve most recently posted (thank you!), we expect that a redeploy will succeed. Could you let us know how that goes? We’re always glad to help however we can.

:crossed_fingers: that this does the trick!

That did the trick :wink:

Is there something I could have done to avoid that? Always eager to learn and get better :wink:

Possibly, but I can’t imagine what it would be! This was definitely something broken on our end which affected a small proportion of fairly arbitrary .internal domains.

I’m glad to hear that things are running smoothly again— feel free to respond if you have any other questions questions we can help with !

1 Like

Hi @eli,

Just to let you know I faced the same error during a deploy:

	 Error: P1001: Can't reach database server at `top2.nearest.of.h*******-prod-db.internal`:`5432`
	 Please make sure your database server is running at `top2.nearest.of.h*******-prod-db.internal`:`5432`.

Thankfully, re-running the deploy seemed to work. Just wanted to highlight this temporary issue. Perhaps something should be investigated further?

1 Like

Hi, Is all of related issue has been fixed?

Yup, we’re pretty sure that the hostname failures we saw in FRA on Thursday have been taken care of. This general type of error can be host-specific, though. If you’re having issues right now, would you mind posting the error message you’re running into, alongside a short description of the problem with your db app? We’re always happy to take a look!

1 Like