Private network down

Hi there, we have 3 instances on Fly, one of which is PostgreSQL. Since Monday, the private network keeps becoming unreachable and the app throws the error below:


2022-06-02T16:28:24.647 app[f3cc9f5e] ams [info] {"level":50,"time":1654187304647,"pid":537,"hostname":"f3cc9f5e","err":{"type":"Error","message":"getaddrinfo ENOTFOUND *-pg-prod.internal","stack":"Error: getaddrinfo ENOTFOUND *-pg-prod.internal\n    at GetAddrInfoReqWrap.onlookup [as oncomplete] (node:dns:83:26)","errno":-3008,"code":"ENOTFOUND","syscall":"getaddrinfo","hostname":"*-pg-prod.internal"},"msg":"getaddrinfo ENOTFOUND *-pg-prod.internal"}

Plus, our deployments have started to fail with:

Error failed to fetch an image or build from source: error rendering push status stream: name unknown: app repository not found

Is anyone else getting similar errors?

2022-06-02T16:10:56.238 app[32bb460f] ams [info] proxy | [NOTICE] 152/161056 (540) : haproxy version is 2.2.9-2+deb11u2

2022-06-02T16:10:56.238 app[32bb460f] ams [info] proxy | [NOTICE] 152/161056 (540) : path to executable is /usr/sbin/haproxy

2022-06-02T16:10:56.238 app[32bb460f] ams [info] proxy | [ALERT] 152/161056 (540) : Current worker #1 (563) exited with code 130 (Interrupt)

2022-06-02T16:10:56.238 app[32bb460f] ams [info] proxy | [WARNING] 152/161056 (540) : All workers exited. Exiting… (130)

2022-06-02T16:10:56.238 app[32bb460f] ams [info] keeper | 2022-06-02 16:10:56.237 UTC [591] LOG: received fast shutdown request

2022-06-02T16:10:56.238 app[32bb460f] ams [info] proxy | exit status 130

2022-06-02T16:10:56.239 app[32bb460f] ams [info] keeper | 2022-06-02 16:10:56.238 UTC [591] LOG: aborting any active transactions

2022-06-02T16:10:56.242 app[32bb460f] ams [info] keeper | 2022-06-02 16:10:56.241 UTC [591] LOG: background worker "logical replication launcher" (PID 599) exited with exit code 1

2022-06-02T16:10:56.243 app[32bb460f] ams [info] keeper | 2022-06-02T16:10:56.243Z INFO postgresql/postgresql.go:384 stopping database

2022-06-02T16:10:56.245 app[32bb460f] ams [info] keeper | waiting for server to shut down…2022-06-02 16:10:56.245 UTC [7947] FATAL: terminating connection due to administrator command

2022-06-02T16:10:56.247 app[32bb460f] ams [info] keeper | 2022-06-02 16:10:56.246 UTC [594] LOG: shutting down

2022-06-02T16:10:56.257 app[32bb460f] ams [info] keeper | 2022-06-02 16:10:56.256 UTC [591] LOG: database system is shut down

2022-06-02T16:10:56.344 app[32bb460f] ams [info] keeper | done

2022-06-02T16:10:56.344 app[32bb460f] ams [info] keeper | server stopped

2022-06-02T16:10:56.347 app[32bb460f] ams [info] keeper | Process exited 0

2022-06-02T16:10:57.232 app[32bb460f] ams [info] Main child exited normally with code: 0
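
For anyone triaging similar logs: a minimal sketch of pulling the DNS error code and hostname out of a JSON log line like the ENOTFOUND one quoted earlier in this post. It assumes the layout seen there (a text prefix followed by a single JSON object with an "err" field). The function name parse_fly_log_error is made up for illustration and is not part of flyctl or any library.

```python
import json


def parse_fly_log_error(line):
    """Return (code, hostname) from a log line carrying a JSON payload, else None.

    Assumes the layout seen in this thread: a text prefix, then one JSON object.
    """
    start = line.find("{")
    if start == -1:
        return None  # plain-text line, e.g. the keeper/proxy output
    try:
        payload = json.loads(line[start:])
    except json.JSONDecodeError:
        return None  # looked like JSON but was not parseable
    err = payload.get("err")
    if not isinstance(err, dict):
        return None  # JSON line without an error object
    return err.get("code"), err.get("hostname")
```

Running this over a log dump and counting the returned codes gives a rough picture of how many failures are ENOTFOUND on .internal names versus other errors.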

I have the same issue

Getting a lot of networking errors now as well. Production down for us.

Thank you for letting us know! We’re looking into this on our end and will keep you updated

Cheers Eli, please update as soon as possible. If this is a big issue, we’ll need to point our production back to Google Cloud - ideally we don’t need to!

We’re getting hundreds to thousands of these, but the endpoint also works some of the time, so the failures seem frequent but don’t hit every request:

FetchError: request to https://stackable-backend-prod.fly.dev/api/stack failed, reason: getaddrinfo EAI_AGAIN stackable-backend-prod.fly.dev\n at ClientRequest. (/usr/src/app/node_modules/next/dist/compiled/node-fetch/index.js:1:64142)\n
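
Since EAI_AGAIN marks a temporary resolver failure, one possible stopgap on the client side is to retry the lookup with backoff. Below is a minimal sketch in Python (the app in this thread is Node, so treat it as an illustration of the idea rather than a drop-in fix). The function name, attempt count, and delays are assumptions, and it won’t help with a permanent failure such as EAI_NONAME, which Node reports as ENOTFOUND.

```python
import socket
import time

# Only the "temporary failure in name resolution" error is treated as retryable.
TRANSIENT_ERRNOS = {socket.EAI_AGAIN}


def resolve_with_retry(host, port=443, attempts=4, base_delay=0.2,
                       resolver=socket.getaddrinfo, sleep=time.sleep):
    """Resolve host, retrying transient DNS failures with exponential backoff.

    resolver and sleep are injectable so the logic can be tested offline.
    """
    for attempt in range(attempts):
        try:
            return resolver(host, port)
        except socket.gaierror as exc:
            # Permanent failures and the final attempt are re-raised unchanged.
            if exc.errno not in TRANSIENT_ERRNOS or attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)  # 0.2s, 0.4s, 0.8s, ...
```

A call such as resolve_with_retry("stackable-backend-prod.fly.dev") would wait a little longer after each transient failure before giving up. This only masks the symptom while the underlying DNS problem is being fixed.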

Yes, absolutely! We just posted a status page update: https://status.flyio.net/ where you can follow our progress

It seems deployments via the ‘flyctl deploy’ command don’t work either.

I had one Postgres instance disappear today from a two-instance cluster, also in the ams region. It didn’t get recovered automatically.

The June 3 incident report is terse and opaque (it even features an incomplete sentence). The outage didn’t seem minor and probably affected internal DNS / 6PN and logging. At least one user here reported a lost db instance (though it is uncertain whether the incident played a part).


June 3, 2022

Logging infrastructure issues

Identified - The issue has been identified and a fix is being implemented.
Jun 2, 16:39 CDT

Investigating - We are investigating logging failures. These are preventing logs from reaching log shipper apps, as well as
Jun 2, 15:30 CDT

versus

Jun 2, 2022

Deployment issues

Update - We are continuing to monitor for any further issues.
Jun 2, 13:12 CDT

Monitoring - We restored the Consul cluster and are monitoring individual services. Most service is restored, but some lingering issues may persist.
Jun 2, 12:30 CDT

Identified - Our Consul servers are in out-of-memory loops. We are working to recover them. Consul issues can cause problems with internal DNS resolution, deploys, and API access. We are working to recover the servers.
Jun 2, 12:19 CDT

I still cannot deploy for some reason. I keep getting:
Error failed to fetch an image or build from source: executing lifecycle: failed with status code: 51

I am using: flyctl deploy --remote-only and it was working a few days ago.

Can you show us a bit more of the logs?