private network down.

erdo · June 2, 2022, 4:33pm

Hi there, we have 3 instances on Fly, one of which is PostgreSQL. Starting Monday, the private network has been becoming unreachable and our app throws the error below:


2022-06-02T16:28:24.647 app[f3cc9f5e] ams [info] {"level":50,"time":1654187304647,"pid":537,"hostname":"f3cc9f5e","err":{"type":"Error","message":"getaddrinfo ENOTFOUND *-pg-prod.internal","stack":"Error: getaddrinfo ENOTFOUND *-pg-prod.internal\n    at GetAddrInfoReqWrap.onlookup [as oncomplete] (node:dns:83:26)","errno":-3008,"code":"ENOTFOUND","syscall":"getaddrinfo","hostname":"*-pg-prod.internal"},"msg":"getaddrinfo ENOTFOUND *-pg-prod.internal"}

Plus, our deployments have started to fail with:

Error failed to fetch an image or build from source: error rendering push status stream: name unknown: app repository not found

Is anyone else getting similar errors?

erdo · June 2, 2022, 4:39pm

2022-06-02T16:10:56.238 app[32bb460f] ams [info] proxy | [NOTICE] 152/161056 (540) : haproxy version is 2.2.9-2+deb11u2

2022-06-02T16:10:56.238 app[32bb460f] ams [info] proxy | [NOTICE] 152/161056 (540) : path to executable is /usr/sbin/haproxy

2022-06-02T16:10:56.238 app[32bb460f] ams [info] proxy | [ALERT] 152/161056 (540) : Current worker #1 (563) exited with code 130 (Interrupt)

2022-06-02T16:10:56.238 app[32bb460f] ams [info] proxy | [WARNING] 152/161056 (540) : All workers exited. Exiting… (130)

2022-06-02T16:10:56.238 app[32bb460f] ams [info] keeper | 2022-06-02 16:10:56.237 UTC [591] LOG: received fast shutdown request

2022-06-02T16:10:56.238 app[32bb460f] ams [info] proxy | exit status 130

2022-06-02T16:10:56.239 app[32bb460f] ams [info] keeper | 2022-06-02 16:10:56.238 UTC [591] LOG: aborting any active transactions

2022-06-02T16:10:56.242 app[32bb460f] ams [info] keeper | 2022-06-02 16:10:56.241 UTC [591] LOG: background worker “logical replication launcher” (PID 599) exited with exit code 1

2022-06-02T16:10:56.243 app[32bb460f] ams [info] keeper | 2022-06-02T16:10:56.243Z INFO postgresql/postgresql.go:384 stopping database

2022-06-02T16:10:56.245 app[32bb460f] ams [info] keeper | waiting for server to shut down…2022-06-02 16:10:56.245 UTC [7947] FATAL: terminating connection due to administrator command

2022-06-02T16:10:56.247 app[32bb460f] ams [info] keeper | 2022-06-02 16:10:56.246 UTC [594] LOG: shutting down

2022-06-02T16:10:56.257 app[32bb460f] ams [info] keeper | 2022-06-02 16:10:56.256 UTC [591] LOG: database system is shut down

2022-06-02T16:10:56.344 app[32bb460f] ams [info] keeper | done

2022-06-02T16:10:56.344 app[32bb460f] ams [info] keeper | server stopped

2022-06-02T16:10:56.347 app[32bb460f] ams [info] keeper | Process exited 0

2022-06-02T16:10:57.232 app[32bb460f] ams [info] Main child exited normally with code: 0

rculver · June 2, 2022, 5:02pm

I have the same issue

bkspace · June 2, 2022, 5:02pm

Getting a lot of networking errors now as well. Production down for us.

eli · June 2, 2022, 5:06pm

Thank you for letting us know! We’re looking into this on our end and will keep you updated

bkspace · June 2, 2022, 5:08pm

Cheers Eli, please update as soon as possible. If this is a big issue, we’ll need to point our production back to Google Cloud - ideally we don’t need to!

We’re getting hundreds to thousands of these. The endpoint does still work some of the time, so the failures seem frequent but don’t hit every request:

FetchError: request to https://stackable-backend-prod.fly.dev/api/stack failed, reason: getaddrinfo EAI_AGAIN stackable-backend-prod.fly.dev\n at ClientRequest.<anonymous> (/usr/src/app/node_modules/next/dist/compiled/node-fetch/index.js:1:64142)\n
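
For anyone hitting the same intermittent errors, here is a minimal sketch of retrying a call when it fails with one of the DNS error codes seen in this thread (EAI_AGAIN, ENOTFOUND). The helper names (isTransientDnsError, backoffDelayMs, withDnsRetry) and the retry limits are made up for illustration, and treating ENOTFOUND as retryable is an assumption that only makes sense while the resolver itself is flaky. It will not help if DNS is down for an extended period.

```typescript
// Error codes reported in this thread. Treating both as retryable is an
// assumption: ENOTFOUND normally means the name really does not exist.
const TRANSIENT_DNS_CODES = new Set(["EAI_AGAIN", "ENOTFOUND"]);

// True when the error carries one of the DNS codes above in `code`.
// Assumption: the error exposes `code` directly, as Node system errors do.
// Some HTTP clients put the underlying error on `err.cause` instead.
function isTransientDnsError(err: unknown): boolean {
  const code = (err as { code?: unknown } | null)?.code;
  return typeof code === "string" && TRANSIENT_DNS_CODES.has(code);
}

// Exponential backoff: baseMs, 2*baseMs, 4*baseMs, ... capped at capMs.
function backoffDelayMs(attempt: number, baseMs = 200, capMs = 5000): number {
  return Math.min(capMs, baseMs * 2 ** attempt);
}

// Runs fn, retrying only on transient DNS errors, up to maxAttempts tries.
// Any other error, or the last failed attempt, is rethrown unchanged.
async function withDnsRetry<T>(fn: () => Promise<T>, maxAttempts = 5): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (!isTransientDnsError(err) || attempt + 1 >= maxAttempts) throw err;
      const delay = backoffDelayMs(attempt);
      await new Promise<void>((resolve) => setTimeout(resolve, delay));
    }
  }
}
```

Usage would look something like `await withDnsRetry(() => fetch(url))`, where `url` is whatever endpoint you are calling. The retries are capped so that a real outage still surfaces as an error instead of hanging requests indefinitely.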

eli · June 2, 2022, 5:10pm

Yes, absolutely! We just posted a status page update: https://status.flyio.net/ where you can follow our progress

artyroip · June 2, 2022, 5:13pm

It seems deployments via the command: ‘flyctl deploy’ also don’t work.

Elder · June 2, 2022, 10:06pm

I had one Postgres instance disappear today from a two-instance cluster, also in the ams region. It wasn’t recovered automatically.

ignoramous · June 4, 2022, 6:19am

The June 3 incident report is so terse and opaque (featuring an incomplete sentence, even). The outage didn’t seem minor and has probably affected internal DNS / 6pn, and logging. At least one user here reported a lost db instance (though, it is uncertain if the incident played a part).

June 3, 2022

Logging infrastructure issues

Identified - The issue has been identified and a fix is being implemented.
Jun 2, 16:39 CDT

Investigating - We are investigating logging failures. These are preventing logs from reaching log shipper apps, as well as
Jun 2, 15:30 CDT

versus

Jun 2, 2022

Deployment issues

Update - We are continuing to monitor for any further issues.
Jun 2, 13:12 CDT

Monitoring - We restored the Consul cluster and are monitoring individual services. Most service is restored, but some lingering issues may persist.
Jun 2, 12:30 CDT

Identified - Our Consul servers are in out-of-memory loops. We are working to recover them. Consul issues can cause problems with internal DNS resolution, deploys, and API access. We are working to recover the servers.
Jun 2, 12:19 CDT

michaelfrieze · June 4, 2022, 11:35am

I still cannot deploy for some reason. I keep getting:
Error failed to fetch an image or build from source: executing lifecycle: failed with status code: 51

I am using: flyctl deploy --remote-only and it was working a few days ago.

jerome · June 4, 2022, 4:55pm

Can you show us a bit more logs?
