We’re bringing PG clusters back up now. It’s going to take 30+ minutes for all of them to settle (assuming the Consul issue we think we fixed is actually fixed).
Coming back to the original issue: I still have connection issues from time to time. It happened several times today. As soon as it happens again I’ll check the logs and let you know the details.
@mathiasn will you run fly status -a <db-name>
and see if it’s been restarting? If you have random connection errors, it might mean that postgres is dying + restarting. This could happen because of OOMs, or potentially because the VM is failing health checks.
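While the underlying restarts get diagnosed, the app side can at least tolerate a dying-and-restarting Postgres. A minimal retry-with-backoff sketch (generic Python; the `connect` callable stands in for whatever driver call the app actually uses, e.g. `psycopg2.connect(...)` — that wiring is an assumption, not shown in this thread):

```python
import time

def with_retries(connect, attempts=5, base_delay=0.5):
    """Call `connect` with exponential backoff.

    Useful when the DB VM is restarting and connections fail
    transiently: retry a few times before giving up.
    """
    for attempt in range(attempts):
        try:
            return connect()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of attempts; surface the error
            time.sleep(base_delay * (2 ** attempt))  # 0.5s, 1s, 2s, ...
```

In practice you’d pass `lambda: psycopg2.connect(dsn)` (or your driver’s equivalent) and catch that driver’s own connection exception instead of the built-in `ConnectionError`.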
You can also run fly status -a <db-name> --all
to see if there are any failed Postgres VMs.
Yeah, I checked that, but that wasn’t the problem. Here’s what I see in my logs:
2021-10-04T16:28:32.607314094Z app[c3991336] fra [info] keeper | 2021-10-04T16:28:32.607Z INFO cmd/keeper.go:1703 postgres hba entries not changed
2021-10-04T16:28:34.372375382Z app[c3991336] fra [info] sentinel | 2021-10-04T16:28:34.371Z INFO cmd/sentinel.go:94 sentinel leadership lost
2021-10-04T16:28:34.613629365Z app[c3991336] fra [info] sentinel | 2021-10-04T16:28:34.611Z ERROR cmd/sentinel.go:102 election loop error {"error": "Unexpected response code: 500 (rpc error getting client: failed to get conn: rpc error: lead thread didn't get connection)"}
2021-10-04T16:28:44.614824417Z app[c3991336] fra [info] sentinel | 2021-10-04T16:28:44.613Z INFO cmd/sentinel.go:82 Trying to acquire sentinels leadership
2021-10-04T16:28:45.248003267Z app[c3991336] fra [info] sentinel | 2021-10-04T16:28:45.247Z ERROR cmd/sentinel.go:1843 error retrieving cluster data {"error": "Unexpected response code: 500"}
2021-10-04T16:28:45.287357763Z app[c3991336] fra [info] keeper | 2021-10-04T16:28:45.283Z ERROR cmd/keeper.go:1042 error retrieving cluster data {"error": "Unexpected response code: 500"}
2021-10-04T16:28:45.414813784Z app[c3991336] fra [info] proxy | 2021-10-04T16:28:45.414Z INFO cmd/proxy.go:346 check function error {"error": "cannot get cluster data: Unexpected response code: 500"}
2021-10-04T16:28:45.629651205Z app[c3991336] fra [info] proxy | 2021-10-04T16:28:45.629Z INFO cmd/proxy.go:304 check timeout timer fired
2021-10-04T16:28:45.631443594Z app[c3991336] fra [info] proxy | 2021-10-04T16:28:45.629Z INFO cmd/proxy.go:158 Stopping listening
2021-10-04T16:28:45.638706631Z app[c3991336] fra [info] keeper | 2021-10-04 16:28:45.637 UTC [15334] LOG: unexpected EOF on client connection with an open transaction
2021-10-04T16:28:50.602536023Z app[c3991336] fra [info] keeper | 2021-10-04T16:28:50.602Z INFO cmd/keeper.go:1505 our db requested role is master
2021-10-04T16:28:50.603381998Z app[c3991336] fra [info] keeper | 2021-10-04T16:28:50.603Z INFO cmd/keeper.go:1543 already master
2021-10-04T16:28:50.622795592Z app[c3991336] fra [info] keeper | 2021-10-04T16:28:50.622Z INFO cmd/keeper.go:1676 postgres parameters not changed
2021-10-04T16:28:50.623014301Z app[c3991336] fra [info] keeper | 2021-10-04T16:28:50.622Z INFO cmd/keeper.go:1703 postgres hba entries not changed
2021-10-04T16:28:50.877236098Z app[c3991336] fra [info] proxy | 2021-10-04T16:28:50.876Z INFO cmd/proxy.go:124 Starting proxying
2021-10-04T16:28:50.879662303Z app[c3991336] fra [info] proxy | 2021-10-04T16:28:50.879Z INFO cmd/proxy.go:268 master address {"address": "[fdaa:0:2d8e:a7b:67:0:2457:2]:5433"}
2021-10-04T16:28:51.274763461Z app[c3991336] fra [info] proxy | 2021-10-04T16:28:51.274Z INFO cmd/proxy.go:286 proxying to master address {"address": "[fdaa:0:2d8e:a7b:67:0:2457:2]:5433"}
2021-10-04T16:28:51.910751156Z app[c3991336] fra [info] sentinel | 2021-10-04T16:28:51.910Z INFO cmd/sentinel.go:89 sentinel leadership acquired
2021-10-04T16:28:52.615550171Z app[c3991336] fra [info] keeper | 2021-10-04 16:28:52.614 UTC [20923] LOG: PID 15338 in cancel request did not match any process
2021-10-04T16:28:56.398172558Z app[c3991336] fra [info] proxy | 2021-10-04T16:28:56.397Z INFO cmd/proxy.go:268 master address {"address": "[fdaa:0:2d8e:a7b:67:0:2457:2]:5433"}
2021-10-04T16:28:56.728869510Z app[c3991336] fra [info] proxy | 2021-10-04T16:28:56.728Z INFO cmd/proxy.go:286 proxying to master address {"address": "[fdaa:0:2d8e:a7b:67:0:2457:2]:5433"}
2021-10-04T16:29:00.955099000Z app[c3991336] fra [info] keeper | 2021-10-04T16:29:00.954Z INFO cmd/keeper.go:1505 our db requested role is master
2021-10-04T16:29:00.955828456Z app[c3991336] fra [info] keeper | 2021-10-04T16:29:00.955Z INFO cmd/keeper.go:1543 already master
2021-10-04T16:29:00.972456457Z app[c3991336] fra [info] keeper | 2021-10-04T16:29:00.972Z INFO cmd/keeper.go:1703 postgres hba entries not changed
2021-10-04T16:29:01.886192529Z app[c3991336] fra [info] proxy | 2021-10-04T16:29:01.885Z INFO cmd/proxy.go:268 master address {"address": "[fdaa:0:2d8e:a7b:67:0:2457:2]:5433"}
2021-10-04T16:29:02.009029488Z app[c3991336] fra [info] sentinel | 2021-10-04T16:29:02.008Z INFO cmd/sentinel.go:82 Trying to acquire sentinels leadership
2021-10-04T16:29:02.013667831Z app[c3991336] fra [info] sentinel | github.com/superfly/leadership.(*Candidate).initLock(0xc00013f420, 0x0, 0x0, 0x0, 0x0)
2021-10-04T16:29:02.013688101Z app[c3991336] fra [info] sentinel | /go/pkg/mod/github.com/superfly/leadership@v0.2.0/candidate.go:120 +0xc5
2021-10-04T16:29:02.013695461Z app[c3991336] fra [info] sentinel | created by github.com/superfly/leadership.(*Candidate).RunForElection
2021-10-04T16:29:02.019037149Z app[c3991336] fra [info] sentinel | exit status 2
2021-10-04T16:29:02.019051229Z app[c3991336] fra [info] supervisor stopping
2021-10-04T16:29:02.032714308Z app[c3991336] fra [info] keeper | 2021-10-04 16:29:02.031 UTC [20933] FATAL: terminating connection due to administrator command
2021-10-04T16:29:02.035267053Z app[c3991336] fra [info] keeper | 2021-10-04 16:29:02.034 UTC [20927] FATAL: terminating connection due to administrator command
2021-10-04T16:29:02.036928923Z app[c3991336] fra [info] keeper | 2021-10-04 16:29:02.036 UTC [20925] FATAL: terminating connection due to administrator command
2021-10-04T16:29:02.048222135Z app[c3991336] fra [info] keeper | waiting for server to shut down....2021-10-04 16:29:02.048 UTC [601] LOG: database system is shut down
2021-10-04T16:29:02.149057075Z app[c3991336] fra [info] keeper | server stopped
2021-10-04T16:29:02.158022202Z app[c3991336] fra [info] keeper | signal: interrupt
2021-10-04T16:29:02.570294779Z app[c3991336] fra [info] Main child exited normally with code: 0
2021-10-04T16:29:02.571891309Z app[c3991336] fra [info] Starting clean up.
2021-10-04T16:29:02.587783295Z app[c3991336] fra [info] Umounting /dev/vdc from /data
UPDATE: Postgres is not coming back up :-/
2021-10-04T16:30:25.422012886Z runner[c3991336] fra [info] Starting instance
2021-10-04T16:30:25.461124774Z runner[c3991336] fra [info] Configuring virtual machine
2021-10-04T16:30:25.464100656Z runner[c3991336] fra [info] Pulling container image
2021-10-04T16:30:26.205737502Z runner[c3991336] fra [info] Unpacking image
2021-10-04T16:30:26.216660758Z runner[c3991336] fra [info] Preparing kernel init
2021-10-04T16:30:26.363695085Z runner[c3991336] fra [info] Setting up volume 'pg_data'
2021-10-04T16:30:26.577683944Z runner[c3991336] fra [info] Configuring firecracker
2021-10-04T16:30:56.139020719Z runner[c3991336] fra [info] Starting instance
2021-10-04T16:35:01.266231914Z runner[cd1f75b0] fra [info] Starting instance
The log basically stops here.
Update: 2021-10-04 18:44 CEST: It’s back
Having the same issues. The Postgres instance becomes unavailable from time to time, causing the app to die. I set up an external uptime monitor a while back, and this is how the results look:
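The monitor setup itself isn’t shown in the thread, but a crude reachability probe along these lines is enough to catch the outages described above (a minimal sketch; the host/port values would be whatever your DB exposes, e.g. `<db-name>.internal:5432` — that’s an assumption):

```python
import socket

def tcp_check(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) succeeds
    within `timeout` seconds -- a crude 'is the DB reachable?' probe."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Run it from cron (or an external monitor) and log the result with a timestamp; gaps in the `True`s line up with the failed-instance windows in the `fly status --all` output below.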
Here’s the result of that command:
Instances
ID TASK VERSION REGION DESIRED STATUS HEALTH CHECKS RESTARTS CREATED
7ecbb18c app 7 ⇡ sin run running (leader) 3 total, 3 passing 0 4h33m ago
500ef53b app 7 ⇡ sin stop failed 0 4h37m ago
5b78f730 app 7 ⇡ sin stop failed 0 4h39m ago
3d40f194 app 7 ⇡ sin stop failed 0 4h40m ago
ee951a50 app 7 ⇡ sin stop failed 0 4h41m ago
98adb042 app 7 ⇡ sin stop failed 3 total, 3 critical 0 2021-10-03T19:50:29Z
8ca6c001 app 6 sin stop failed 0 2021-10-03T18:50:22Z
49b10ddd app 6 sin stop failed 0 2021-10-03T18:37:10Z
31ab093c app 5 sin stop failed 0 2021-10-03T17:50:28Z
a306c3e3 app 5 sin stop failed 0 2021-10-03T17:37:19Z
c430a194 app 5 sin stop failed 0 2021-10-03T17:20:02Z
02596840 app 5 sin stop failed 0 2021-10-03T17:03:23Z
307767e8 app 5 sin stop failed 0 2021-10-03T16:54:49Z
2d413ce5 app 5 sin stop failed 0 2021-10-03T16:50:48Z
e62b4fce app 5 sin stop failed 0 2021-10-03T16:47:14Z
199311b9 app 5 sin stop failed 0 2021-10-03T16:46:44Z
5821abeb app 5 sin stop failed 3 total, 1 passing, 2 critical 2 2021-10-03T16:41:24Z
1521ab66 app 5 sin stop failed 0 2021-10-03T16:39:59Z
b59b36cf app 4 sin stop failed 0 2021-10-03T16:36:40Z
f7969e46 app 4 sin stop failed 0 2021-10-03T16:04:48Z
851f5d97 app 4 sin stop failed 0 2021-10-03T15:48:19Z
90570902 app 4 sin stop failed 0 2021-10-03T15:40:24Z
742a4728 app 4 sin stop failed 0 2021-10-03T15:35:56Z
654b7404 app 4 sin stop failed 0 2021-10-03T15:33:35Z
5642d15f app 4 sin stop failed 0 2021-10-03T15:32:23Z
6e2eab61 app 4 sin stop failed 0 2021-10-03T15:31:39Z
77648b94 app 4 sin stop failed 3 total, 3 critical 1 2021-09-26T18:06:18Z
Note that I did not push a newer version/image of the DB.
It just happened again… @kurt
fly status --all
shows
Instances
ID TASK VERSION REGION DESIRED STATUS HEALTH CHECKS RESTARTS CREATED
e7ad6e45 app 3 fra run running (leader) 3 total, 3 passing 0 23h58m ago
cd1f75b0 app 3 fra stop failed 0 2021-10-04T16:35:11Z
c3991336 app 3 fra stop failed 3 total, 1 passing, 2 critical 1 2021-10-03T17:26:22Z
801adfdd app 3 fra stop failed 3 total, 2 passing, 1 critical 3 2021-10-03T17:10:36Z
4deb1dca app 3 fra stop failed 3 total, 1 passing, 2 critical 3 2021-10-03T17:00:24Z
e4da363d app 3 fra stop failed 3 total, 1 passing, 2 critical 4 2021-10-03T16:46:26Z
452e7fe8 app 3 fra stop failed 3 total, 1 passing, 2 critical 7 2021-09-29T15:22:43Z
So that’s not the problem.
We’re rolling out new Consul services in various regions, and we think this will help. Give us some more time to dig into the exact issues here, but I’m hoping we have a better fix later this afternoon.
And it’s happening again
2021-10-07T16:02:20.907933950Z app[a8319d13] fra [info] keeper | 2021-10-07T16:02:20.907Z INFO cmd/keeper.go:1505 our db requested role is master
2021-10-07T16:02:20.908999264Z app[a8319d13] fra [info] keeper | 2021-10-07T16:02:20.908Z INFO cmd/keeper.go:1543 already master
2021-10-07T16:02:20.929239637Z app[a8319d13] fra [info] keeper | 2021-10-07T16:02:20.929Z INFO cmd/keeper.go:1676 postgres parameters not changed
2021-10-07T16:02:20.929371297Z app[a8319d13] fra [info] keeper | 2021-10-07T16:02:20.929Z INFO cmd/keeper.go:1703 postgres hba entries not changed
2021-10-07T16:02:21.169834105Z app[a8319d13] fra [info] proxy | 2021-10-07T16:02:21.169Z INFO cmd/proxy.go:304 check timeout timer fired
2021-10-07T16:02:21.169860804Z app[a8319d13] fra [info] proxy | 2021-10-07T16:02:21.169Z INFO cmd/proxy.go:158 Stopping listening
2021-10-07T16:02:32.101632298Z app[a8319d13] fra [info] keeper | 2021-10-07T16:02:32.101Z INFO cmd/keeper.go:1505 our db requested role is master
2021-10-07T16:02:32.102407174Z app[a8319d13] fra [info] keeper | 2021-10-07T16:02:32.102Z INFO cmd/keeper.go:1543 already master
2021-10-07T16:02:32.118596240Z app[a8319d13] fra [info] keeper | 2021-10-07T16:02:32.118Z INFO cmd/keeper.go:1676 postgres parameters not changed
2021-10-07T16:02:32.118780259Z app[a8319d13] fra [info] keeper | 2021-10-07T16:02:32.118Z INFO cmd/keeper.go:1703 postgres hba entries not changed
2021-10-07T16:02:35.703068136Z app[a8319d13] fra [info] proxy | 2021-10-07T16:02:35.702Z INFO cmd/proxy.go:286 proxying to master address {"address": "[fdaa:0:2d8e:a7b:67:0:2457:2]:5433"}
2021-10-07T16:02:43.297418985Z app[a8319d13] fra [info] keeper | 2021-10-07T16:02:43.297Z INFO cmd/keeper.go:1505 our db requested role is master
2021-10-07T16:02:43.298335980Z app[a8319d13] fra [info] keeper | 2021-10-07T16:02:43.298Z INFO cmd/keeper.go:1543 already master
2021-10-07T16:02:43.321491617Z app[a8319d13] fra [info] keeper | 2021-10-07T16:02:43.321Z INFO cmd/keeper.go:1676 postgres parameters not changed
2021-10-07T16:02:43.321680596Z app[a8319d13] fra [info] keeper | 2021-10-07T16:02:43.321Z INFO cmd/keeper.go:1703 postgres hba entries not changed
2021-10-07T16:02:46.951184079Z app[a8319d13] fra [info] proxy | 2021-10-07T16:02:46.950Z INFO cmd/proxy.go:124 Starting proxying
2021-10-07T16:02:46.953021309Z app[a8319d13] fra [info] proxy | 2021-10-07T16:02:46.952Z INFO cmd/proxy.go:268 master address {"address": "[fdaa:0:2d8e:a7b:67:0:2457:2]:5433"}
2021-10-07T16:02:50.703502541Z app[a8319d13] fra [info] proxy | 2021-10-07T16:02:50.702Z INFO cmd/proxy.go:304 check timeout timer fired
2021-10-07T16:02:50.703539851Z app[a8319d13] fra [info] proxy | 2021-10-07T16:02:50.703Z INFO cmd/proxy.go:158 Stopping listening
2021-10-07T16:02:54.494465587Z app[a8319d13] fra [info] keeper | 2021-10-07T16:02:54.494Z INFO cmd/keeper.go:1505 our db requested role is master
2021-10-07T16:02:54.495482691Z app[a8319d13] fra [info] keeper | 2021-10-07T16:02:54.495Z INFO cmd/keeper.go:1543 already master
2021-10-07T16:02:54.517188336Z app[a8319d13] fra [info] keeper | 2021-10-07T16:02:54.516Z INFO cmd/keeper.go:1676 postgres parameters not changed
2021-10-07T16:02:54.517608534Z app[a8319d13] fra [info] keeper | 2021-10-07T16:02:54.517Z INFO cmd/keeper.go:1703 postgres hba entries not changed
2021-10-07T16:03:05.343635336Z app[a8319d13] fra [info] proxy | 2021-10-07T16:03:05.343Z INFO cmd/proxy.go:286 proxying to master address {"address": "[fdaa:0:2d8e:a7b:67:0:2457:2]:5433"}
check timeout timer fired, Stopping listening…
And later
2021-10-07T16:04:43.670033576Z app[a8319d13] fra [info] keeper | 2021-10-07T16:04:43.669Z INFO cmd/keeper.go:1094 our db boot UID is different than the cluster data one, waiting for it to be updated {"bootUUID": "edd638f0-b719-4ac9-a1b8-a6183d405b20", "clusterBootUUID": "040d9b75-7aff-4f4d-b726-6e37092c6756"}
2021-10-07T16:04:45.293758097Z app[a8319d13] fra [info] keeper | 2021-10-07T16:04:45.293Z ERROR cmd/keeper.go:720 cannot get configured pg parameters {"error": "dial unix /tmp/.s.PGSQL.5433: connect: no such file or directory"}
2021-10-07T16:04:47.794762575Z app[a8319d13] fra [info] keeper | 2021-10-07T16:04:47.794Z ERROR cmd/keeper.go:720 cannot get configured pg parameters {"error": "dial unix /tmp/.s.PGSQL.5433: connect: no such file or directory"}
2021-10-07T16:04:49.925324436Z app[a8319d13] fra [info] keeper is healthy, db is healthy, role: master
2021-10-07T16:04:50.295569268Z app[a8319d13] fra [info] keeper | 2021-10-07T16:04:50.295Z ERROR cmd/keeper.go:720 cannot get configured pg parameters {"error": "dial unix /tmp/.s.PGSQL.5433: connect: no such file or directory"}
2021-10-07T16:04:52.031582378Z app[a8319d13] fra [info] proxy | 2021-10-07T16:04:52.031Z INFO cmd/proxy.go:289 not proxying to master address since we aren't in the enabled proxies list {"address": "[fdaa:0:2d8e:a7b:67:0:2457:2]:5433"}
2021-10-07T16:04:52.796538858Z app[a8319d13] fra [info] keeper | 2021-10-07T16:04:52.796Z ERROR cmd/keeper.go:720 cannot get configured pg parameters {"error": "dial unix /tmp/.s.PGSQL.5433: connect: no such file or directory"}
2021-10-07T16:04:54.797452657Z app[a8319d13] fra [info] keeper | 2021-10-07T16:04:54.797Z INFO cmd/keeper.go:1094 our db boot UID is different than the cluster data one, waiting for it to be updated {"bootUUID": "edd638f0-b719-4ac9-a1b8-a6183d405b20", "clusterBootUUID": "040d9b75-7aff-4f4d-b726-6e37092c6756"}
2021-10-07T16:04:54.925143875Z app[a8319d13] fra [info] error connecting to local postgres context deadline exceeded
2021-10-07T16:04:54.925178254Z app[a8319d13] fra [info] checking stolon status
2021-10-07T16:04:55.297561989Z app[a8319d13] fra [info] keeper | 2021-10-07T16:04:55.297Z ERROR cmd/keeper.go:720 cannot get configured pg parameters {"error": "dial unix /tmp/.s.PGSQL.5433: connect: no such file or directory"}
2021-10-07T16:04:57.798035077Z app[a8319d13] fra [info] keeper | 2021-10-07T16:04:57.797Z ERROR cmd/keeper.go:720 cannot get configured pg parameters {"error": "dial unix /tmp/.s.PGSQL.5433: connect: no such file or directory"}
2021-10-07T16:05:00.298994303Z app[a8319d13] fra [info] keeper | 2021-10-07T16:05:00.298Z ERROR cmd/keeper.go:720 cannot get configured pg parameters {"error": "dial unix /tmp/.s.PGSQL.5433: connect: no such file or directory"}
2021-10-07T16:05:02.800328429Z app[a8319d13] fra [info] keeper | 2021-10-07T16:05:02.799Z ERROR cmd/keeper.go:720 cannot get configured pg parameters {"error": "dial unix /tmp/.s.PGSQL.5433: connect: no such file or directory"}
2021-10-07T16:05:03.159353080Z app[a8319d13] fra [info] proxy | 2021-10-07T16:05:03.159Z INFO cmd/proxy.go:124 Starting proxying
2021-10-07T16:05:03.161504023Z app[a8319d13] fra [info] proxy | 2021-10-07T16:05:03.161Z INFO cmd/proxy.go:268 master address {"address": "[fdaa:0:2d8e:a7b:67:0:2457:2]:5433"}
2021-10-07T16:05:05.301526030Z app[a8319d13] fra [info] keeper | 2021-10-07T16:05:05.301Z ERROR cmd/keeper.go:720 cannot get configured pg parameters {"error": "dial unix /tmp/.s.PGSQL.5433: connect: no such file or directory"}
2021-10-07T16:05:05.932521206Z app[a8319d13] fra [info] keeper | 2021-10-07T16:05:05.932Z INFO cmd/keeper.go:1094
Can we stop/start your db? We can switch it to a local consul by stopping all VMs and starting them back up. It’ll cause about 30s of downtime.
Go for it. It doesn’t matter whether you as a person or Fly as a service brings my service down.
Ok, updated. I just noticed this DB is only running one instance. We don’t completely support that; our service is designed for 2+ VMs at all times for redundancy. The change we just shipped will likely improve the reliability of your DB as is, but there will definitely be issues running DBs with one member.
If you really want to run Postgres with only one VM, you’re better off running the raw Postgres image directly. Believe it or not, the clustering infrastructure in our image makes a single VM less reliable.
Ok, thank you, I’ll adjust that.
My DB has been going down from time to time too, for the past ~30 minutes. The app is staging-db.
fly status
App
Name = staging-db
Owner = ulysse
Version = 7
Status = running
Hostname = staging-db.fly.dev
Instances
ID TASK VERSION REGION DESIRED STATUS HEALTH CHECKS RESTARTS CREATED
1df13ea8 app 7 cdg run running (failed to co) 3 total, 2 critical 3 9m55s ago
b6cb9747 app 7 cdg run running (failed to co) 3 total, 1 passing, 2 critical 12 2021-10-03T17:34:29Z
I ran fly scale count 2
for the DB app. That’s correct, isn’t it? I was just wondering because fly status
still shows one instance although fly scale show
shows Count 2.
Also, fly vm status <vm-id>
does not accept the instance ID given by fly status
You’ll need to add a volume for it to scale to: fly volumes create pg_data --size 10 --region fra
@kurt My db is still down. Any news?
There is already such a volume:
fly volumes show vol_x915grnn8ern70qy
ID: vol_x915grnn8ern70qy
Name: pg_data
Region: fra
Size GB: 10
Encrypted: false
Created at: 07 Jun 21 14:31 UTC
I guess I shouldn’t create another volume on top of that…
Update 2: I created another volume, and have two instances now, thank you!
@yannis sorry, missed your note. We’re looking!
My DB was offline again for a good 5–10 minutes.
My DB and app are offline, and the site is posted on Show HN too.
Edit: It’s back up after a deploy.