a) how great fly ssh console is, and
b) how some managed database providers have a command that drops you into a database console (like pscale shell)
Have you considered making a fly postgres console command that drops you into a postgres shell, that uses the secrets system to retrieve the database password or something?
@julia Hey there! I went ahead and added a “connect” option to our latest Postgres images.
If you run fly image show --app <postgres-app> and update to the latest version, you’ll be able to run something like fly ssh console -C "connect" --app <postgres-app> to jump straight to the PG console.
We do plan to streamline this with a flyctl pg connect command, or something similar. No promises on when, but hopefully soon.
I created a development postgres app and I don’t see any metrics on its metrics page except “firecracker memory usage” and “data transfer” – are Postgres metrics not supported for development apps? I couldn’t find anything about it in the docs.
$ fly image update --app production-db
? Update `production-db` from flyio/postgres:13.4 v0.0.7 to flyio/postgres:13.5 v0.0.9? Yes
Release v10 created
You can detach the terminal anytime without stopping the update
Monitoring Deployment
2 desired, 1 placed, 0 healthy, 1 unhealthy [health checks: 3 total]
v10 failed - Failed due to unhealthy allocations
Failed Instances
==> Failure #1
Instance
ID = f88b0035
Process =
Version = 10
Region = iad
Desired = run
Status = pending
Health Checks = 3 total
Restarts = 0
Created = 21s ago
Recent Events
TIMESTAMP TYPE MESSAGE
2021-12-14T22:58:36Z Received Task received by client
2021-12-14T22:58:54Z Task Setup Building Task Directory
Recent Logs
2021-12-14T22:58:54.000 [info] Starting instance
2021-12-14T22:58:54.000 [info] Configuring virtual machine
2021-12-14T22:58:54.000 [info] Pulling container image
2021-12-14T22:58:56.000 [info] Unpacking image
2021-12-14T22:59:02.000 [info] Setting up volume 'pg_data'
2021-12-14T22:59:03.000 [info] Starting virtual machine
2021-12-14T22:59:03.000 [info] Starting init (commit: 7943db6)...
2021-12-14T22:59:03.000 [info] Mounting /dev/vdc at /data
2021-12-14T22:59:03.000 [info] 2021/12/14 22:59:03 listening on [fdaa:0:309a:a7b:ab9:0:30e5:2]:22 (DNS: [fdaa::3]:53)
2021-12-14T22:59:03.000 [info] panic: FLY_ETCD_URL or ETCD_URL are required
2021-12-14T22:59:04.000 [info] Main child exited normally with code: 2
2021-12-14T22:59:04.000 [info] Starting clean up.
2021-12-14T22:59:04.000 [info] Umounting /dev/vdc from /data
***v10 failed - Failed due to unhealthy allocations and deploying as v11
Troubleshooting guide at https://fly.io/docs/getting-started/troubleshooting/
$ fly status -a production-db --all
App
Name = production-db
Owner = enaia
Version = 10
Status = running
Hostname = production-db.fly.dev
Deployment Status
ID = 01e50fd7-e3d4-8457-2c7a-87fbd10e38a8
Version = v10
Status = failed
Description = Failed due to unhealthy allocations
Instances = 2 desired, 1 placed, 0 healthy, 1 unhealthy
Instances
ID PROCESS VERSION REGION DESIRED STATUS HEALTH CHECKS RESTARTS CREATED
f88b0035 app 10 ⇡ iad run failed 3 total 0 4m2s ago
21176624 app 9 iad stop complete 3 total, 3 passing 0 8h18m ago
4faf45da app 9 iad run running (replica) 3 total, 3 passing 0 8h18m ago
e8c6edd9 app 8 iad stop failed 0 8h18m ago
2e2fa531 app 8 iad stop failed 0 8h18m ago
9aa4ab49 app 8 iad stop failed 0 8h19m ago
57d90a7f app 8 iad stop failed 0 8h19m ago
e41064bc app 8 iad stop failed 3 total 0 8h19m ago
64e284d2 app 8 iad stop failed 3 total 0 8h19m ago
06171c90 app 7 iad stop complete 3 total, 3 passing 0 2021-12-02T22:27:08Z
5d0b0ecf app 7 iad stop complete 3 total, 3 passing 0 2021-12-02T22:27:08Z
$ fly status -a production-db --all
App
Name = production-db
Owner = enaia
Version = 11
Status = running
Hostname = production-db.fly.dev
Instances
ID PROCESS VERSION REGION DESIRED STATUS HEALTH CHECKS RESTARTS CREATED
f74e8482 app 11 ⇡ iad run running (replica) 3 total, 3 passing 0 4h24m ago
e12ac95b app 11 ⇡ iad stop failed 3 total, 2 passing, 1 critical 0 2021-12-14T23:06:17Z
3398f650 app 11 ⇡ iad run running (leader) 3 total, 3 passing 0 2021-12-14T23:05:13Z
f88b0035 app 10 iad stop failed 3 total 0 2021-12-14T22:58:45Z
$ fly vm status f88b0035 -a production-db
Instance
ID = f88b0035
Process =
Version = 10
Region = iad
Desired = stop
Status = failed
Health Checks = 3 total
Restarts = 0
Created = 2021-12-14T22:58:45Z
Recent Events
TIMESTAMP TYPE MESSAGE
2021-12-14T22:58:36Z Received Task received by client
2021-12-14T22:58:54Z Task Setup Building Task Directory
2021-12-14T22:59:03Z Started Task started by client
2021-12-14T22:59:05Z Terminated Exit Code: 2
2021-12-14T22:59:05Z Not Restarting Policy allows no restarts
2021-12-14T22:59:05Z Alloc Unhealthy Unhealthy because of failed task
2021-12-14T22:59:06Z Killing Sent interrupt. Waiting 5m0s before force killing
Checks
ID SERVICE STATE OUTPUT
pg app warning
role app warning
vm app warning
Recent Logs
$ fly vm status e12ac95b -a production-db
Instance
ID = e12ac95b
Process =
Version = 11
Region = iad
Desired = stop
Status = failed
Health Checks = 3 total, 2 passing, 1 critical
Restarts = 0
Created = 2021-12-14T23:06:17Z
Recent Events
TIMESTAMP TYPE MESSAGE
2021-12-14T23:06:12Z Received Task received by client
2021-12-14T23:06:30Z Task Setup Building Task Directory
2021-12-14T23:06:40Z Started Task started by client
2021-12-16T17:33:52Z Restart Signaled healthcheck: check "vm" unhealthy
2021-12-16T17:33:56Z Terminated Exit Code: 0
2021-12-16T17:33:56Z Not Restarting Policy allows no restarts
2021-12-16T17:33:56Z Killing Sent interrupt. Waiting 5m0s before force killing
Checks
ID SERVICE STATE OUTPUT
pg app passing HTTP GET http://172.19.0.66:5500/flycheck/pg: 200 OK Output: "[✓] transactions: read/write (3.73ms)\n[✓] replicationLag: fdaa:0:309a:a7b:ab9:0:30e5:2 is lagging 0s (100ns)\n[✓] connections: 29 used, 3 reserved, 300 max (8.86ms)"
vm app critical HTTP GET http://172.19.0.66:5500/flycheck/vm: 500 Internal Server Error Output: "[✓] checkDisk: 9.09 GB (92.9%!)(MISSING) free space on /data/ (977.26µs)\n[✓] checkLoad: load averages: 0.14 0.22 0.25 (412.55µs)\n[✗] memory: system spent 1.03s of the last 10 seconds waiting on memory (54.88µs)\n[✗] cpu: system spent 1.09s of the last 10 seconds waiting on cpu (16.35µs)\n[✓] io: system spent 3.95s of the last 60s waiting on io (14.81µs)"
role app passing leader
Recent Logs
That instance seems to be failing the memory-related health checks occasionally. This can happen when there’s not enough RAM for the database, but it might also be fine.
We just tweaked your DB to not restart when those health checks fail.
Just wondering, is there a technical reason for needing the whole fly-replay stuff? Why don’t these Postgres replicas in different regions have a thin layer that just forwards the SQL to the main instance when it’s a write, and forwards the result back? This would remove the need for every developer to handle this stuff on their own.
We wrote a bit about this in our Global Postgres announcement. The short answer is that it’s gross and unpredictable to do individual writes to a far away Postgres:
A bit more on “gross” here: you can get your database layer to do this kind of stuff for you directly, using something like pgpool so that the database layer itself knows where to route transactions. But there’s a problem with this: your app doesn’t expect this to happen, and isn’t built to handle it. What you see when you try routing writes at the database connection layer is something like this:
A read query for data, from read replica, perhaps for validation: 0ms.
A write to the primary, in a different region: 20-400ms.
A read query to the primary, for consistency, in a different region: 20-400ms.
More read queries against primary for consistency, in a different region: 20-400ms.
Maybe another write to the primary: 20-400ms.
Repeat.
It is much, much faster to ship the whole HTTP request where it needs to be than it is to move the database away from an app instance and route database queries directly. Remember: replay is happening within Fly’s network. HTTP isn’t bouncing back and forth between the user and our edge (that would be slow); it’s happening inside our CDN.
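To put rough numbers on that comparison, here is a back-of-the-envelope sketch. It is only a toy model: the function names, the count of five primary-bound queries, and the 1ms local query time are all made up for illustration, and only the 20-400ms round-trip range comes from the list above.

```python
# Toy latency model. Every number is illustrative, not a measurement.

def routed_queries_ms(primary_queries: int, round_trip_ms: float) -> float:
    """App stays in a remote region: each query sent to the primary
    pays a full cross-region round trip."""
    return primary_queries * round_trip_ms


def replayed_request_ms(
    primary_queries: int, round_trip_ms: float, local_query_ms: float = 1.0
) -> float:
    """The whole HTTP request is replayed once into the primary's region,
    so the cross-region cost is paid once and the queries run locally."""
    return round_trip_ms + primary_queries * local_query_ms


if __name__ == "__main__":
    # A write, a consistency read, two more reads, another write.
    queries = 5
    for rtt in (20.0, 400.0):
        routed = routed_queries_ms(queries, rtt)
        replayed = replayed_request_ms(queries, rtt)
        print(f"rtt={rtt:.0f}ms routed={routed:.0f}ms replayed={replayed:.0f}ms")
```

Under these assumptions the routed version grows with every extra query, while the replayed version pays the cross-region hop only once per request.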
That said, the fly-replay mechanism is designed to be easy to implement as a library. It’s almost magical in Rails and Phoenix; we just haven’t gotten to build the libraries for other frameworks yet.
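For a sense of how small such a library could be, here is a rough WSGI middleware sketch in Python. It is not an official Fly library, and it rests on several assumptions to check against the current docs: that the current and primary regions are exposed as the FLY_REGION and PRIMARY_REGION environment variables, that the proxy replays a request when the response carries a fly-replay: region=<code> header, and that 409 is an acceptable status for that response. The ReplayWrites class and the method-based write detection are hypothetical simplifications; a real library would more likely react to a failed write against a read replica.

```python
import os

# Assumption: treat these HTTP methods as "probably writes".
WRITE_METHODS = {"POST", "PUT", "PATCH", "DELETE"}


class ReplayWrites:
    """Hypothetical WSGI middleware: ask the proxy to replay write
    requests in the primary region instead of handling them locally."""

    def __init__(self, app, current_region=None, primary_region=None):
        self.app = app
        # Assumed env var names; pass regions explicitly to override.
        self.current_region = current_region or os.environ.get("FLY_REGION")
        self.primary_region = primary_region or os.environ.get("PRIMARY_REGION")

    def __call__(self, environ, start_response):
        is_write = environ.get("REQUEST_METHOD", "GET") in WRITE_METHODS
        in_primary = self.current_region == self.primary_region
        if is_write and self.primary_region and not in_primary:
            # Empty response whose header tells the proxy where to replay.
            headers = [
                ("fly-replay", f"region={self.primary_region}"),
                ("Content-Length", "0"),
            ]
            start_response("409 Conflict", headers)
            return [b""]
        # Reads, and anything already in the primary region, run locally.
        return self.app(environ, start_response)
```

Wrapping an existing app would look like app = ReplayWrites(app). Reads stay on the nearby replica, and only writes take the extra hop to the primary region.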