I believe we had a failure earlier today, and I'm wondering whether there is anything to be concerned about, especially given our recent experience (Database reset, 2 days of data lost - #8 by kurt).
$ fly status -a production-db --all
App
Name = production-db
Owner = enaia
Version = 11
Status = running
Hostname = production-db.fly.dev
Instances
ID       PROCESS VERSION REGION DESIRED STATUS            HEALTH CHECKS                  RESTARTS CREATED
f74e8482 app     11 ⇡    iad    run     running (replica) 3 total, 3 passing             0        4h24m ago
e12ac95b app     11 ⇡    iad    stop    failed            3 total, 2 passing, 1 critical 0        2021-12-14T23:06:17Z
3398f650 app     11 ⇡    iad    run     running (leader)  3 total, 3 passing             0        2021-12-14T23:05:13Z
f88b0035 app     10      iad    stop    failed            3 total                        0        2021-12-14T22:58:45Z
$ fly vm status f88b0035 -a production-db
Instance
ID = f88b0035
Process =
Version = 10
Region = iad
Desired = stop
Status = failed
Health Checks = 3 total
Restarts = 0
Created = 2021-12-14T22:58:45Z
Recent Events
TIMESTAMP            TYPE            MESSAGE
2021-12-14T22:58:36Z Received        Task received by client
2021-12-14T22:58:54Z Task Setup      Building Task Directory
2021-12-14T22:59:03Z Started         Task started by client
2021-12-14T22:59:05Z Terminated      Exit Code: 2
2021-12-14T22:59:05Z Not Restarting  Policy allows no restarts
2021-12-14T22:59:05Z Alloc Unhealthy Unhealthy because of failed task
2021-12-14T22:59:06Z Killing         Sent interrupt. Waiting 5m0s before force killing
Checks
ID SERVICE STATE OUTPUT
pg app warning
role app warning
vm app warning
Recent Logs
$ fly vm status e12ac95b -a production-db
Instance
ID = e12ac95b
Process =
Version = 11
Region = iad
Desired = stop
Status = failed
Health Checks = 3 total, 2 passing, 1 critical
Restarts = 0
Created = 2021-12-14T23:06:17Z
Recent Events
TIMESTAMP            TYPE             MESSAGE
2021-12-14T23:06:12Z Received         Task received by client
2021-12-14T23:06:30Z Task Setup       Building Task Directory
2021-12-14T23:06:40Z Started          Task started by client
2021-12-16T17:33:52Z Restart Signaled healthcheck: check "vm" unhealthy
2021-12-16T17:33:56Z Terminated       Exit Code: 0
2021-12-16T17:33:56Z Not Restarting   Policy allows no restarts
2021-12-16T17:33:56Z Killing          Sent interrupt. Waiting 5m0s before force killing
Checks
ID SERVICE STATE OUTPUT
pg app passing HTTP GET http://172.19.0.66:5500/flycheck/pg: 200 OK Output: "[✓] transactions: read/write (3.73ms)\n[✓] replicationLag: fdaa:0:309a:a7b:ab9:0:30e5:2 is lagging 0s (100ns)\n[✓] connections: 29 used, 3 reserved, 300 max (8.86ms)"
vm app critical HTTP GET http://172.19.0.66:5500/flycheck/vm: 500 Internal Server Error Output: "[✓] checkDisk: 9.09 GB (92.9%!)(MISSING) free space on /data/ (977.26µs)\n[✓] checkLoad: load averages: 0.14 0.22 0.25 (412.55µs)\n[✗] memory: system spent 1.03s of the last 10 seconds waiting on memory (54.88µs)\n[✗] cpu: system spent 1.09s of the last 10 seconds waiting on cpu (16.35µs)\n[✓] io: system spent 3.95s of the last 60s waiting on io (14.81µs)"
role app passing leader
Recent Logs
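
For anyone reading along: the only check reporting critical on e12ac95b is "vm", and within it the sub-checks marked [✗] are what triggered the restart signal. Here is a rough Python sketch for pulling the failing sub-checks out of that Output string. It is not part of the fly tooling; the names `VM_CHECK_OUTPUT`, `CHECK_LINE` and `failing_checks` are made up for illustration, and the "[mark] name: detail" line format is assumed from the single sample pasted above.

```python
import re

# Output string copied from the "vm" check of instance e12ac95b above.
# In the pasted status the newlines appear escaped as \n; here they are real newlines.
VM_CHECK_OUTPUT = (
    "[✓] checkDisk: 9.09 GB (92.9%!)(MISSING) free space on /data/ (977.26µs)\n"
    "[✓] checkLoad: load averages: 0.14 0.22 0.25 (412.55µs)\n"
    "[✗] memory: system spent 1.03s of the last 10 seconds waiting on memory (54.88µs)\n"
    "[✗] cpu: system spent 1.09s of the last 10 seconds waiting on cpu (16.35µs)\n"
    "[✓] io: system spent 3.95s of the last 60s waiting on io (14.81µs)"
)

# Assumed line shape: "[✓] name: detail" or "[✗] name: detail".
CHECK_LINE = re.compile(r"^\[(?P<mark>[✓✗])\] (?P<name>\w+): (?P<detail>.*)$")


def failing_checks(output: str) -> list[tuple[str, str]]:
    """Return (name, detail) for every sub-check line marked with ✗."""
    failures = []
    for line in output.splitlines():
        match = CHECK_LINE.match(line.strip())
        if match and match.group("mark") == "✗":
            failures.append((match.group("name"), match.group("detail")))
    return failures


if __name__ == "__main__":
    for name, detail in failing_checks(VM_CHECK_OUTPUT):
        print(f"{name}: {detail}")
```

Run against the output above, this lists the memory and cpu sub-checks, which matches the 500 response on /flycheck/vm while pg and role were still passing.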