Hi,
We’re having recurring issues with our Postgres machine on Fly.io: it fails roughly every 7 hours, and we haven’t been able to find a solution. We’ve tried scaling the machine and adjusting the configuration, with no change. Health checks are passing the whole time. Any help appreciated.
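For context, everything looks healthy between failures; we’ve been checking with the usual flyctl commands, something like:

$ fly checks list -a openformat-db-production        # app name per the proxy logs below
$ fly machine status d8dd671ce653d8 -a openformat-db-production

Both report the machine as started and all checks passing right up until it fails.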
Here is the machine config:
Machine ID: d8dd671ce653d8
Instance ID: 01HXRA89CMM4FQN2ZQSECNNA11
State: started
VM
  ID            = d8dd671ce653d8
  Instance ID   = 01HXRA89CMM4FQN2ZQSECNNA11
  State         = started
  Image         = flyio/postgres-flex:15.6 (v0.0.51)
  Name          = spring-sunset-4966
  Private IP    = fdaa:5:8e54:a7b:18:96db:5d92:2
  Region        = lhr
  Process Group =
  CPU Kind      = performance
  vCPUs         = 1
  Memory        = 2048
  Created       = 2024-05-03T10:00:58Z
  Updated       = 2024-05-13T06:29:19Z
  Entrypoint    =
  Command       =
  Volume        = vol_40k9zejm8zw9z7mv
Event Logs
STATE    EVENT   SOURCE  TIMESTAMP                       INFO
started  start   flyd    2024-05-13T07:29:19.336+01:00
created  launch  user    2024-05-13T07:29:16.849+01:00
Config:
{
  "env": {
    "PRIMARY_REGION": "lhr"
  },
  "init": {},
  "guest": {
    "cpu_kind": "performance",
    "cpus": 1,
    "memory_mb": 2048
  },
  "metadata": {
    "fly-managed-postgres": "true",
    "fly_flyctl_version": "0.2.47",
    "fly_platform_version": "v2",
    "managed-by-fly-deploy": "true"
  },
  "mounts": [
    {
      "encrypted": true,
      "path": "/data",
      "size_gb": 10,
      "volume": "vol_40k9zejm8zw9z7mv",
      "name": "pg_data"
    }
  ],
  "services": [
    {
      "protocol": "tcp",
      "internal_port": 5432,
      "autostop": false,
      "autostart": false,
      "ports": [
        {
          "port": 5432,
          "handlers": [
            "pg_tls"
          ]
        }
      ],
      "concurrency": {
        "type": "connections",
        "hard_limit": 1000,
        "soft_limit": 1000
      },
      "force_instance_key": null
    },
    {
      "protocol": "tcp",
      "internal_port": 5433,
      "autostop": false,
      "autostart": false,
      "ports": [
        {
          "port": 5433,
          "handlers": [
            "pg_tls"
          ]
        }
      ],
      "concurrency": {
        "type": "connections",
        "hard_limit": 1000,
        "soft_limit": 1000
      },
      "force_instance_key": null
    }
  ],
  "metrics": {
    "port": 9187,
    "path": "/metrics"
  },
  "checks": {
    "pg": {
      "port": 5500,
      "type": "http",
      "interval": "15s",
      "timeout": "10s",
      "path": "/flycheck/pg"
    },
    "role": {
      "port": 5500,
      "type": "http",
      "interval": "15s",
      "timeout": "10s",
      "path": "/flycheck/role"
    },
    "vm": {
      "port": 5500,
      "type": "http",
      "interval": "15s",
      "timeout": "10s",
      "path": "/flycheck/vm"
    }
  },
  "image": "flyio/postgres-flex:15.6@sha256:92917e5770cef6666dddbf17061c2f95b9e19b8155be9a8ce8c35e09e5381167",
  "restart": {
    "policy": "on-failure"
  }
}
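For what it’s worth, the health checks above are plain HTTP on port 5500, so we can hit them directly from another machine on our private network to double-check; something like:

$ curl -s "http://[fdaa:5:8e54:a7b:18:96db:5d92:2]:5500/flycheck/pg"     # private IPv6 from the VM block above
$ curl -s "http://[fdaa:5:8e54:a7b:18:96db:5d92:2]:5500/flycheck/role"
$ curl -s "http://[fdaa:5:8e54:a7b:18:96db:5d92:2]:5500/flycheck/vm"

These come back healthy between failures, consistent with the passing checks.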
When it fails, we immediately restart it manually; the restart is roughly:
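$ fly machine restart d8dd671ce653d8 -a openformat-db-production

Here are the logs from around one of these failures (newest first):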
2024-05-14 19:14:41.762 repmgrd | [2024-05-14 18:14:41] [NOTICE] monitoring cluster primary "fdaa:5:8e54:a7b:18:96db:5d92:2" (ID: 374181770)
2024-05-14 19:14:41.762 repmgrd | [2024-05-14 18:14:41] [INFO] "connection_check_type" set to "ping"
2024-05-14 19:14:41.762 repmgrd | [2024-05-14 18:14:41] [NOTICE] starting monitoring of node "fdaa:5:8e54:a7b:18:96db:5d92:2" (ID: 374181770)
2024-05-14 19:14:41.762 repmgrd | INFO: set_repmgrd_pid(): provided pidfile is /tmp/repmgrd.pid
2024-05-14 19:14:41.744 repmgrd | [2024-05-14 18:14:41] [INFO] connecting to database "host=fdaa:5:8e54:a7b:18:96db:5d92:2 port=5433 user=repmgr dbname=repmgr connect_timeout=5"
2024-05-14 19:14:41.744 repmgrd | [2024-05-14 18:14:41] [NOTICE] repmgrd (repmgrd 5.3.3) starting up
2024-05-14 19:14:41.706 repmgrd | Running...
2024-05-14 19:14:38.688 proxy | [WARNING] (408) : Server bk_db/pg1 administratively READY thanks to valid DNS answer.
2024-05-14 19:14:38.688 proxy | [WARNING] (408) : Server bk_db/pg1 ('lhr.openformat-db-production.internal') is UP/READY (resolves again).
2024-05-14 19:14:38.688 proxy | [WARNING] (408) : bk_db/pg1 changed its IP from (none) to fdaa:5:8e54:a7b:18:96db:5d92:2 by flydns/dns1.
2024-05-14 19:14:38.687 proxy | [NOTICE] (406) : Loading success.
2024-05-14 19:14:38.687 proxy | [NOTICE] (406) : New worker (408) forked
2024-05-14 19:14:38.645 proxy | Running...
2024-05-14 19:14:37.645 proxy | restarting in 1s [attempt 1]
2024-05-14 19:14:37.645 proxy | Process exited 0
2024-05-14 19:14:37.643 proxy | [WARNING] (357) : All workers exited. Exiting... (0)
2024-05-14 19:14:37.643 proxy | [ALERT] (357) : Current worker (377) exited with code 143 (Terminated)
2024-05-14 19:14:37.643 proxy | [NOTICE] (357) : path to executable is /usr/sbin/haproxy
2024-05-14 19:14:37.643 proxy | [NOTICE] (357) : haproxy version is 2.8.9-1~bpo12+1
2024-05-14 19:14:37.533 Voting member(s): 1, Active: 1, Inactive: 0, Conflicts: 0
2024-05-14 19:14:36.883 Machine started in 1.139s
2024-05-14 19:14:36.778 postgres | 2024-05-14 18:14:36.777 UTC [341] LOG: database system is ready to accept connections
2024-05-14 19:14:36.767 postgres | 2024-05-14 18:14:36.767 UTC [382] LOG: database system was shut down at 2024-05-14 18:14:35 UTC
2024-05-14 19:14:36.764 postgres | 2024-05-14 18:14:36.764 UTC [341] LOG: listening on Unix socket "/var/run/postgresql/.s.PGSQL.5433"
2024-05-14 19:14:36.762 postgres | 2024-05-14 18:14:36.762 UTC [341] LOG: listening on IPv6 address "::", port 5433
2024-05-14 19:14:36.762 postgres | 2024-05-14 18:14:36.762 UTC [341] LOG: listening on IPv4 address "0.0.0.0", port 5433
2024-05-14 19:14:36.762 postgres | 2024-05-14 18:14:36.762 UTC [341] LOG: starting PostgreSQL 15.6 (Debian 15.6-1.pgdg120+2) on x86_64-pc-linux-gnu, compiled by gcc (Debian 12.2.0-14) 12.2.0, 64-bit
2024-05-14 19:14:36.710 proxy | [NOTICE] (357) : Loading success.
2024-05-14 19:14:36.710 proxy | [NOTICE] (357) : New worker (377) forked
2024-05-14 19:14:36.705 repmgrd | restarting in 5s [attempt 1]
2024-05-14 19:14:36.705 repmgrd | exit status 6
2024-05-14 19:14:36.704 repmgrd | user=repmgr connect_timeout=5 dbname=repmgr host=fdaa:5:8e54:a7b:18:96db:5d92:2 port=5433 fallback_application_name=repmgr options=-csearch_path=
2024-05-14 19:14:36.704 repmgrd | [2024-05-14 18:14:36] [DETAIL] attempted to connect using:
2024-05-14 19:14:36.704 repmgrd |
2024-05-14 19:14:36.704 repmgrd | Is the server running on that host and accepting TCP/IP connections?
2024-05-14 19:14:36.704 repmgrd | connection to server at "fdaa:5:8e54:a7b:18:96db:5d92:2", port 5433 failed: Connection refused
2024-05-14 19:14:36.704 repmgrd | [2024-05-14 18:14:36] [DETAIL]
2024-05-14 19:14:36.704 repmgrd | [2024-05-14 18:14:36] [ERROR] connection to database failed
2024-05-14 19:14:36.695 repmgrd | [2024-05-14 18:14:36] [INFO] connecting to database "host=fdaa:5:8e54:a7b:18:96db:5d92:2 port=5433 user=repmgr dbname=repmgr connect_timeout=5"
2024-05-14 19:14:36.695 repmgrd | [2024-05-14 18:14:36] [NOTICE] repmgrd (repmgrd 5.3.3) starting up
2024-05-14 19:14:36.626 monitor | Pruning every 24h0m0s...
2024-05-14 19:14:36.618 monitor | Monitoring replication slots
2024-05-14 19:14:36.618 monitor | Monitoring cluster state
2024-05-14 19:14:36.618 monitor | Monitoring dead members
2024-05-14 19:14:36.525 admin | Running...
2024-05-14 19:14:36.525 exporter | Running...
2024-05-14 19:14:36.523 proxy | Running...
2024-05-14 19:14:36.522 repmgrd | Running...
2024-05-14 19:14:36.515 monitor | Running...
2024-05-14 19:14:36.514 postgres | Running...
2024-05-14 19:14:36.353 2024/05/14 18:14:36 INFO SSH listening listen_address=[fdaa:5:8e54:a7b:18:96db:5d92:2]:22 dns_server=[fdaa::3]:53
2024-05-14 19:14:36.347 INFO [fly api proxy] listening at /.fly/api
2024-05-14 19:14:36.340 INFO Preparing to run: `docker-entrypoint.sh start` as root
2024-05-14 19:14:36.339 INFO Resized /data to 10720641024 bytes
2024-05-14 19:14:36.336 INFO Mounting /dev/vdb at /data w/ uid: 0, gid: 0 and chmod 0755
2024-05-14 19:14:36.266 INFO Starting init (commit: d772ddd9)...
2024-05-14 19:14:36.076 [ 0.164970] PCI: Fatal: No config space access function found
2024-05-14 19:14:36.026 [ 0.156061] Spectre V2 : WARNING: Unprivileged eBPF is enabled with eIBRS on, data leaks possible via Spectre v2 BHB attacks!
2024-05-14 19:14:35.122 [34141.938099] reboot: Restarting system
2024-05-14 19:14:35.118 INFO Umounting /dev/vdb from /data
2024-05-14 19:14:35.117 INFO Starting clean up.
2024-05-14 19:14:35.103 INFO Main child exited normally with code: 0
2024-05-14 19:14:35.101 postgres | Process exited 0
2024-05-14 19:14:35.100 postgres | 2024-05-14 18:14:35.100 UTC [342] LOG: database system is shut down
2024-05-14 19:14:35.091 postgres | 2024-05-14 18:14:35.091 UTC [381] LOG: checkpoint complete: wrote 0 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.002 s, sync=0.001 s, total=0.004 s; sync files=0, longest=0.000 s, average=0.000 s; distance=0 kB, estimate=5 kB
2024-05-14 19:14:35.088 repmgrd | Process exited 0
2024-05-14 19:14:35.087 postgres | 2024-05-14 18:14:35.087 UTC [381] LOG: checkpoint starting: shutdown immediate
2024-05-14 19:14:35.087 repmgrd | [2024-05-14 18:14:35] [INFO] repmgrd terminating...
2024-05-14 19:14:35.087 repmgrd | SELECT pg_catalog.pg_is_in_recovery()
2024-05-14 19:14:35.087 repmgrd | [2024-05-14 18:14:35] [DETAIL] query text is:
2024-05-14 19:14:35.087 repmgrd |
2024-05-14 19:14:35.087 repmgrd | before or while processing the request.
2024-05-14 19:14:35.087 repmgrd | This probably means the server terminated abnormally
2024-05-14 19:14:35.087 repmgrd | server closed the connection unexpectedly
2024-05-14 19:14:35.087 repmgrd | FATAL: terminating connection due to administrator command
2024-05-14 19:14:35.087 repmgrd | [2024-05-14 18:14:35] [DETAIL]
2024-05-14 19:14:35.087 repmgrd | [2024-05-14 18:14:35] [ERROR] unable to determine if server is in recovery
2024-05-14 19:14:35.087 postgres | 2024-05-14 18:14:35.087 UTC [381] LOG: shutting down
2024-05-14 19:14:35.087 postgres | 2024-05-14 18:14:35.085 UTC [438] FATAL: terminating connection due to administrator command
2024-05-14 19:14:35.087 postgres | 2024-05-14 18:14:35.084 UTC [424] FATAL: terminating connection due to administrator command
2024-05-14 19:14:35.087 postgres | 2024-05-14 18:14:35.083 UTC [342] LOG: background worker "logical replication launcher" (PID 386) exited with exit code 1
2024-05-14 19:14:35.087 postgres | 2024-05-14 18:14:35.080 UTC [342] LOG: aborting any active transactions
2024-05-14 19:14:35.087 proxy | [WARNING] (405) : All workers exited. Exiting... (130)
2024-05-14 19:14:35.087 proxy | [ALERT] (405) : Current worker (407) exited with code 130 (Interrupt)
2024-05-14 19:14:35.087 proxy | exit status 130
2024-05-14 19:14:35.087 exporter | signal: interrupt
2024-05-14 19:14:35.087 admin | signal: interrupt
2024-05-14 19:14:35.080 repmgrd | [2024-05-14 18:14:35] [NOTICE] INT signal received
2024-05-14 19:14:35.080 postgres | 2024-05-14 18:14:35.078 UTC [342] LOG: received fast shutdown request
2024-05-14 19:14:35.080 monitor | signal: interrupt
2024-05-14 19:14:35.080 proxy | [WARNING] (405) : Exiting Master process...
2024-05-14 19:14:35.080 proxy | [NOTICE] (405) : path to executable is /usr/sbin/haproxy
2024-05-14 19:14:35.080 proxy | [NOTICE] (405) : haproxy version is 2.8.9-1~bpo12+1
2024-05-14 19:14:35.080 monitor | Stopping interrupt...
2024-05-14 19:14:35.080 admin | Stopping interrupt...
2024-05-14 19:14:35.080 repmgrd | Stopping interrupt...
2024-05-14 19:14:35.080 exporter | Stopping interrupt...
2024-05-14 19:14:35.080 proxy | Stopping interrupt...
2024-05-14 19:14:35.077 postgres | Stopping interrupt...
2024-05-14 19:14:35.077 Supervisor stopping
2024-05-14 19:14:35.077 Got interrupt, stopping
2024-05-14 19:14:35.077 INFO Sending signal SIGINT to main child process w/ PID 323
2024-05-14 19:14:04.677 repmgrd | [2024-05-14 18:14:04] [INFO] monitoring primary node "fdaa:5:8e54:a7b:18:96db:5d92:2" (ID: 374181770) in normal state
2024-05-14 19:10:33.906 monitor | Voting member(s): 1, Active: 1, Inactive: 0, Conflicts: 0
2024-05-14 19:09:02.957 repmgrd | [2024-05-14 18:09:02] [INFO] monitoring primary node "fdaa:5:8e54:a7b:18:96db:5d92:2" (ID: 374181770) in normal state
2024-05-14 19:05:33.905 monitor | Voting member(s): 1, Active: 1, Inactive: 0, Conflicts: 0
2024-05-14 19:04:01.185 repmgrd | [2024-05-14 18:04:01] [INFO] monitoring primary node "fdaa:5:8e54:a7b:18:96db:5d92:2" (ID: 374181770) in normal state
2024-05-14 19:00:33.905 monitor | Voting member(s): 1, Active: 1, Inactive: 0, Conflicts: 0
2024-05-14 18:58:59.393 repmgrd | [2024-05-14 17:58:59] [INFO] monitoring primary node "fdaa:5:8e54:a7b:18:96db:5d92:2" (ID: 374181770) in normal state
2024-05-14 18:55:33.904 monitor | Voting member(s): 1, Active: 1, Inactive: 0, Conflicts: 0
2024-05-14 18:53:57.727 repmgrd | [2024-05-14 17:53:57] [INFO] monitoring primary node "fdaa:5:8e54:a7b:18:96db:5d92:2" (ID: 374181770) in normal state
2024-05-14 18:50:33.903 monitor | Voting member(s): 1, Active: 1, Inactive: 0, Conflicts: 0
2024-05-14 18:48:56.041 repmgrd | [2024-05-14 17:48:56] [INFO] monitoring primary node "fdaa:5:8e54:a7b:18:96db:5d92:2" (ID: 374181770) in normal state