Postgres is down, cannot restart. Error no active leader found.

Hi fly.io community,

I migrated from Heroku back in August. Everything was working really well until recently when my postgres server just stopped working (without any prompting from me, I think).

The machine for my postgres server seems to be stuck in “starting” as far as I can tell.

When I try to connect:

$ flyctl postgres connect -a rep-db
Error no active leader found

When I try to restart the machine:

$ flyctl machine restart 32871e1f692e85 -a rep-db
Restarting machine 32871e1f692e85
failed to release lease for machine 32871e1f692e85: lease not foundError failed to restart machine 32871e1f692e85: could not stop machine 32871e1f692e85: failed to restart VM 32871e1f692e85: failed to wait for machine to be started

What is my next step? I noticed in the Fly.io Postgres documentation that, as of flyctl v0.0.412, new Postgres clusters are created using the “next-gen Apps V2 architecture, built on Fly Machines” instead of the Nomad architecture.

Is this the root of my issue? I’m using flyctl version 0.0.435 (fly v0.0.435 darwin/amd64 Commit: c5149629 BuildDate: 2022-11-22T16:36:15Z)

Thank you


In case this is helpful, when I try to deploy my Django application, I get this error:

--> You can detach the terminal anytime without stopping the deployment
==> Release command detected: python manage.py migrate

--> This release will not be available until the release command succeeds.
	 Starting instance
	 Configuring virtual machine
	 Unpacking image
	 Preparing kernel init
	 UUID=c3b8aabc-c66f-40d3-bc0c-900018c3ae63
	 Preparing to run: `python manage.py migrate` as root
	 2022/11/28 02:52:54 listening on [fdaa:0:88d8:a7b:e770:b2b0:2e09:2]:22 (DNS: [fdaa::3]:53)
	 Traceback (most recent call last):
	     return func(*args, **kwargs)
	            ^^^^^^^^^^^^^^^^^^^^^
	   File "/usr/local/lib/python3.1
	     connection = Database.connect(**conn_params)
	 psycopg2.OperationalError: could not translate host name "top2.nearest.of.rep-db.internal" to address: Name does not resolve
	 Traceback (most recent call last):
	   File "/usr/local/lib/python3.11/site-packages/django/core/management/base.py", line 354, in run_from_argv
	   File "/usr/local/lib/python3.11/site-packages/django/core/management/base.py", line 398, in execute
	   File "/usr/local/lib/python3.11/site-packages/django/core/management/base.py", line 89, in wrapped
	           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
	   File "/usr/local/lib/python3.11/site-packages/django/core/management/commands/migrate.py", line 75, in handle
	   File "/usr/local/lib/python3.11/site-packages/django/core/management/base.py", line 419, in check
	     all_issues = checks.run_checks(
	      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
	   File "/usr/local/lib/python3.11/site-packages/django/db/models/base.py", line 1682, in _check_indexes
	     connection.features.supports_covering_indexes or
	     res = instance.__dict__[self.name] = self.func(instance)
	     return next(self.gen)
	     with self.cursor() as cursor:
	     return func(*args, **kwargs)
	            ^^^^^^^^^^^^^^^^^^^^^
	     return func(*args, **kwargs)
	                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
	   File "/usr/local/lib/python3.11/site-packages/psycopg2/__init__.py", line 122, in connect
	     conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
	            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
	 django.db.utils.OperationalError: could not translate host name "top2.nearest.of.rep-db.internal" to address: Name does not resolve
	 Starting clean up.

rep-db is the name of the app for my Postgres instance.

I’ve also restored a snapshot into a second Postgres instance, which seems to be doing just fine (and thankfully has all the data I need). However, I cannot detach the original, broken instance. I get the same “no active leader found” error:

$ fly postgres detach rep-db
Error no active leader found

I can’t attach the new db instance before detaching the current one.
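For anyone following along, the snapshot restore itself was roughly the following; the volume and snapshot IDs are placeholders, and the flags are from memory, so double-check fly volumes --help and fly postgres create --help:

# find the volume attached to the broken cluster
$ fly volumes list -a rep-db
# list its snapshots and pick a recent one
$ fly volumes snapshots list <volume-id>
# create a fresh Postgres app seeded from that snapshot
$ fly postgres create --snapshot-id <snapshot-id> --name rep-db-restored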

Same issue here.

After seeing this in the documentation:

This Is Not Managed Postgres

Before you use Fly Postgres, here are some things worth understanding about it:

Fly Postgres is a regular app you deploy on Fly.io, with an automated creation process and some platform integration to simplify management. It relies on building blocks available to all Fly apps, like flyctl, volumes, private networking, health checks, logs, metrics, and more. The source code is available on GitHub to view and fork.

This is not a managed database. If Postgres crashes because it ran out of memory or disk space, you’ll need to do a little work to get it back.

I realized what I actually need and want is a managed Postgres service. Using the second Postgres instance created from the snapshot, I migrated my database to another managed provider.


This sounds like the instance might have run out of disk space. If you run into an issue like this in the future, run fly checks list. This should tell you what’s actually failing on the Postgres instance.
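For example, something like this should point at the culprit (substitute your own Postgres app name):

# the vm check reports free space on /data/
$ fly checks list -a rep-db
# a volume at or near 100% used is a common cause of a stuck leader
$ fly volumes list -a rep-db
# "No space left on device" in the keeper logs confirms it
$ fly logs -a rep-db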

With a fresh Postgres deployment and absolutely no use, I get this error.

flyctl postgres connect -a correct_db_name
=> Error no active leader found

Hard to understand what’s going on.


I have the same issue. In the console I can see a restart (“by Fly Admin Bot 4 days ago”) as well as a new secret being set for FLY_CONSUL_URL.

$ flyctl pg restart -a myapp
Error no active leader found
$ flyctl checks -c myapp.toml
Health Checks for myapp
  NAME | STATUS | MACHINE | LAST UPDATED | OUTPUT
-------*--------*---------*--------------*---------
$ flyctl ping myapp.internal
Error get app: Could not find App

I had this same issue happen. It appears to have been related to some internal DNS issue, or perhaps it caused the DNS issue. top2.nearest.of.<my database>.internal became unreachable this morning, and I found this issue at the root of it. fly dig top2.nearest.of.<my database>.internal returns empty records.

I got around this issue by restoring a snapshot to a new Postgres app and attaching that to its associated webserver. Thankfully this was a staging environment, where a rollback of 22 hours is acceptable. I don’t think this would be an acceptable strategy in a production setting, though. Is any work being done on stability here, or on enabling workarounds that preserve data integrity?

If your DB crashed, top2.nearest.of.app.internal will return empty results. Restoring to a new DB will get you moving, but you should investigate the health of the previous Postgres app and see what’s up.
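A rough way to check (substituting your own Postgres app name):

# an empty answer here means no healthy instance is registered in internal DNS
$ fly dig top2.nearest.of.<pg-app>.internal
# look for machines in "error" or stuck in "starting"
$ fly status -a <pg-app>
# the keeper/sentinel lines usually say why the leader is gone
$ fly logs -a <pg-app>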

I am also seeing this issue, flyctl checks list isn’t very helpful:

❯ flyctl checks list --config fly/db.toml 
Update available 0.0.463 -> 0.0.473.
Run "flyctl version update" to upgrade.
Health Checks for solitary-sun-2613
  NAME | STATUS  | MACHINE        | LAST UPDATED         | OUTPUT                     
-------*---------*----------------*----------------------*----------------------------
  pg   | warning | 73d8d3d6a72389 | 2023-02-23T10:59:03Z | waiting for status update  
-------*---------*----------------*----------------------*----------------------------
  role | warning | 73d8d3d6a72389 | 2023-02-23T10:59:03Z | waiting for status update  
-------*---------*----------------*----------------------*----------------------------
  vm   | warning | 73d8d3d6a72389 | 2023-02-23T10:59:03Z | waiting for status update  
-------*---------*----------------*----------------------*----------------------------

For anyone interested: I wasn’t able to track down what was wrong, but I did resolve the problem.

I initially tried restarting postgres but the command errored:

❯ flyctl pg restart --config fly/db.toml
Update available 0.0.463 -> 0.0.473.
Run "flyctl version update" to upgrade.
Error no active leader found

I tried upgrading the image, hoping it would force a restart, but unfortunately nope:

❯ flyctl image update --config fly/db.toml 
Update available 0.0.463 -> 0.0.474.
Run "flyctl version update" to upgrade.
The following changes will be applied to all Postgres machines.
Machines not running the official Postgres image will be skipped.

  	... // 3 identical lines
  		},
  		"init": {},
- 		"image": "flyio/postgres:14.6",
+ 		"image": "registry-1.docker.io/flyio/postgres:14.6@sha256:9cfb3fafcc1b9bc2df7c901d2ae4a81e83ba224bfe79b11e4dc11bb1838db46e",
  		"metadata": {
  			"fly-managed-postgres": "true",
  	... // 46 identical lines
  	
? Apply changes? Yes
Identifying cluster role(s)
  Machine 73d8d3d6a72389: error
Postgres cluster has been successfully updated!

I then thought I’d try scaling down and back up, but the scale command errored:

❯ flyctl scale count 2 --config fly/db.toml 
Update available 0.0.463 -> 0.0.474.
Run "flyctl version update" to upgrade.
Error it looks like your app is running on v2 of our platform, and does not support this legacy command: try running fly machine clone instead

The v2 platform doesn’t seem to support the legacy scale command, but there is a machine restart command:

❯ flyctl machine restart 73d8d3d6a72389 --config fly/db.toml              
Update available 0.0.463 -> 0.0.474.
Run "flyctl version update" to upgrade.
Restarting machine 73d8d3d6a72389
  Waiting for 73d8d3d6a72389 to become healthy (started, 3/3)
Machine 73d8d3d6a72389 restarted successfully!

And we’re back to being healthy:

❯ flyctl checks list --config fly/db.toml                   
Update available 0.0.463 -> 0.0.474.
Run "flyctl version update" to upgrade.
Health Checks for solitary-sun-2613
  NAME | STATUS  | MACHINE        | LAST UPDATED         | OUTPUT                                                                   
-------*---------*----------------*----------------------*--------------------------------------------------------------------------
  pg   | passing | 73d8d3d6a72389 | 54s ago              | [✓] transactions: read/write (245.12µs)                                  
       |         |                |                      | [✓] connections: 13 used, 3 reserved, 300 max (5.43ms)                   
-------*---------*----------------*----------------------*--------------------------------------------------------------------------
  role | passing | 73d8d3d6a72389 | 57s ago              | leader                                                                   
-------*---------*----------------*----------------------*--------------------------------------------------------------------------
  vm   | passing | 73d8d3d6a72389 | 2023-02-23T11:08:33Z | [✓] checkDisk: 827.39 MB (84.8%) free space on /data/ (60.61µs)          
       |         |                |                      | [✓] checkLoad: load averages: 0.05 0.16 0.31 (109.21µs)                  
       |         |                |                      | [✓] memory: system spent 0s of the last 60s waiting on memory (37.74µs)  
       |         |                |                      | [✓] cpu: system spent 5.75s of the last 60s waiting on cpu (23.74µs)     
       |         |                |                      | [✓] io: system spent 60ms of the last 60s waiting on io (22.24µs)        
-------*---------*----------------*----------------------*-------------------------------------------------------------------------

I don’t know why this fixed the problem, or what the problem was, but it is now resolved.


I have the same issue, happened these last three days:

% flyctl -a xxx-db pg restart
Error no active leader found

Previous errors have been reported at https://community.fly.io/t/cant-reach-database-server-in-fra/6101/24

This has happened despite my having added a second machine (now I have an instance in cdg and one in ams). In the previous thread @kurt suggested the problem would eventually go away with replication; it seems that’s not sufficient.

edit: the way I was getting out of this jam previously was to restart the db machine. That isn’t working for me right now; every flyctl pg <command> fails with this “no active leader” error, and I’m locked out of my db.

@theo-m What does fly status look like?

Anything in the logs that could be helpful?

 % flyctl -a xxx-db status
ID              STATE   ROLE    REGION  HEALTH CHECKS                   IMAGE                           CREATED                 UPDATED              
xxxxxxxxxxxxxx  started error   cdg     3 total                         flyio/postgres:14.6 (v0.0.34)   2023-01-23T18:05:01Z    2023-03-29T15:22:05Z    
xxxxxxxxxxxxxx  started replica ams     3 total, 2 passing, 1 critical  flyio/postgres:14.6 (v0.0.34)   2023-03-28T13:57:34Z    2023-03-29T15:13:00Z    

I’m unclear which logs those would be: flyctl machine -h doesn’t list a logs command

edit: my bad, running flyctl -a xxx-db logs does yield stuff:

2023-03-29T15:24:05Z app[148e392f7294d8] cdg [info]keeper   | 2023-03-29T15:24:05.962Z  ERROR   cmd/keeper.go:719       cannot get configured pg parameters     {"error": "dial unix /tmp/.s.PGSQL.5433: connect: no such file or directory"}
2023-03-29T15:24:06Z app[178190eb364789] ams [info]keeper   | 2023-03-29 15:24:06.658 UTC [3554] FATAL:  could not connect to the primary server: connection to server at "fdaa:0:4bfd:a7b:5adc:8d36:d998:2", port 5433 failed: Connection refused
2023-03-29T15:24:06Z app[178190eb364789] ams [info]keeper   |           Is the server running on that host and accepting TCP/IP connections?
2023-03-29T15:24:07Z app[148e392f7294d8] cdg [info]checking stolon status
2023-03-29T15:24:07Z app[178190eb364789] ams [info]sentinel | 2023-03-29T15:24:07.944Z  WARN    cmd/sentinel.go:276     no keeper info available        {"db": "18e2f6a7", "keeper": "aebf7088dee22"}
2023-03-29T15:24:07Z app[178190eb364789] ams [info]sentinel | 2023-03-29T15:24:07.950Z  ERROR   cmd/sentinel.go:1018    no eligible masters
2023-03-29T15:24:08Z app[148e392f7294d8] cdg [info]keeper   | 2023-03-29T15:24:08.462Z  ERROR   cmd/keeper.go:719       cannot get configured pg parameters     {"error": "dial unix /tmp/.s.PGSQL.5433: connect: no such file or directory"}
2023-03-29T15:24:08Z app[148e392f7294d8] cdg [info]checking stolon status
2023-03-29T15:24:08Z app[148e392f7294d8] cdg [info]checking stolon status
2023-03-29T15:24:10Z app[148e392f7294d8] cdg [info]keeper   | 2023-03-29 15:24:10.482 UTC [1336] LOG:  starting PostgreSQL 14.6 (Debian 14.6-1.pgdg110+1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 10.2.1-6) 10.2.1 20210110, 64-bit
2023-03-29T15:24:10Z app[148e392f7294d8] cdg [info]keeper   | 2023-03-29 15:24:10.482 UTC [1336] LOG:  listening on IPv6 address "fdaa:0:4bfd:a7b:5adc:8d36:d998:2", port 5433
2023-03-29T15:24:10Z app[148e392f7294d8] cdg [info]keeper   | 2023-03-29 15:24:10.483 UTC [1336] LOG:  listening on Unix socket "/tmp/.s.PGSQL.5433"
2023-03-29T15:24:10Z app[148e392f7294d8] cdg [info]keeper   | 2023-03-29 15:24:10.486 UTC [1337] LOG:  database system shutdown was interrupted; last known up at 2023-03-29 15:24:05 UTC
2023-03-29T15:24:10Z app[148e392f7294d8] cdg [info]keeper   | 2023-03-29 15:24:10.546 UTC [1337] LOG:  database system was not properly shut down; automatic recovery in progress
2023-03-29T15:24:10Z app[148e392f7294d8] cdg [info]keeper   | 2023-03-29 15:24:10.547 UTC [1337] LOG:  redo starts at D/82000028
2023-03-29T15:24:10Z app[148e392f7294d8] cdg [info]keeper   | 2023-03-29 15:24:10.547 UTC [1337] LOG:  redo done at D/82000110 system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
2023-03-29T15:24:10Z app[148e392f7294d8] cdg [info]keeper   | 2023-03-29 15:24:10.556 UTC [1337] PANIC:  could not write to file "pg_wal/xlogtemp.1337": No space left on device
2023-03-29T15:24:10Z app[148e392f7294d8] cdg [info]keeper   | 2023-03-29 15:24:10.557 UTC [1336] LOG:  startup process (PID 1337) was terminated by signal 6: Aborted
2023-03-29T15:24:10Z app[148e392f7294d8] cdg [info]keeper   | 2023-03-29 15:24:10.557 UTC [1336] LOG:  aborting startup due to startup process failure
2023-03-29T15:24:10Z app[148e392f7294d8] cdg [info]keeper   | 2023-03-29 15:24:10.559 UTC [1336] LOG:  database system is shut down
2023-03-29T15:24:10Z app[148e392f7294d8] cdg [info]checking stolon status
2023-03-29T15:24:10Z app[148e392f7294d8] cdg [info]keeper   | 2023-03-29T15:24:10.666Z  ERROR   cmd/keeper.go:1526      failed to start postgres        {"error": "postgres exited unexpectedly"}
2023-03-29T15:24:10Z app[148e392f7294d8] cdg [info]keeper   | 2023-03-29T15:24:10.964Z  ERROR   cmd/keeper.go:719       cannot get configured pg parameters     {"error": "dial unix /tmp/.s.PGSQL.5433: connect: no such file or directory"}
2023-03-29T15:24:11Z app[148e392f7294d8] cdg [info]checking stolon status
2023-03-29T15:24:11Z app[178190eb364789] ams [info]keeper   | 2023-03-29 15:24:11.659 UTC [3576] FATAL:  could not connect to the primary server: connection to server at "fdaa:0:4bfd:a7b:5adc:8d36:d998:2", port 5433 failed: Connection refused

OK, so it looks like it’s a storage thing; I’ve extended the volumes on each machine to 5 GB.
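For reference, that was just a volume extend per machine, roughly this (volume ID is a placeholder; double-check fly volumes extend --help for the size flag):

# find the full volumes
$ fly volumes list -a xxx-db
# grow each one to 5 GB (volumes can only be extended, not shrunk)
$ fly volumes extend <volume-id> -s 5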

Now the machines are no longer panicking, but they still haven’t recovered from the “no active leader” error.

Restarting the machines is not enough; I’m still stuck. I’d expect pg failover to work, but it also complains of no active leader.

I’m still locked out of my db. I’ve tried scaling down to one db machine, up to three, restarting all of them, some of them, etc., and I’m still getting the “no active leader” error.

How I got out of the jam: when booting up a new pg machine, I could open a proxy and pg_dump the contents of the db. I then created a whole new Fly db, restored the data there, and unset DATABASE_URL on my app, which enabled me to attach the new db and delete the old one.
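In concrete terms it was roughly this, from memory rather than a copy-paste; the app names, ports, database names, and credentials below are all placeholders:

# forward the broken cluster's Postgres to localhost while it's still up
$ fly proxy 5432 -a old-db-app
# dump through the proxy (user/password come from the old cluster's credentials)
$ pg_dump "postgres://postgres:<password>@localhost:5432/old_db_name" > dump.sql
# brand new cluster, then point the web app at it
$ fly postgres create --name new-db-app
$ fly secrets unset DATABASE_URL -a my-web-app
$ fly postgres attach new-db-app -a my-web-app
# attach creates a database and user for the app; restore the dump into it over a second proxy
$ fly proxy 5433:5432 -a new-db-app
$ psql "postgres://my_web_app:<password>@localhost:5433/my_web_app" < dump.sql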

I’ve also set up a neon.tech account, because while I’m a big fan of Fly, I don’t think an “unmanaged” db experience should entail these kinds of surprises. Maybe I’m wrong, but it feels like this “no active leader” issue is not a Postgres issue but really a Fly orchestrator thing.

Things I think you could improve (it’s an outsider take, and maybe I’m misunderstanding things, but it’s a good-faith effort):

  • I don’t expect flyctl pg failover to fail when there’s no active leader; isn’t the whole purpose of a failover to elect a new leader?
  • The “no active leader” state isn’t echoed anywhere in the UI; things are displayed as healthy. Maybe the db health checks should look for it?
  • Allow manually electing a leader machine?

Man, you just saved me. Thanks a lot; I spent almost 2 hours trying to fix this and running into the same issues. flyctl machine restart my_db_id did the job perfectly. :pray:


PS: I had similar troubles. My fly.toml is just for the web app, so flyctl was getting confused.

Not working
fly pg restart -a my_db_app_name
Error: no active leader found

fly machine restart my_db_id
Error: machine my_db_id was not found in app ‘my_web_app_name’

To make it work, I ran:
fly machine restart -a my_db_app_name my_db_id
The -a flag overrides the app name from fly.toml.