Deployment stuck at "Running release task (running)..."

My deployment is stuck at the release task, which performs a database migration:

==> Release command
Command: /bin/sh /opt/app/migrate.sh
	 Starting instance
	 Configuring virtual machine
	 Pulling container image
	 Unpacking image
	 Preparing kernel init
	 Configuring firecracker
	 Starting virtual machine
	 Starting init (commit: 7943db6)...
	 Preparing to run: `/bin/sh /opt/app/migrate.sh` as nobody
	 2021/12/17 10:42:22 listening on [fdaa:0:3610:a7b:2809:a51a:e344:2]:22 (DNS: [fdaa::3]:53)
	 Running DB migration release task 💈
	 Reaped child process with pid: 566 and signal: SIGUSR1, core dumped? false
	 10:42:24.390 [info] migrating repos for matchhaus
	 Reaped child process with pid: 568 and signal: SIGUSR1, core dumped? false
Running release task (running)...⣯

The deploy command just hangs at "Running release task (running)..." and never exits.

The logs suggest the migration is making progress, but the command never completes:

2021-12-17T10:42:21.478 runner[a51ae344] lhr [info] Configuring virtual machine
2021-12-17T10:42:21.480 runner[a51ae344] lhr [info] Pulling container image
2021-12-17T10:42:21.771 runner[a51ae344] lhr [info] Unpacking image
2021-12-17T10:42:21.778 runner[a51ae344] lhr [info] Preparing kernel init
2021-12-17T10:42:22.246 runner[a51ae344] lhr [info] Configuring firecracker
2021-12-17T10:42:22.246 runner[a51ae344] lhr [info] Starting virtual machine
2021-12-17T10:42:22.390 app[a51ae344] lhr [info] Starting init (commit: 7943db6)...
2021-12-17T10:42:22.414 app[a51ae344] lhr [info] Preparing to run: `/bin/sh /opt/app/migrate.sh` as nobody
2021-12-17T10:42:22.424 app[a51ae344] lhr [info] 2021/12/17 10:42:22 listening on [fdaa:0:3610:a7b:2809:a51a:e344:2]:22 (DNS: [fdaa::3]:53)
2021-12-17T10:42:22.430 app[a51ae344] lhr [info] Running DB migration release task 💈
2021-12-17T10:42:23.422 app[a51ae344] lhr [info] Reaped child process with pid: 566 and signal: SIGUSR1, core dumped? false
2021-12-17T10:42:24.394 app[a51ae344] lhr [info] 10:42:24.390 [info] migrating repos for matchhaus
2021-12-17T10:42:24.424 app[a51ae344] lhr [info] Reaped child process with pid: 568 and signal: SIGUSR1, core dumped? false

In this case the migration command should be a no-op, as there are no migrations to perform.

It’s strange that it’s not working, as I have a staging instance that shares the same Postgres instance (different database), and that one deployed with no problem.

Any ideas on how to debug this?

It looks like the command hung for about 30 minutes and then crashed. Here are the rest of the logs:

Reaped child process with pid: 568 and signal: SIGUSR1, core dumped? false
** (DBConnection.ConnectionError) tcp recv: closed
    (ecto_sql 3.5.3) lib/ecto/adapters/sql.ex:751: Ecto.Adapters.SQL.raise_sql_call_error/1
    (ecto_sql 3.5.3) lib/ecto/adapters/sql.ex:684: Ecto.Adapters.SQL.execute/5
    (ecto 3.5.8) lib/ecto/repo/queryable.ex:229: Ecto.Repo.Queryable.execute/4
    (ecto 3.5.8) lib/ecto/repo/queryable.ex:17: Ecto.Repo.Queryable.all/3
    (ecto_sql 3.5.3) lib/ecto/migrator.ex:514: anonymous fn/6 in Ecto.Migrator.lock_for_migrations/4
    (ecto_sql 3.5.3) lib/ecto/adapters/sql.ex:1027: anonymous fn/3 in Ecto.Adapters.SQL.checkout_or_transaction/4
    (db_connection 2.3.0) lib/db_connection.ex:1426: DBConnection.run_transaction/4
    (ecto_sql 3.5.3) lib/ecto/adapters/sql.ex:854: Ecto.Adapters.SQL.lock_for_migrations/5
Main child exited normally with code: 1
Reaped child process with pid: 570 and signal: SIGUSR1, core dumped? false
Starting clean up.

I’m not sure what actually went wrong there; maybe a lock or deadlock in the database? The stack trace shows the connection dropped inside Ecto.Migrator.lock_for_migrations, which locks the schema_migrations table before running anything, so a lock left over from an earlier deploy could make it hang there.

It could be. There was previously a migration that failed because existing data violated an index created in the migration. I’ve since removed the offending migration as it wasn’t needed, and since it had already failed I figured it wouldn’t cause a problem.

Any ideas how to debug this?

I’d disable the release_command, deploy your code, and then fly ssh console in and run the migration manually. A 30-minute hang is hard to debug without a shell.
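
Something along these lines should work (just a sketch, assuming the release command lives under [deploy] in fly.toml and reusing the migrate.sh path from the logs above):

    # comment out release_command in fly.toml under [deploy], then:
    fly deploy
    # run the migration by hand inside the VM and watch where it stalls
    fly ssh console -C "/bin/sh /opt/app/migrate.sh"
    # or open an interactive shell instead
    fly ssh console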

OK, I’m pretty sure it was a DB lock problem from a previous deploy that timed out.

I couldn’t seem to kill the offending processes with pg_cancel_backend or pg_terminate_backend; in the end I solved it with a brutal fly restart of the database.

It might have caused a brief moment of downtime and potential data loss, but I’m OK with that!
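
For anyone who hits the same thing: before restarting the whole database, a query along these lines (a sketch; assumes Postgres 9.6+ for pg_blocking_pids, and "matchhaus" is only a guess at the database name) can show which session is actually holding the lock, so you can terminate the blocker rather than the stuck migration:

    -- list sessions on the app database and who is blocking whom
    SELECT pid, usename, state, wait_event_type, wait_event,
           pg_blocking_pids(pid) AS blocked_by,
           left(query, 60) AS query
    FROM pg_stat_activity
    WHERE datname = 'matchhaus'          -- assumption: your database name
      AND pid <> pg_backend_pid();

    -- then terminate the pid reported in blocked_by, not the waiting migration
    SELECT pg_terminate_backend(12345);  -- replace 12345 with the blocking pid

pg_cancel_backend only interrupts a running query, so if the blocker is sitting idle in a transaction while holding the lock, pg_terminate_backend on that specific pid is usually the one that works.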