Deployment stuck at "Running release task (running)..."

My deployment is stuck at the release task, which performs a database migration:

==> Release command
Command: /bin/sh /opt/app/migrate.sh
	 Starting instance
	 Configuring virtual machine
	 Pulling container image
	 Unpacking image
	 Preparing kernel init
	 Configuring firecracker
	 Starting virtual machine
	 Starting init (commit: 7943db6)...
	 Preparing to run: `/bin/sh /opt/app/migrate.sh` as nobody
	 2021/12/17 10:42:22 listening on [fdaa:0:3610:a7b:2809:a51a:e344:2]:22 (DNS: [fdaa::3]:53)
	 Running DB migration release task 💈
	 Reaped child process with pid: 566 and signal: SIGUSR1, core dumped? false
	 10:42:24.390 [info] migrating repos for matchhaus
	 Reaped child process with pid: 568 and signal: SIGUSR1, core dumped? false
Running release task (running)...⣯

The deploy command just hangs at "Running release task (running)..." and never exits.

The logs suggest the migration is making progress, but the command never completes:

2021-12-17T10:42:21.478 runner[a51ae344] lhr [info] Configuring virtual machine
2021-12-17T10:42:21.480 runner[a51ae344] lhr [info] Pulling container image
2021-12-17T10:42:21.771 runner[a51ae344] lhr [info] Unpacking image
2021-12-17T10:42:21.778 runner[a51ae344] lhr [info] Preparing kernel init
2021-12-17T10:42:22.246 runner[a51ae344] lhr [info] Configuring firecracker
2021-12-17T10:42:22.246 runner[a51ae344] lhr [info] Starting virtual machine
2021-12-17T10:42:22.390 app[a51ae344] lhr [info] Starting init (commit: 7943db6)...
2021-12-17T10:42:22.414 app[a51ae344] lhr [info] Preparing to run: `/bin/sh /opt/app/migrate.sh` as nobody
2021-12-17T10:42:22.424 app[a51ae344] lhr [info] 2021/12/17 10:42:22 listening on [fdaa:0:3610:a7b:2809:a51a:e344:2]:22 (DNS: [fdaa::3]:53)
2021-12-17T10:42:22.430 app[a51ae344] lhr [info] Running DB migration release task 💈
2021-12-17T10:42:23.422 app[a51ae344] lhr [info] Reaped child process with pid: 566 and signal: SIGUSR1, core dumped? false
2021-12-17T10:42:24.394 app[a51ae344] lhr [info] 10:42:24.390 [info] migrating repos for matchhaus
2021-12-17T10:42:24.424 app[a51ae344] lhr [info] Reaped child process with pid: 568 and signal: SIGUSR1, core dumped? false

In this case the migration command should be a no-op, as there are no migrations to perform.

It’s strange that it’s not working, as I have a staging instance that shares the same Postgres instance (different database), and that one deployed with no problem.

Any ideas on how to debug this?

It looks like the command hung for about 30 minutes and then crashed. Here are the rest of the logs:

Reaped child process with pid: 568 and signal: SIGUSR1, core dumped? false
** (DBConnection.ConnectionError) tcp recv: closed
    (ecto_sql 3.5.3) lib/ecto/adapters/sql.ex:751: Ecto.Adapters.SQL.raise_sql_call_error/1
    (ecto_sql 3.5.3) lib/ecto/adapters/sql.ex:684: Ecto.Adapters.SQL.execute/5
    (ecto 3.5.8) lib/ecto/repo/queryable.ex:229: Ecto.Repo.Queryable.execute/4
    (ecto 3.5.8) lib/ecto/repo/queryable.ex:17: Ecto.Repo.Queryable.all/3
    (ecto_sql 3.5.3) lib/ecto/migrator.ex:514: anonymous fn/6 in Ecto.Migrator.lock_for_migrations/4
    (ecto_sql 3.5.3) lib/ecto/adapters/sql.ex:1027: anonymous fn/3 in Ecto.Adapters.SQL.checkout_or_transaction/4
    (db_connection 2.3.0) lib/db_connection.ex:1426: DBConnection.run_transaction/4
    (ecto_sql 3.5.3) lib/ecto/adapters/sql.ex:854: Ecto.Adapters.SQL.lock_for_migrations/5
Main child exited normally with code: 1
Reaped child process with pid: 570 and signal: SIGUSR1, core dumped? false
Starting clean up.

I’m not sure what actually went wrong there; maybe a lock or deadlock in the database? The stack trace shows the connection dropped inside Ecto.Migrator.lock_for_migrations, which locks the schema_migrations table before running anything, so a lock left over from an earlier deploy could make it hang there.

It could be. There was previously a migration that failed because existing data violated an index created in the migration. I’ve since removed the offending migration as it wasn’t needed, and since it had already failed I figured it wouldn’t cause a problem.

Any ideas how to debug this?

I’d disable the release_command, deploy your code, and then fly ssh console in and run the migration manually. A 30-minute hang is hard to debug without a shell.
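
Something along these lines should work (just a sketch, assuming the release command lives under [deploy] in fly.toml and reusing the migrate.sh path from the logs above):

    # comment out release_command in fly.toml under [deploy], then:
    fly deploy
    # run the migration by hand inside the VM and watch where it stalls
    fly ssh console -C "/bin/sh /opt/app/migrate.sh"
    # or open an interactive shell instead
    fly ssh console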

OK, I’m pretty sure it was a DB lock problem from a previous deploy that timed out.

I couldn’t seem to kill the offending processes with pg_cancel_backend or pg_terminate_backend; in the end I solved it with a brutal fly restart of the database.

It might have caused a brief moment of downtime and potential data loss, but I’m OK with that!
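
For anyone who hits the same thing: before restarting the whole database, a query along these lines (a sketch; assumes Postgres 9.6+ for pg_blocking_pids, and "matchhaus" is only a guess at the database name) can show which session is actually holding the lock, so you can terminate the blocker rather than the stuck migration:

    -- list sessions on the app database and who is blocking whom
    SELECT pid, usename, state, wait_event_type, wait_event,
           pg_blocking_pids(pid) AS blocked_by,
           left(query, 60) AS query
    FROM pg_stat_activity
    WHERE datname = 'matchhaus'          -- assumption: your database name
      AND pid <> pg_backend_pid();

    -- then terminate the pid reported in blocked_by, not the waiting migration
    SELECT pg_terminate_backend(12345);  -- replace 12345 with the blocking pid

pg_cancel_backend only interrupts a running query, so if the blocker is sitting idle in a transaction while holding the lock, pg_terminate_backend on that specific pid is usually the one that works.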