Service crashed

Hey, one of my services was down for a few hours tonight and the logs are not really helpful. We didn’t do any deployments today. I am just wondering what went wrong so this does not happen again in the future.

2021-09-22T13:38:01.797483351Z app[49b3f31e] lhr [info] Starting clean up.
2021-09-22T13:55:23.575983189Z runner[b9f44408] fra [info] Starting instance
2021-09-22T13:55:23.589547526Z runner[b9f44408] fra [info] Configuring virtual machine
2021-09-22T13:55:23.590396848Z runner[b9f44408] fra [info] Pulling container image
2021-09-22T13:55:23.615070522Z runner[b9f44408] fra [info] Pull failed, retrying (attempt #0)
2021-09-22T13:55:23.615907591Z runner[b9f44408] fra [info] Pull failed, retrying (attempt #1)
2021-09-22T13:55:23.616668486Z runner[b9f44408] fra [info] Pull failed, retrying (attempt #2)
2021-09-22T13:55:23.616669899Z runner[b9f44408] fra [info] Pulling image failed
2021-09-22T13:55:39.227305306Z runner[26c46279] ams [info] Shutting down virtual machine
2021-09-22T13:55:39.443965009Z app[26c46279] ams [info] Sending signal SIGINT to main child process w/ PID 508
2021-09-22T13:55:40.102582083Z runner[6d6f60e3] ams [info] Starting instance
2021-09-22T13:55:40.116137573Z runner[6d6f60e3] ams [info] Configuring virtual machine
2021-09-22T13:55:40.117060805Z runner[6d6f60e3] ams [info] Pulling container image
2021-09-22T13:55:40.134440903Z runner[6d6f60e3] ams [info] Pull failed, retrying (attempt #0)
2021-09-22T13:55:40.136173963Z runner[6d6f60e3] ams [info] Pull failed, retrying (attempt #1)
2021-09-22T13:55:40.137045207Z runner[6d6f60e3] ams [info] Pull failed, retrying (attempt #2)
2021-09-22T13:55:40.137046059Z runner[6d6f60e3] ams [info] Pulling image failed
2021-09-22T13:55:40.448812017Z app[26c46279] ams [info] Main child exited with signal (with signal 'SIGINT', core dumped? false)
2021-09-22T13:55:40.449528204Z app[26c46279] ams [info] Starting clean up.
2021-09-22T23:17:20.486944842Z runner[bcd74819] ams [info] Starting instance
2021-09-22T23:17:20.501424094Z runner[bcd74819] ams [info] Configuring virtual machine
2021-09-22T23:17:20.502273788Z runner[bcd74819] ams [info] Pulling container image
2021-09-22T23:17:33.131663622Z runner[bcd74819] ams [info] Unpacking image
2021-09-22T23:17:42.120792114Z runner[bcd74819] ams [info] Preparing kernel init
2021-09-22T23:17:42.487085407Z runner[bcd74819] ams [info] Configuring firecracker
2021-09-22T23:17:42.594612151Z runner[bcd74819] ams [info] Starting virtual machine
2021-09-22T23:17:42.698575518Z app[bcd74819] ams [info] Starting init (commit: 50ffe20)...
2021-09-22T23:17:42.713231645Z app[bcd74819] ams [info] Preparing to run: ` yarn start` as root
2021-09-22T23:17:42.725954832Z app[bcd74819] ams [info] 2021/09/22 23:17:42 listening on [fdaa:0:137a:a7b:23c2:bcd7:4819:2]:22 (DNS: [fdaa::3]:53)

On the dashboard, the status was set to dead. After manually redeploying the service from my local machine (with the same code as before), everything is working again. The service had been running without problems for a few months.
Did you have an issue on your side at that time?

Those image pull failures are super weird. We’re investigating to see what might have happened.

So we figured out what happened: a Docker registry issue prevented us from pulling images for a few apps. Normally we would keep your old VMs around even if new ones couldn’t boot, but your app scales somewhat frequently. When an app scales, it stops VMs that are more than one version out of date. This is poor behavior; it should not work this way.
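
To make that failure mode concrete, here is a minimal sketch of how the rule above can leave an app with no running VMs. It is a toy model, not the platform's actual scheduler code: the `VM` class, `scale_event`, and `simulate` are hypothetical names, and it assumes every scale event bumps the app version by one.

```python
from dataclasses import dataclass


@dataclass
class VM:
    id: str
    version: int


def scale_event(vms, current_version, new_vm_boots):
    """Toy model of one scale event (hypothetical, for illustration only)."""
    # Keep only VMs that are at most one version behind the current one.
    survivors = [vm for vm in vms if current_version - vm.version <= 1]
    # A replacement only appears if its image pull succeeds.
    if new_vm_boots:
        survivors.append(VM(id=f"new-v{current_version}", version=current_version))
    return survivors


def simulate(events, pull_works):
    """Run several scale events starting from a single VM at version 1."""
    vms = [VM(id="old", version=1)]
    version = 1
    for _ in range(events):
        version += 1  # assumption: each scale event creates a new version
        vms = scale_event(vms, version, pull_works)
    return vms
```

With pulls failing, `simulate(1, False)` still returns the old VM because it is only one version behind. `simulate(2, False)` returns an empty list: the second event stops the old VM and nothing replaces it.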

We are working on a longer-term project that will solve many of these problems. For now, we’ve added really aggressive monitoring to the registry, which should catch these kinds of errors if they happen again.
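
If you want to watch for the same pattern on your side, something like the following could scan exported log lines for instances whose image pull gave up. This is only a sketch and is separate from the registry monitoring mentioned above: the line format is inferred from the logs pasted in this thread, and `LINE_RE` and `failed_pulls` are hypothetical names.

```python
import re

# Assumed line shape, taken from the logs in this thread:
# <timestamp> <source>[<instance>] <region> [<level>] <message>
LINE_RE = re.compile(
    r"^(?P<ts>\S+)\s+(?P<source>\w+)\[(?P<instance>[0-9a-f]+)\]\s+"
    r"(?P<region>\w+)\s+\[(?P<level>\w+)\]\s+(?P<msg>.*)$"
)


def failed_pulls(lines):
    """Return (timestamp, instance, region) for each instance whose pull gave up."""
    failures = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        # "Pulling image failed" is the final message after the retries run out.
        if match and match["msg"] == "Pulling image failed":
            failures.append((match["ts"], match["instance"], match["region"]))
    return failures
```

Feeding it the lines from the original post would flag instances `b9f44408` in `fra` and `6d6f60e3` in `ams`, the two starts that never got past the pull step.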
