Possible registry outages?

I'm getting unhealthy deployment errors, and when I check the instance logs I see the following:

2021-11-10T00:11:46.824 runner[a326d5ac] lhr [info] Starting instance
2021-11-10T00:11:46.853 runner[a326d5ac] lhr [info] Configuring virtual machine
2021-11-10T00:11:46.855 runner[a326d5ac] lhr [info] Pulling container image
2021-11-10T00:13:46.903 runner[a326d5ac] lhr [info] Pull failed, retrying (attempt #0)
2021-11-10T00:15:46.932 runner[a326d5ac] lhr [info] Pull failed, retrying (attempt #1)
2021-11-10T00:17:46.962 runner[a326d5ac] lhr [info] Pull failed, retrying (attempt #2)
2021-11-10T00:17:46.962 runner[a326d5ac] lhr [info] Pulling image failed

Sorry for the trouble. We’re looking into a potential issue with our Docker registries. We’ll report back as soon as we have more information.

Can you try again now? We’ve fixed the problem with the registry.

Still experiencing the issue

2021-11-10T01:20:26.467 runner[8d92de86] syd [info] Starting instance
2021-11-10T01:20:26.490 runner[8d92de86] syd [info] Configuring virtual machine
2021-11-10T01:20:26.491 runner[8d92de86] syd [info] Pulling container image
2021-11-10T01:22:26.683 runner[8d92de86] syd [info] Pull failed, retrying (attempt #0)
2021-11-10T01:24:26.786 runner[8d92de86] syd [info] Pull failed, retrying (attempt #1)
2021-11-10T01:26:26.890 runner[8d92de86] syd [info] Pull failed, retrying (attempt #2)
2021-11-10T01:26:26.890 runner[8d92de86] syd [info] Pulling image failed

The first node in the deployment started successfully, but the rest hit the same pull failure. That made things worse: some of the previous nodes had already been shut down, and the rollback was unable to restore them.

We’re looking at this now. It may have cleared up (possibly a network issue outside of North America).

Please do try again.

I’m still seeing this in the dfw region:

flyctl --app foo logs
2021-11-10T02:17:20.743 runner[13dffc63] dfw [info] Starting instance
2021-11-10T02:17:20.771 runner[13dffc63] dfw [info] Configuring virtual machine
2021-11-10T02:17:20.772 runner[13dffc63] dfw [info] Pulling container image
2021-11-10T02:19:21.004 runner[13dffc63] dfw [info] Pull failed, retrying (attempt #0)
2021-11-10T02:21:21.145 runner[13dffc63] dfw [info] Pull failed, retrying (attempt #1)
2021-11-10T02:23:21.219 runner[13dffc63] dfw [info] Pull failed, retrying (attempt #2)
2021-11-10T02:23:21.219 runner[13dffc63] dfw [info] Pulling image failed

Still occurring

2021-11-10T02:27:43.222 runner[5745c593] lax [info] Starting instance
2021-11-10T02:27:43.290 runner[5745c593] lax [info] Configuring virtual machine
2021-11-10T02:27:43.292 runner[5745c593] lax [info] Pulling container image
2021-11-10T02:28:43.307 runner[5745c593] lax [info] Pull failed, retrying (attempt #0)
2021-11-10T02:29:43.321 runner[5745c593] lax [info] Pull failed, retrying (attempt #1)
2021-11-10T02:30:43.332 runner[5745c593] lax [info] Pull failed, retrying (attempt #2)
2021-11-10T02:30:43.332 runner[5745c593] lax [info] Pulling image failed
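
For anyone comparing excerpts like the ones above across regions, here is a minimal Python sketch that parses lines in this shape and reports which runners gave up on the pull. The regular expression is only inferred from the lines posted in this thread, and the function names are hypothetical; this is not an official log format or tool.

```python
import re
from datetime import datetime

# Line shape inferred from the excerpts in this thread (an assumption, not a spec):
# <timestamp> runner[<id>] <region> [<level>] <message>
LINE_RE = re.compile(
    r"^(?P<ts>\S+) runner\[(?P<runner>[0-9a-f]+)\] "
    r"(?P<region>\w+) \[(?P<level>\w+)\] (?P<msg>.*)$"
)


def parse_line(line):
    """Parse one log line into a dict, or return None if it does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    record = match.groupdict()
    record["ts"] = datetime.strptime(record["ts"], "%Y-%m-%dT%H:%M:%S.%f")
    return record


def failed_pulls(lines):
    """Return {(region, runner): retry_count} for runners whose pull gave up."""
    retries = {}
    failed = set()
    for line in lines:
        record = parse_line(line)
        if record is None:
            continue  # skip anything that is not a runner log line
        key = (record["region"], record["runner"])
        if record["msg"].startswith("Pull failed, retrying"):
            retries[key] = retries.get(key, 0) + 1
        elif record["msg"] == "Pulling image failed":
            failed.add(key)
    return {key: retries.get(key, 0) for key in failed}


# The lax excerpt from the post above, used as sample input.
SAMPLE = """\
2021-11-10T02:27:43.222 runner[5745c593] lax [info] Starting instance
2021-11-10T02:27:43.290 runner[5745c593] lax [info] Configuring virtual machine
2021-11-10T02:27:43.292 runner[5745c593] lax [info] Pulling container image
2021-11-10T02:28:43.307 runner[5745c593] lax [info] Pull failed, retrying (attempt #0)
2021-11-10T02:29:43.321 runner[5745c593] lax [info] Pull failed, retrying (attempt #1)
2021-11-10T02:30:43.332 runner[5745c593] lax [info] Pull failed, retrying (attempt #2)
2021-11-10T02:30:43.332 runner[5745c593] lax [info] Pulling image failed
""".splitlines()

summary = failed_pulls(SAMPLE)
for (region, runner), count in sorted(summary.items()):
    print(f"{region} runner[{runner}]: pull failed after {count} retries")
```

Saved log output from several regions can be fed to failed_pulls the same way to see at a glance which regions are affected.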

Did that roll back a deploy, or is the version you want running now?

If pull issues are breaking deploys (these are somewhat intermittent), you can get going by disabling auto rollback in your fly.toml:

[experimental]
  auto_rollback = false

With auto rollback disabled, a VM that fails during a deploy leaves the rest in place. You can then run fly status to see outdated VMs, and try stopping them one by one with fly vm stop <id> so they get updated to the newest version.
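
If there are more than a couple of outdated VMs, the one-by-one stop could be scripted. The following is a rough Python sketch, not an official tool: the VM IDs are placeholders you would copy by hand from fly status, the app name foo is the one used earlier in this thread, and the --app flag placement simply mirrors the flyctl --app foo logs command above. The helper names and the pause length are assumptions. It defaults to a dry run that only prints the commands.

```python
import subprocess
import time


def build_stop_commands(vm_ids, app):
    """Return one `fly vm stop` argument list per VM ID, in the given order."""
    return [["fly", "--app", app, "vm", "stop", vm_id] for vm_id in vm_ids]


def stop_one_by_one(vm_ids, app, pause_seconds=30, dry_run=True):
    """Stop VMs sequentially, pausing between stops.

    With dry_run=True (the default) the commands are only printed.
    Returns the list of commands that were printed or executed.
    """
    commands = build_stop_commands(vm_ids, app)
    for index, command in enumerate(commands):
        if dry_run:
            print("would run:", " ".join(command))
            continue
        # check=True raises if fly exits non-zero, so the loop halts on the first failure.
        subprocess.run(command, check=True)
        if index < len(commands) - 1:
            # Assumed pause so each replacement VM has time to start.
            time.sleep(pause_seconds)
    return commands


# Placeholder IDs: replace with the outdated VM IDs that fly status reports.
planned = stop_one_by_one(["aaaa1111", "bbbb2222"], app="foo")
```

Pass dry_run=False only after checking the printed commands. Stopping sequentially rather than all at once keeps the remaining VMs serving traffic if a pull fails again partway through.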

It seems to be fixed now, managed to deploy without issues.

Ok well I guess the trick is not to say “hey it’s fixed, try again”. :confused:

We’re still monitoring for these kinds of errors. Feel free to post if you hit another.