My apps are gone: Could not resolve App

redrabbit · August 26, 2021, 3:36pm

Hi there.

I was experimenting a bit with --build-arg and environment variables (see Setting environment variables in fly.toml and on the CLI when deploying) when suddenly all my apps got destroyed.

I had two apps:

git-limo (running Elixir code)
git-limo-db (Postgres cluster)

Both apps disappeared from my personal account. I can't find them via the CLI or the web interface.

The last command I ran was flyctl deploy --build-arg APP_REVISION=$(git rev-parse --short HEAD).
It did return successfully without any errors.
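
For reference, that deploy step could be scripted roughly as follows. This is only a sketch, assuming a POSIX shell: the helper names `app_revision` and `deploy_command` and the `unknown` fallback are made up for illustration, and the script prints the command rather than running it.

```shell
#!/bin/sh
# Sketch only: helper names and the "unknown" fallback are assumptions.

# Print the short commit hash, or "unknown" outside a git checkout.
app_revision() {
  git rev-parse --short HEAD 2>/dev/null || echo "unknown"
}

# Print (do not run) the deploy command for a given revision.
deploy_command() {
  printf 'flyctl deploy --build-arg APP_REVISION=%s\n' "$1"
}

deploy_command "$(app_revision)"
```

Resolving the revision in its own step makes it easy to see exactly which value is handed to `--build-arg` before anything is deployed.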

jerome · August 26, 2021, 3:45pm

Hello, this is likely due to: Fly.io Status - Apps with volumes deployed to one of our FRA hosts are temporarily unavailable

We’re currently resolving the issue. It shouldn’t be too long. Very sorry about that.

redrabbit · August 26, 2021, 3:49pm

Alright! I did check status.flyio.net right after discovering my apps were gone, but didn't see the notification. Bad timing, I guess.

No worries at all and thanks for the fast response

jerome · August 26, 2021, 3:53pm

We were still working out what happened at the time. We’ve started recovering apps and volumes now.

vicente · August 26, 2021, 4:21pm

Not sure if this is related, but some of my instances in LAX/SJC are gone as well.

(edited) I’m using the latest fly cli, and scaling by process type seems not to be taking effect either:

flyctl scale count worker=4 web=1 ...

Let me know if you need more details. Here are some traces I found interesting:

2021-08-26T16:15:16.006546548Z app[810069d1] sjc [info] I, [2021-08-26T16:15:16.004636 #510]  INFO -- : [aaa9d3b7-54d6-414f-a833-f16cde4990f2] Started GET "/sidekiq/busy" for 213.188.195.106 at 2021-08-26 16:15:16 +0000
2021-08-26T16:15:16.419634386Z runner[a23a8990] sjc [info] Pull failed, retrying (attempt #0)
2021-08-26T16:15:16.940620061Z runner[a23a8990] sjc [info] Pull failed, retrying (attempt #1)
2021-08-26T16:15:17.102324699Z runner[a23a8990] sjc [info] Pull failed, retrying (attempt #2)
2021-08-26T16:15:17.102327744Z runner[a23a8990] sjc [info] Pulling image failed

Deployments also get stuck pushing the image to fly.

--> Building image done
==> Pushing image to fly
The push refers to repository [registry.fly.io/web-staging]
5ee8a1862a5d: Retrying in 2 seconds
d644e28b0154: Retrying in 2 seconds
888ed16fa8d4: Retrying in 1 second
80c1258fbf48: Retrying in 20 seconds
d2326890c315: Retrying in 20 seconds
c1d4ea37a38e: Waiting
6b0a86ff36bc: Waiting
cf1658786d7f: Waiting
0a591661a0ee: Waiting
5629057d1dc8: Waiting
f6301628e67e: Waiting
7555a8182c42: Waiting
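
While pushes are stalling like this, one client-side option is to wrap the deploy in a retry loop with backoff. A minimal sketch, assuming a POSIX shell: the `retry` helper and the `RETRY_DELAY` variable are hypothetical names, not part of flyctl, and retrying cannot help while the registry itself is down.

```shell
#!/bin/sh
# Sketch only: retry and RETRY_DELAY are illustrative, not flyctl features.

# retry MAX CMD...: run CMD up to MAX times, doubling the pause each time.
retry() {
  max=$1
  shift
  delay=${RETRY_DELAY:-2}   # initial pause in seconds
  attempt=1
  while :; do
    "$@" && return 0
    if [ "$attempt" -ge "$max" ]; then
      echo "giving up after $attempt attempts" >&2
      return 1
    fi
    echo "attempt $attempt failed, retrying in ${delay}s" >&2
    sleep "$delay"
    delay=$((delay * 2))
    attempt=$((attempt + 1))
  done
}

# Example invocation (not run here): retry 5 flyctl deploy
```

The doubling delay keeps the loop from hammering a service that is already struggling.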

jerome · August 26, 2021, 4:39pm

Our registry, and therefore all pushes to us, were also affected. I forgot to tick that box on the status page incident.

Did your app have volumes and did it have one in FRA? There’s a chance it was affected if it was on the unlucky host.

mo.rajbi · August 26, 2021, 4:40pm

My app is gone too. This is our production server. Is there an ETA when it’ll be up?
Or should I deploy again as a new app?
I don’t need volumes in FRA.

vicente · August 26, 2021, 4:43pm

No, all apps and volumes are on the US West Coast. Mostly LAX, SJC, and SEA. Instances are back, although scaling processes still has no effect.

Thank you for mitigating this so quickly!

jerome · August 26, 2021, 4:48pm

It would appear you had a volume in FRA and so your app was affected. Can you try deploying, but not as a new app? You can make a new app if you want to replace the old one, but we will still restore the old one.

jerome · August 26, 2021, 4:50pm

@redrabbit can you try redeploying this app?

gikappa · August 26, 2021, 4:54pm

I had a Postgres app (created with fly postgres) with a volume in FRA and it’s still gone (can’t find it with fly apps list, can’t find it with fly postgres list). Should I just keep waiting?

Edit: ah, never mind, it just came back.

Edit again: the app is now shown with fly apps list but not with fly postgres list…

kurt · August 26, 2021, 5:35pm

@gikappa I think you posted right as we were bringing the postgres clusters back up. We just fixed the fly postgres list issue as well, so you should see it there now.

redrabbit · August 26, 2021, 5:36pm

Hi @jerome.

Both apps are running again. When I try to redeploy git-limo, I get the following error:

--> Building image done
==> Pushing image to fly
The push refers to repository [registry.fly.io/git-limo]
86474f501502: Layer already exists 
3443085fac84: Layer already exists 
320eeb96f1f8: Layer already exists 
e27db0ec1ca9: Layer already exists 
a228ad1d8b42: Layer already exists 
275bcf520bc2: Layer already exists 
e6648a72d01b: Layer already exists 
afbc0a5c0cbf: Layer already exists 
2a70d45d4f05: Layer already exists 
30616763fb70: Layer already exists 
72e830a4dff5: Layer already exists 
deployment-1629999127: digest: sha256:7c3f09ac7d1030675432a7d6504044a45c9ffd7fcef2a6e327b0d282f31bc5a3 size: 2620
--> Pushing image done
Image: registry.fly.io/git-limo:deployment-1629999127
Image size: 130 MB
==> Creating release

Error An unknown error occured.

But somehow the app still gets deployed. I can ssh into my instance and check for the right APP_REVISION…

kurt · August 26, 2021, 5:38pm

@redrabbit we might have just fixed that error. Will you try another deploy and see? We removed some now-stale database records that could potentially conflict with new deploys.

redrabbit · August 26, 2021, 5:43pm

Ok, just redeployed and voilà:

-> Pushing image done
Image: registry.fly.io/git-limo:deployment-1629999617
Image size: 130 MB
==> Creating release
Release v4 created

You can detach the terminal anytime without stopping the deployment
Monitoring Deployment

1 desired, 1 placed, 1 healthy, 0 unhealthy
--> v4 deployed successfully

kurt · August 26, 2021, 5:45pm

Phew, very sorry about that. It was my fault, too, so extra thanks for being patient.

mo.rajbi · August 26, 2021, 6:20pm

Our app is up after redeploy. Thanks.