App still unavailable since FRA incident on August 26 (redeploy always fails)

Hey, thanks for the great work!

On August 26, there was an incident affecting apps with volumes deployed to one of your FRA hosts (see the incident entry). My Django app was affected. Since the incident, I have not been able to get my app back online (the app is not reachable, the health check fails, and the app is marked as "dead").

I have tried to redeploy my app multiple times (it is now on v13 since the incident), but every deploy fails.

Here are the relevant logs:

2021-08-28T12:09:49.512654639Z runner[86c2746d] fra [info] Starting instance
2021-08-28T12:09:49.554131034Z runner[86c2746d] fra [info] Configuring virtual machine
2021-08-28T12:09:49.559507494Z runner[86c2746d] fra [info] Pulling container image
2021-08-28T12:09:52.127091157Z runner[86c2746d] fra [info] Unpacking image
2021-08-28T12:09:52.137573237Z runner[86c2746d] fra [info] Preparing kernel init
2021-08-28T12:09:52.328857454Z runner[86c2746d] fra [info] Setting up volume 'persistent_files'
2021-08-28T12:09:52.334360192Z runner[86c2746d] fra [info] Opening encrypted volume
2021-08-28T12:09:52.595783531Z runner[86c2746d] fra [info] Configuring firecracker
2021-08-28T12:09:52.698982147Z runner[86c2746d] fra [info] Starting virtual machine
2021-08-28T12:09:52.871723338Z app[86c2746d] fra [info] Starting init (commit: 721b5c7)...
2021-08-28T12:09:52.897994199Z app[86c2746d] fra [info] Mounting /dev/vdc at /root/persistent_files
2021-08-28T12:09:52.916053317Z app[86c2746d] fra [info] Running: `bash /root/code/` as root
2021-08-28T12:09:52.940046811Z app[86c2746d] fra [info] 2021/08/28 12:09:52 listening on [fdaa:0:20a5:a7b:67:0:2072:2]:22 (DNS: [fdaa::3]:53)
2021-08-28T12:06:54.366244024Z app[86c2746d] fra [info] Operations to perform:
2021-08-28T12:06:54.367370278Z app[86c2746d] fra [info]   Apply all migrations: admin, analytics, auth, contenttypes, counterpage, landingpage, sessions
2021-08-28T12:06:54.376471007Z app[86c2746d] fra [info] Running migrations:
2021-08-28T12:06:54.376787395Z app[86c2746d] fra [info]   No migrations to apply.
2021-08-28T12:06:56.969626748Z app[86c2746d] fra [info] [2021-08-28 12:06:56 +0000] [530] [INFO] Starting gunicorn 20.1.0
2021-08-28T12:06:56.971087050Z app[86c2746d] fra [info] [2021-08-28 12:06:56 +0000] [530] [INFO] Listening at: (530)
2021-08-28T12:06:56.971946505Z app[86c2746d] fra [info] [2021-08-28 12:06:56 +0000] [530] [INFO] Using worker: sync
2021-08-28T12:06:56.977224805Z app[86c2746d] fra [info] [2021-08-28 12:06:56 +0000] [532] [INFO] Booting worker with pid: 532
2021-08-28T12:06:56.994192069Z app[86c2746d] fra [info] [2021-08-28 12:06:56 +0000] [533] [INFO] Booting worker with pid: 533
2021-08-28T12:06:57.040990254Z app[86c2746d] fra [info] [2021-08-28 12:06:57 +0000] [534] [INFO] Booting worker with pid: 534

As you can see, the Gunicorn workers are starting & listening on port 8000, but the app is not reachable from outside.

I have rebuilt and run the Docker image locally, and there the app is reachable.

Is this an error on my side, or are there ongoing issues with apps using FRA volumes? As you can imagine, this is very annoying, and I feel like there is not much more I can do :frowning_face:

Thanks in advance for your help!

I just took a look. It appears your app is unhealthy because there's a TCP check on port 8080, but you're listening on port 8000 (and exposing port 8000). The check seems to be on the wrong port. Maybe it's set explicitly in the checks for your service? If you remove it, the check should use port 8000 automatically.

What does your fly.toml [[services]] section look like?


Hey Jerome, thanks for the fast reply and suggestion! It worked, thanks a lot! Removing the wrong explicit port number from the [[services.tcp_checks]] section fixed it for me.
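For anyone hitting the same symptom, here is a rough sketch of a fly.toml [[services]] section in which the TCP check inherits the service's port instead of overriding it. It assumes the fly.toml layout Fly generated around this time; the actual file from this thread isn't shown, so the 80/443 port mappings and the check timings are placeholder assumptions, and the commented-out `port` line is a hypothetical stand-in for the explicit override that was removed.

```toml
[[services]]
  internal_port = 8000   # the port Gunicorn listens on, per the logs above
  protocol = "tcp"

  # Placeholder public port mappings (assumed, not taken from this thread)
  [[services.ports]]
    handlers = ["http"]
    port = 80

  [[services.ports]]
    handlers = ["tls", "http"]
    port = 443

  [[services.tcp_checks]]
    grace_period = "1s"  # placeholder timing values
    interval = "15s"
    timeout = "2s"
    # port = 8080        # hypothetical stale override: nothing listens on 8080,
    #                    # so the check fails; with no explicit port, the check
    #                    # should target internal_port (8000) instead
```

A TCP check only confirms that something accepts connections on the probed port, so it has to match the port the app actually binds to; leaving the port unset keeps the two from drifting apart.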

This is really strange, though: I haven't touched my fly.toml in a long time (the last change was on May 18), and the same fly.toml definitely worked before (e.g. on August 11, the last time I deployed my app). Maybe your check somehow did not pick up this explicit port until recently?

Anyway, it works now, so no worries :grinning_face_with_smiling_eyes: Again, thanks a lot for your help! :heart_eyes:

@markusdosch You are right! The port in the check wasn't being applied until recently; I think we fixed that bug a week ago.
