fly.io is offline - cannot proxy http request

mikeglazer · February 9, 2023, 4:14pm

Seeing these in our logs:

{"env":"production","event":{"provider":"proxy"},"fly":{"app":{"instance":"1c64e48a","name":"qw-agent-api-prod"},"region":"iad"},"host":"2035","level":"warn","message":"Could not proxy HTTP request. Retrying in 1000 ms (attempt 30)"}

Seems to be same symptoms as:

lpil · February 9, 2023, 4:20pm

This just hit me too, and on launch day, right after I sent out the invite emails. Awful luck.

Although the error I’m seeing is error.message="could not find an instance to route to"

YungTarps · February 9, 2023, 4:22pm

Yeah… seems like fly.io → fly.io requests work (my environment is up, frontend can access the API). Just that anyone on the outside can’t connect.

mcfadyeni · February 9, 2023, 4:39pm

Dash is also pretty laggy (timeouts and errors), perhaps related, since I’m guessing the dash is on Fly itself? Which region is everyone having trouble in (YYZ for me)?

lpil · February 9, 2023, 4:42pm

I’m in lhr, so possibly it isn’t a regional problem.

Mine has just come back online, phew!

edit: Spoke too soon, I’m back offline.

catflydotio · February 9, 2023, 4:48pm

Looks like Fly.io Status - Delayed state causing proxying / deployment issues

jmill · February 9, 2023, 6:26pm

Seeing the same as of a few minutes ago, our app locations are only in dfw:

2023-02-09T18:20:56.906 proxy[1c04938d] iad [warn] Could not proxy HTTP request. Retrying in 1000 ms (attempt 30)
2023-02-09T18:20:57.927 proxy[0967c4e8] iad [warn] Could not proxy HTTP request. Retrying in 1000 ms (attempt 40)
2023-02-09T18:20:59.077 proxy[0967c4e8] chi [warn] Could not proxy HTTP request. Retrying in 1000 ms (attempt 30)
2023-02-09T18:20:59.502 proxy[0967c4e8] lga [warn] Could not proxy HTTP request. Retrying in 947 ms (attempt 60)
2023-02-09T18:21:00.137 proxy[1c04938d] lga [warn] Could not proxy HTTP request. Retrying in 1000 ms (attempt 60)
2023-02-09T18:21:01.081 proxy[0967c4e8] iad [warn] Could not proxy HTTP request. Retrying in 1000 ms (attempt 40)
2023-02-09T18:21:02.122 proxy[0967c4e8] iad [warn] Could not proxy HTTP request. Retrying in 1000 ms (attempt 40)
2023-02-09T18:21:02.548 proxy[0967c4e8] iad [warn] Could not proxy HTTP request. Retrying in 1000 ms (attempt 60)

Edit: now resolved.

I’m surprised we did not receive a status page notification from status.flyio.net about this, as we are subscribed to all Proxy-related issues.

Andrew2 · February 9, 2023, 7:25pm

This has been happening on and off all day for me as well. Usually right after a deployment. It happened the other week as well. Pretty frustrating.

boiserunner · February 9, 2023, 7:47pm

Here too. What’s interesting is that it says it cannot proxy HTTP in a region I’m not actually in. This is pretty disappointing.

marcelodasilva · February 11, 2023, 12:10pm

Yep, confirmed it is still happening to me, even though they say this was fixed (https://status.flyio.net/).

EDIT: it seems to be an intermittent issue… now I am able to access my website.

jerome · February 11, 2023, 12:48pm

This particular error can happen for a variety of reasons. It may not be related to the now-resolved incident.

We’ll need more details if you want us to troubleshoot your issue.

Our log presentation doesn’t do a great job of showing why we couldn’t proxy an HTTP request. That’s something we need to fix.

marcelodasilva · February 11, 2023, 12:49pm

Understood, @jerome. Thanks for the explanation. Right now I am able to access it. If it keeps happening, I’ll send more details here!

getkey · February 11, 2023, 1:20pm

I am also affected, as I mentioned here yesterday. It happened again today at 11:51 UTC.

Here is an excerpt from my logs.

2023-02-11T13:06:30.197 proxy[972c676b] ams [error] timed out while connecting to instance
2023-02-11T13:06:31.140 proxy[972c676b] ams [error] timed out while connecting to instance
2023-02-11T13:06:31.296 proxy[972c676b] ams [error] timed out while connecting to instance
2023-02-11T13:06:33.302 proxy[972c676b] ams [error] timed out while connecting to instance
2023-02-11T13:06:33.393 proxy[972c676b] ams [error] timed out while connecting to instance
2023-02-11T13:06:34.147 proxy[972c676b] ams [error] timed out while connecting to instance
2023-02-11T13:06:36.331 proxy[972c676b] ams [error] timed out while connecting to instance
2023-02-11T13:06:37.154 proxy[972c676b] ams [error] timed out while connecting to instance
2023-02-11T13:06:37.341 proxy[972c676b] arn [warn] Could not proxy HTTP request. Retrying in 1000 ms (attempt 10)
2023-02-11T13:06:39.359 proxy[972c676b] ams [error] timed out while connecting to instance
2023-02-11T13:06:40.163 proxy[972c676b] ams [error] timed out while connecting to instance
2023-02-11T13:06:42.386 proxy[972c676b] ams [error] timed out while connecting to instance
2023-02-11T13:06:43.202 proxy[972c676b] ams [error] timed out while connecting to instance
2023-02-11T13:06:44.205 proxy[972c676b] ams [warn] Could not proxy HTTP request. Retrying in 1000 ms (attempt 40)

The issue goes away temporarily after I flyctl vm restart my instances.

jerome · February 11, 2023, 1:35pm

A timeout connecting to your instance means your app is not accepting our connections within 2 seconds. That’s usually a symptom of something blocking your accept loop.

It largely depends on the kind of app you’re running, but that error is always an indication that something is wrong with the app.
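As a rough illustration of how an accept loop can end up starved, here is a minimal Python sketch. It assumes an asyncio-based app, which may not match yours. The names `max_loop_gap`, `blocking_handler`, `cooperative_handler` and `measure` are hypothetical. It only shows that a single blocking call inside an async task stalls everything else on the same event loop, which would include whatever accepts new connections.

```python
import asyncio
import time


async def max_loop_gap(handler, duration=0.3, tick=0.01):
    """Run `handler` next to a ticker and return the longest gap (in seconds)
    between two ticker wake-ups. A large gap means the event loop was stalled."""
    worst = 0.0
    task = asyncio.create_task(handler())
    last = time.monotonic()
    deadline = last + duration
    while time.monotonic() < deadline:
        await asyncio.sleep(tick)
        now = time.monotonic()
        worst = max(worst, now - last)
        last = now
    await task
    return worst


async def blocking_handler():
    # Hypothetical rate limiter that "hangs" a client with a blocking sleep.
    # Nothing else on this event loop can run until it returns.
    time.sleep(0.2)


async def cooperative_handler():
    # Same delay, but control goes back to the event loop while waiting.
    await asyncio.sleep(0.2)


def measure(handler):
    return asyncio.run(max_loop_gap(handler))
```

Calling `measure(blocking_handler)` should report a gap of about 0.2 s, while `measure(cooperative_handler)` should stay close to the 10 ms tick. In a real app the same kind of stall can come from synchronous I/O, CPU-heavy work, or a lock held across an await.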

getkey · February 11, 2023, 4:18pm

That makes sense. This app was getting DOSed, so I added a rate-limit that hangs connections. I’m surprised it would block the accept loop, because the hanging happens in async tasks, but I might have overlooked something. I know where to look now, thanks!

Up to how many retries will the proxy do? And is the delay between attempts always 1 second? It wouldn’t be good if it amplified attacks.

jerome · February 11, 2023, 4:47pm

Ah glad you figured it out!

Sometimes it just puts enough stress on the app that there’s lag in processing tasks, resulting in the accept loop taking too much time to accept.

It will retry up to 90 times, exponentially backing off up to 1s with some jittering. Looks like the jitter might not be working right when the max of our backoff iterator has been reached.

It starts at 20ms, so it reaches 1s pretty fast.
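To get a feel for those numbers, here is a small Python sketch of such a schedule. Only the 20 ms start, the 1 s cap and the 90 attempts come from the description above. The doubling factor, the shape of the jitter and the name `backoff_delays` are assumptions made for illustration, not the proxy's actual implementation.

```python
import random


def backoff_delays(attempts=90, base_ms=20.0, cap_ms=1000.0,
                   factor=2.0, jitter=0.0, rng=None):
    """Return the delay in milliseconds before each retry attempt.

    `factor` (doubling) and `jitter` (fraction of the delay that may be
    shaved off at random) are assumed values, not taken from the proxy.
    """
    rng = rng or random.Random()
    delays = []
    delay = base_ms
    for _ in range(attempts):
        # Shorten the nominal delay by up to `jitter` * 100 percent.
        delays.append(delay * (1.0 - jitter * rng.random()))
        # Grow exponentially, but never beyond the cap.
        delay = min(cap_ms, delay * factor)
    return delays
```

Under these assumptions and without jitter, the delay reaches the 1 s cap on the seventh attempt (20, 40, 80, 160, 320, 640, 1000 ms), and 90 attempts add up to about 85 seconds of waiting. The "Retrying in 947 ms" entry in the logs earlier in the thread is at least consistent with a jitter that shortens the capped delay.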

One thing to understand is that the backoff usually happens at the edge, not on the worker node hosting your app. We only retry some connection errors on the workers when it’s safe to do so. Connection timeouts are not part of that. Usually connection timeouts will trigger a “passive health check failure” in our proxy, preventing us from sending more connections to a specific instance (unless we’ve exhausted all other possible instances, so if you only have 1, we’ll keep retrying).
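The fallback described in that last sentence could be sketched roughly as follows in Python. This is not the proxy's code: `pick_instance` and the `failed` set are hypothetical names, and taking the first candidate is a placeholder for whatever load-balancing rule is really used.

```python
def pick_instance(instances, failed):
    """Pick an instance to send a connection to.

    `instances` is the ordered list of instance IDs for the app and `failed`
    is the set of IDs that recently failed a passive health check.
    """
    # Prefer instances that have not failed recently.
    healthy = [i for i in instances if i not in failed]
    # If every instance has failed, fall back to the full list and keep trying.
    candidates = healthy or list(instances)
    # Assumption: take the first candidate; a real balancer would be smarter.
    return candidates[0] if candidates else None
```

With two instances, `pick_instance(["a", "b"], {"a"})` returns `"b"`. With a single instance, `pick_instance(["a"], {"a"})` still returns `"a"`, which matches the "if you only have 1, we’ll keep retrying" case.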

ignoramous · February 11, 2023, 5:13pm

Is this also true for Apps v2 / Machines?

Just curious: what are a few examples of errors that are safe to retry?

jerome · February 11, 2023, 5:27pm

Yes, from the proxy’s perspective apps v1, machines and apps v2 are just “services”.

Looking into it more: I was wrong. We don’t retry any connection errors. We only retry some machine API operation calls.

We used to retry some connection errors as long as they did not succeed initially (we did not retry any connection that was established properly and then failed due to a broken pipe or connection reset, for example) and only if they didn’t indicate the instance would never recover (connection refused => instance won’t recover from that).

That said, for HTTP requests, we do send back some useful information about the retryability of a request (either to the same instance or another instance w/ a matching routing config for the same app). The gist of it is: an unsuccessful initial connection can be retried. Other cases unrelated to connections include stuff like “instance not found” or any other cases where we didn’t even try to connect.

pyk · February 13, 2023, 11:49pm

This still happens to my app even though it is responding to the health check.

error.message="could not find an instance to route to" 2023-02-13T23:46:08Z proxy[bf35d603] sin [warn]request.method="POST" request.url="/v1" request.id="01GS6JQNG7B0YM8BFG396Y1DXP-sin"
2023-02-13T23:46:16Z app[bf35d603] sin [info]::ffff:172.19.68.137 - - [13/Feb/2023:23:46:16 +0000] "GET /health HTTP/1.1" 200 2 "-" "Consul Health Check"
2023-02-13T23:46:26Z app[bf35d603] sin [info]::ffff:172.19.68.137 - - [13/Feb/2023:23:46:26 +0000] "GET /health HTTP/1.1" 200 2 "-" "Consul Health Check"
2023-02-13T23:46:36Z app[bf35d603] sin [info]::ffff:172.19.68.137 - - [13/Feb/2023:23:46:36 +0000] "GET /health HTTP/1.1" 200 2 "-" "Consul Health Check"
2023-02-13T23:46:46Z app[bf35d603] sin [info]::ffff:172.19.68.137 - - [13/Feb/2023:23:46:46 +0000] "GET /health HTTP/1.1" 200 2 "-" "Consul Health Check"
2023-02-13T23:46:56Z app[bf35d603] sin [info]::ffff:172.19.68.137 - - [13/Feb/2023:23:46:56 +0000] "GET /health HTTP/1.1" 200 2 "-" "Consul Health Check"
2023-02-13T23:47:06Z app[bf35d603] sin [info]::ffff:172.19.68.137 - - [13/Feb/2023:23:47:06 +0000] "GET /health HTTP/1.1" 200 2 "-" "Consul Health Check"
error.message="could not find an instance to route to" 2023-02-13T23:47:08Z proxy[bf35d603] sin [warn]request.method="POST" request.url="/v1" request.id="01GS6JSGKHEYEXTEP35NDQD457-sin"
2023-02-13T23:47:16Z app[bf35d603] sin [info]::ffff:172.19.68.137 - - [13/Feb/2023:23:47:16 +0000] "GET /health HTTP/1.1" 200 2 "-" "Consul Health Check"
2023-02-13T23:47:26Z app[bf35d603] sin [info]::ffff:172.19.68.137 - - [13/Feb/2023:23:47:26 +0000] "GET /health HTTP/1.1" 200 2 "-" "Consul Health Check"
2023-02-13T23:47:36Z app[bf35d603] sin [info]::ffff:172.19.68.137 - - [13/Feb/2023:23:47:36 +0000] "GET /health HTTP/1.1" 200 2 "-" "Consul Health Check"
2023-02-13T23:47:46Z app[bf35d603] sin [info]::ffff:172.19.68.137 - - [13/Feb/2023:23:47:46 +0000] "GET /health HTTP/1.1" 200 2 "-" "Consul Health Check"
2023-02-13T23:47:57Z app[bf35d603] sin [info]::ffff:172.19.68.137 - - [13/Feb/2023:23:47:57 +0000] "GET /health HTTP/1.1" 200 2 "-" "Consul Health Check"
2023-02-13T23:48:07Z app[bf35d603] sin [info]::ffff:172.19.68.137 - - [13/Feb/2023:23:48:07 +0000] "GET /health HTTP/1.1" 200 2 "-" "Consul Health Check"
error.message="could not find an instance to route to" 2023-02-13T23:48:08Z proxy[bf35d603] sin [warn]request.method="POST" request.url="/v1" request.id="01GS6JVB3N92E8Y83ZEKVCKNAW-sin"

Is there any way to fix this? Would adding more regions solve this problem?

thank you!

Edit: I scaled my app using fly scale count 2 -a APP, and that solved the error.

tello · February 21, 2023, 9:58pm

This is happening to me today. @jerome Is there something we can do on our end to hot fix it?
