Timeouts

I have been getting a lot of timeouts since yesterday.

The only thing I did was deploy a new release to one of my apps around the time the timeouts started. But that is not the app that is timing out; the app that is timing out has not changed recently.

I have an nginx container running in front of my applications, and I restarted it a couple of times to make sure that it’s not trying to contact outdated instances. When I SSH into the instance and curl my app instances, I don’t get any timeouts.

The logs don’t tell me anything useful, only that it happens across all 3 regions where nginx is deployed. They show 499 errors, which indicate that the client closed the connection before nginx was able to serve a response from the application servers:

2022-10-11T04:56:52Z app[4f19a8e1] iad [info]45.33.100.21 - - [11/Oct/2022:04:56:52 +0000] "GET https://example.org/api/v1/health/ping HTTP/2.0" 499 0 "-" "Better Uptime Bot Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36" 29.505
2022-10-11T04:56:52Z app[3dd9e2c4] ewr [info]23.88.41.31 - - [11/Oct/2022:04:56:52 +0000] "GET https://example.org/api/v1/health/database HTTP/2.0" 200 36 "-" "Better Uptime Bot Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36" 0.050
2022-10-11T04:56:53Z app[4f19a8e1] iad [info]172.105.206.169 - - [11/Oct/2022:04:56:53 +0000] "GET https://example.org/api/v1/health/database HTTP/2.0" 200 36 "-" "Better Uptime Bot Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36" 0.081
2022-10-11T04:56:53Z app[4f19a8e1] iad [info]45.33.100.21 - - [11/Oct/2022:04:56:53 +0000] "GET https://example.org/api/v1/health/database HTTP/2.0" 200 36 "-" "Better Uptime Bot Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36" 0.059
2022-10-11T04:56:55Z app[3dd9e2c4] ewr [info]172.105.173.108 - - [11/Oct/2022:04:56:55 +0000] "GET https://example.org/api/v1/health/database HTTP/2.0" 200 36 "-" "Better Uptime Bot Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36" 0.044
2022-10-11T04:56:55Z app[22468dbb] yyz [info]45.79.47.102 - - [11/Oct/2022:04:56:55 +0000] "GET https://example.org/api/v1/health/ping HTTP/2.0" 200 36 "-" "Better Uptime Bot Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36" 0.043
2022-10-11T04:56:58Z app[3dd9e2c4] ewr [info]168.119.96.54 - - [11/Oct/2022:04:56:58 +0000] "GET https://example.org/api/v1/health/ping HTTP/2.0" 200 36 "-" "Better Uptime Bot Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36" 0.046
2022-10-11T04:56:59Z app[4f19a8e1] iad [info]45.33.100.21 - - [11/Oct/2022:04:56:59 +0000] "GET https://example.org/api/v1/health/ping HTTP/2.0" 200 36 "-" "Better Uptime Bot Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36" 0.084
2022-10-11T04:57:00Z app[4f19a8e1] iad [info]74.207.228.249 - - [11/Oct/2022:04:57:00 +0000] "GET https://example.org/api/v1/health/ping HTTP/2.0" 200 36 "-" "Better Uptime Bot Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36" 0.057
2022-10-11T04:57:05Z app[4f19a8e1] iad [info]172.104.109.161 - - [11/Oct/2022:04:57:05 +0000] "GET https://example.org/api/v1/health/redis HTTP/2.0" 200 36 "-" "Better Uptime Bot Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36" 0.116
2022-10-11T04:57:06Z app[3dd9e2c4] ewr [info]168.119.96.203 - - [11/Oct/2022:04:57:06 +0000] "GET https://example.org/api/v1/health/redis HTTP/2.0" 200 36 "-" "Better Uptime Bot Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36" 0.087
2022-10-11T04:57:06Z app[22468dbb] yyz [info]45.56.78.139 - - [11/Oct/2022:04:57:06 +0000] "GET https://example.org/api/v1/health/redis HTTP/2.0" 200 36 "-" "Better Uptime Bot Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36" 0.076
2022-10-11T04:57:07Z app[3dd9e2c4] ewr [info]172.105.190.118 - - [11/Oct/2022:04:57:07 +0000] "GET https://example.org/api/v1/health/redis HTTP/2.0" 200 36 "-" "Better Uptime Bot Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36" 0.085
2022-10-11T04:57:27Z app[3dd9e2c4] ewr [info]168.119.96.54 - - [11/Oct/2022:04:57:27 +0000] "GET https://example.org/api/v1/health/database HTTP/2.0" 200 36 "-" "Better Uptime Bot Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36" 0.044
2022-10-11T04:57:27Z app[3dd9e2c4] ewr [info]23.88.41.31 - - [11/Oct/2022:04:57:27 +0000] "GET https://example.org/api/v1/health/ping HTTP/2.0" 499 0 "-" "Better Uptime Bot Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36" 29.957
2022-10-11T04:57:39Z app[4f19a8e1] iad [info]139.162.109.252 - - [11/Oct/2022:04:57:39 +0000] "GET https://example.org/api/v1/health/redis HTTP/2.0" 200 36 "-" "Better Uptime Bot Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36" 0.100
2022-10-11T04:57:46Z app[3dd9e2c4] ewr [info]172.105.190.118 - - [11/Oct/2022:04:57:46 +0000] "GET https://example.org/api/v1/health/redis HTTP/2.0" 200 36 "-" "Better Uptime Bot Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36" 0.082
2022-10-11T04:57:50Z app[3dd9e2c4] ewr [info]168.119.96.54 - - [11/Oct/2022:04:57:50 +0000] "GET https://example.org/api/v1/health/database HTTP/2.0" 200 36 "-" "Better Uptime Bot Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36" 0.044
2022-10-11T04:57:54Z app[4f19a8e1] iad [info]172.105.206.169 - - [11/Oct/2022:04:57:54 +0000] "GET https://example.org/api/v1/health/database HTTP/2.0" 200 36 "-" "Better Uptime Bot Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36" 0.058
2022-10-11T04:57:54Z app[22468dbb] yyz [info]45.79.47.102 - - [11/Oct/2022:04:57:54 +0000] "GET https://example.org/api/v1/health/database HTTP/2.0" 200 36 "-" "Better Uptime Bot Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36" 0.052
2022-10-11T04:57:56Z app[3dd9e2c4] ewr [info]168.119.96.54 - - [11/Oct/2022:04:57:56 +0000] "GET https://example.org/api/v1/health/ping HTTP/2.0" 499 0 "-" "Better Uptime Bot Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36" 29.690
2022-10-11T04:58:06Z app[4f19a8e1] iad [info]45.33.100.21 - - [11/Oct/2022:04:58:06 +0000] "GET https://example.org/api/v1/health/database HTTP/2.0" 499 0 "-" "Better Uptime Bot Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36" 29.595
2022-10-11T04:58:07Z app[3dd9e2c4] ewr [info]172.105.169.250 - - [11/Oct/2022:04:58:07 +0000] "GET https://example.org/api/v1/health/database HTTP/2.0" 499 0 "-" "Better Uptime Bot Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36" 28.358
2022-10-11T04:58:07Z app[4f19a8e1] iad [info]172.105.206.169 - - [11/Oct/2022:04:58:07 +0000] "GET https://example.org/api/v1/health/database HTTP/2.0" 499 0 "-" "Better Uptime Bot Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36" 28.618
2022-10-11T04:58:15Z app[3dd9e2c4] ewr [info]23.88.41.31 - - [11/Oct/2022:04:58:15 +0000] "GET https://example.org/api/v1/health/redis HTTP/2.0" 499 0 "-" "Better Uptime Bot Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36" 29.692
2022-10-11T04:58:16Z app[4f19a8e1] iad [info]45.33.100.21 - - [11/Oct/2022:04:58:16 +0000] "GET https://example.org/api/v1/health/redis HTTP/2.0" 499 0 "-" "Better Uptime Bot Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36" 29.803
2022-10-11T04:58:21Z app[3dd9e2c4] ewr [info]23.88.41.31 - - [11/Oct/2022:04:58:21 +0000] "GET https://example.org/api/v1/health/database HTTP/2.0" 200 36 "-" "Better Uptime Bot Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36" 0.043
2022-10-11T04:58:22Z app[4f19a8e1] iad [info]45.33.100.21 - - [11/Oct/2022:04:58:22 +0000] "GET https://example.org/api/v1/health/ping HTTP/2.0" 499 0 "-" "Better Uptime Bot Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36" 29.524
2022-10-11T04:58:23Z app[3dd9e2c4] ewr [info]172.105.169.250 - - [11/Oct/2022:04:58:23 +0000] "GET https://example.org/api/v1/health/database HTTP/2.0" 499 0 "-" "Better Uptime Bot Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36" 29.292
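
To make the pattern in these lines easier to see, here is a minimal Python sketch that counts 499 responses per region and reports the longest aborted request. The regular expression is an assumption based only on the line layout shown above, the sample holds four of those lines verbatim, and `summarize` and `SAMPLE` are illustrative names rather than part of any tool.

```python
import re
from collections import Counter, defaultdict

# Four lines copied verbatim from the excerpt above.
SAMPLE = """\
2022-10-11T04:56:52Z app[4f19a8e1] iad [info]45.33.100.21 - - [11/Oct/2022:04:56:52 +0000] "GET https://example.org/api/v1/health/ping HTTP/2.0" 499 0 "-" "Better Uptime Bot Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36" 29.505
2022-10-11T04:56:52Z app[3dd9e2c4] ewr [info]23.88.41.31 - - [11/Oct/2022:04:56:52 +0000] "GET https://example.org/api/v1/health/database HTTP/2.0" 200 36 "-" "Better Uptime Bot Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36" 0.050
2022-10-11T04:56:55Z app[22468dbb] yyz [info]45.79.47.102 - - [11/Oct/2022:04:56:55 +0000] "GET https://example.org/api/v1/health/ping HTTP/2.0" 200 36 "-" "Better Uptime Bot Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36" 0.043
2022-10-11T04:57:27Z app[3dd9e2c4] ewr [info]23.88.41.31 - - [11/Oct/2022:04:57:27 +0000] "GET https://example.org/api/v1/health/ping HTTP/2.0" 499 0 "-" "Better Uptime Bot Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36" 29.957
"""

# Assumed layout: timestamp, app[instance], region, [info] followed directly by
# the nginx access-log line, which ends with the request time in seconds.
LINE_RE = re.compile(
    r'^\S+ app\[(?P<instance>\w+)\] (?P<region>\w+) \[info\]'
    r'(?P<client>\S+) - - \[[^\]]+\] '
    r'"(?P<method>\S+) (?P<url>\S+) [^"]+" '
    r'(?P<status>\d{3}) (?P<size>\d+) '
    r'"[^"]*" "[^"]*" (?P<seconds>[\d.]+)$'
)


def summarize(lines):
    """Map region -> (number of 499s, total requests, longest 499 in seconds)."""
    totals = Counter()
    aborted = defaultdict(list)
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip anything that does not look like an access-log line
        region = match.group("region")
        totals[region] += 1
        if match.group("status") == "499":
            aborted[region].append(float(match.group("seconds")))
    return {
        region: (len(aborted[region]), totals[region], max(aborted[region], default=0.0))
        for region in sorted(totals)
    }


if __name__ == "__main__":
    for region, (count_499, total, slowest) in summarize(SAMPLE.splitlines()).items():
        print(f"{region}: {count_499}/{total} requests ended in 499, longest {slowest:.3f}s")
```

In the excerpt, every 499 took roughly 28 to 30 seconds while every 200 finished in well under a second. That would be consistent with the monitoring client giving up after about 30 seconds while nginx was still waiting on an upstream.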

I don’t see any recently failed instances of the app causing the timeouts. The only odd thing is that it’s no longer running an instance in iad but has two in yyz instead.

Instances
ID      	PROCESS	VERSION	REGION	DESIRED	STATUS 	HEALTH CHECKS     	RESTARTS	CREATED
06a622a0	app    	261    	ewr   	run    	running	1 total, 1 passing	0       	2022-10-09T17:19:35Z
40aa2108	app    	261    	yyz   	run    	running	1 total, 1 passing	0       	2022-10-03T04:53:43Z
33177a4f	app    	261    	iad   	stop   	failed 	1 total, 1 passing	0       	2022-09-30T10:41:53Z
65f2c680	app    	261    	yyz   	run    	running	1 total, 1 passing	0       	2022-09-30T10:40:55Z

There are also no logs indicating that the app instances are unhealthy.

Same on the nginx side:

Instances
ID      	PROCESS	VERSION	REGION	DESIRED	STATUS  	HEALTH CHECKS     	RESTARTS	CREATED
22468dbb	app    	32 ⇡   	yyz   	run    	running 	1 total, 1 passing	0       	32m42s ago
4f19a8e1	app    	32 ⇡   	iad   	run    	running 	1 total, 1 passing	0       	33m4s ago
3dd9e2c4	app    	32 ⇡   	ewr   	run    	running 	1 total, 1 passing	0       	33m4s ago
700f9f62	app    	31     	yyz   	stop   	complete	1 total, 1 passing	0       	22h7m ago
0da81b6b	app    	31     	iad   	stop   	complete	1 total, 1 passing	0       	22h7m ago
37329f29	app    	31     	ewr   	stop   	complete	1 total, 1 passing	0       	22h7m ago

I scaled back to 1 instance for all applications and that resolved the issue.

Now I have scaled back up to 3 instances and it’s happening again.

The only application that I can run with 3 or more instances seems to be the nginx proxy. But as soon as I scale any of the other applications past 1 instance, I get timeouts.

Not sure why restarting the applications would not reset the connections, but I don’t know enough about how internal networking works at Fly to tell what’s going on. I also don’t understand why this first started to happen when I was deploying an unrelated app.

Please help me troubleshoot. I will run my applications with a single instance for now, as that seems to remedy the problem, but that’s obviously not a long-term solution, especially since with a single instance I was seeing timeouts and connection issues whenever I did a new deployment. For reference, that’s what @kurt recommended here: Internal network DNS propagation and new deployments - #3 by jascha

Hi Jascha, can you share a screenshot of your Nginx config?

# Multi Tenant App
server {

    listen 443 ssl http2;
    listen [::]:443 ssl http2;

    ssl_certificate /etc/ssl/SSL_CERT.pem;
    ssl_certificate_key /etc/ssl/SSL_KEY.key;

    ssl_client_certificate /etc/nginx/certs/cloudflare.crt;
    ssl_verify_client on;

    # Catch all @see https://nginx.org/en/docs/http/server_names.html#miscellaneous_names
    server_name _;

    #Fly.io private networking DNS service @see https://community.fly.io/t/internal-network-dns-propagation-and-new-deployments/3672/3
    resolver [fdaa::3] valid=5s;
    resolver_timeout 5s;

    location / {
        # @see https://community.fly.io/t/internal-network-dns-propagation-and-new-deployments/3672/3
        set $backend "${APP_BACKEND}";
        proxy_pass $backend;
        proxy_set_header Host $host;
    }

}

# Marketing Website
server {

    listen 443 ssl http2;
    listen [::]:443 ssl http2;

    ssl_certificate /etc/ssl/SSL_CERT.pem;
    ssl_certificate_key /etc/ssl/SSL_KEY.key;

    ssl_client_certificate /etc/nginx/certs/cloudflare.crt;
    ssl_verify_client on;

    # Specific host name (redacted), not a catch-all
    server_name <redacted>;

    #Fly.io private networking DNS service @see https://community.fly.io/t/internal-network-dns-propagation-and-new-deployments/3672/3
    resolver [fdaa::3] valid=5s;
    resolver_timeout 5s;

    location / {
        # @see https://community.fly.io/t/internal-network-dns-propagation-and-new-deployments/3672/3
        set $backend "${WEBSITE_BACKEND}";
        proxy_pass $backend;
        proxy_set_header Host $host;
    }

}

# Onboarding and free Dashboard
server {

    listen 443 ssl http2;
    listen [::]:443 ssl http2;

    ssl_certificate /etc/ssl/SSL_CERT.pem;
    ssl_certificate_key /etc/ssl/SSL_KEY.key;

    ssl_client_certificate /etc/nginx/certs/cloudflare.crt;
    ssl_verify_client on;

    # Specific host name (redacted), not a catch-all
    server_name <redacted>;

    #Fly.io private networking DNS service @see https://community.fly.io/t/internal-network-dns-propagation-and-new-deployments/3672/3
    resolver [fdaa::3] valid=5s;
    resolver_timeout 5s;

    location / {
        # @see https://community.fly.io/t/internal-network-dns-propagation-and-new-deployments/3672/3
        set $backend "${ONBOARDING_BACKEND}";
        proxy_pass $backend;
        proxy_set_header Host $host;
    }

}

The environment variables get replaced at runtime with the internal DNS names, e.g. http://fly-app-name.internal:8080

I think this may be failing when it hits a stale internal DNS entry. Our internal DNS occasionally returns bad IPs; we had a server in Toronto fail on Saturday that apparently left quite a few in the rotation.
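
One way to check for a stale entry is to resolve the internal name to every address it returns and try a TCP connection to each one. The following is a minimal Python sketch of that idea, assuming Python is available on a machine inside the private network. The function names `resolve_all`, `probe`, and `check_backend` and the script name in the usage message are illustrative, and the hostname and port come from the command line.

```python
import socket
import sys


def resolve_all(hostname, port):
    """Return every distinct address the resolver hands back for hostname."""
    infos = socket.getaddrinfo(hostname, port, type=socket.SOCK_STREAM)
    return sorted({info[4][0] for info in infos})


def probe(address, port, timeout=2.0):
    """Return True if a TCP connection to address:port opens within timeout."""
    try:
        with socket.create_connection((address, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return False


def check_backend(hostname, port, timeout=2.0):
    """Map each resolved address to whether it accepted a connection."""
    return {
        address: probe(address, port, timeout)
        for address in resolve_all(hostname, port)
    }


if __name__ == "__main__":
    if len(sys.argv) == 3:
        results = check_backend(sys.argv[1], int(sys.argv[2]))
        for address, ok in results.items():
            print(f"{address}: {'reachable' if ok else 'NO RESPONSE'}")
    else:
        print("usage: python probe_backends.py <hostname> <port>")
```

An address that shows up in the results but never accepts a connection would be a candidate for the kind of stale entry described above. A successful TCP connect only shows that something is listening, not that the app behind it is healthy.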

I don’t think this nginx config will work all that well. It’s not load balancing across IPs properly, and there’s a very good chance that it’ll send requests to faraway upstreams.

What are you actually trying to do here? Just terminate TLS for Cloudflare? You might have more success with their Tunnel. I think you could probably configure NGINX to handle multiple internal IPs well and avoid unhealthy ones. You may be able to adapt the config here to do that: GitHub - fly-apps/nginx-cluster: A horizontally scalable NGINX caching cluster
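
As a rough illustration of what handling multiple internal IPs and avoiding unhealthy ones means, here is a small Python model of round-robin selection that skips an address for a while after a failed attempt. It is not how NGINX or the nginx-cluster repo implements this. The `Balancer` class, its `fail_timeout` parameter, and the placeholder upstream names are all assumptions made for the sketch.

```python
import time


class Balancer:
    """Round-robin over addresses, skipping ones that failed recently."""

    def __init__(self, addresses, fail_timeout=10.0, clock=time.monotonic):
        self.addresses = list(addresses)
        self.fail_timeout = fail_timeout  # seconds an address stays sidelined
        self.clock = clock                # injectable so the logic can be tested
        self.down_until = {}              # address -> time it may be tried again
        self.cursor = 0

    def _advance(self):
        address = self.addresses[self.cursor % len(self.addresses)]
        self.cursor += 1
        return address

    def pick(self):
        """Return the next address that is not currently marked down."""
        now = self.clock()
        for _ in range(len(self.addresses)):
            address = self._advance()
            if self.down_until.get(address, 0.0) <= now:
                return address
        # Every address is marked down: try one anyway rather than fail outright.
        return self._advance()

    def report(self, address, ok):
        """Record the outcome of a request sent to address."""
        if ok:
            self.down_until.pop(address, None)
        else:
            self.down_until[address] = self.clock() + self.fail_timeout


if __name__ == "__main__":
    # Placeholder names standing in for the addresses behind one internal DNS name.
    balancer = Balancer(["upstream-a", "upstream-b", "upstream-c"])
    balancer.report("upstream-b", ok=False)  # pretend one address timed out
    print([balancer.pick() for _ in range(4)])
```

The point of the model is that one bad address only costs the requests that discover it is bad, instead of a fixed share of all traffic for as long as it stays in the DNS answer.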