I know that the [services.concurrency] section in fly.toml controls how many concurrent connections my app can accept, but how do I know how many I am actually using?
I see fly_app_tcp_connects_count in my exposed metrics, but that seems to count connection opens, not active connections. Or is this something I should be monitoring from within my (LiveView) application?
The context is I’m trying to debug an issue where some requests are timing out. I thought it had to do with deploys ("Error 1: Undocumented" after deploy & missing logs) but I’m still hitting these timeouts outside a deploy window. I suspect that I may be running out of concurrent connections. My fly.toml sets a soft limit of 100 and hard limit of 500 – how do I know if I’m regularly hitting 500 active connections?
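For context, the concurrency block in my fly.toml looks roughly like this (a sketch rather than a verbatim copy; the internal_port is just the usual Phoenix default and the rest of the service config is omitted):

```toml
[[services]]
  internal_port = 4000   # assumed Phoenix default, not my exact config
  protocol = "tcp"

  # As I understand it, the proxy deprioritizes an instance above the soft
  # limit and stops sending it new connections above the hard limit.
  [services.concurrency]
    type = "connections"
    soft_limit = 100
    hard_limit = 500
```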
As a side-note, I see livebeats uses 2500. What kind of VM is it deployed onto?
Doh… well, those numbers were so low I assumed it couldn’t be the right metric.
You’re right, it’s definitely not me hitting the concurrency limit.
If you have any insights as to what’s going on, or how to debug further, it would be much appreciated. Twilio is telling me that it hit a timeout calling a webhook (with 5 retries!) at 2022-02-23 01:57:23 UTC, and I can’t find any record of this in my logs. They also had failures at 2022-02-22 23:08:29, 23:08:27, 23:08:26, 23:08:23, and 23:08:19 UTC. None of these overlap with deploys, and now that I know I wasn’t running out of concurrent connections, I’m out of ideas.
All I’m getting from them is error 11200 and the information that they timed out after 15 seconds with 5 retries. The requests would have gone to either yyz or ewr; I’m not sure how to tell which instance they were connected to.
I don’t see any restarts, but I’ve deployed since this incident. Is there a way to see past restarts? I don’t see anything related to restarts in my activity dashboard or when searching the logs.
We have been seeing reports of slow response times. It seems to be consistent for affected users, but only from certain geographies, so we suspect that, for example, iad is running fine while mia is slow.
We don’t have a limit on response times exactly. If a connection doesn’t send or receive any bytes during a 60s window, we’ll close the connection. If it continuously sends or receives some data, we will not close the connection and the response will continue.
I’ve been looking at the metrics for the specific app you mentioned and it appears that the app itself is responding slowly.
Specifically, I’m seeing the same graph for both our proxy response times and your app’s response time. This would mean our proxy isn’t responsible for the slowness.
It seems like the app on Fly.io sometimes gets stuck talking to our database (on compose.com), but we’re not sure at the moment whether that’s a Fly.io issue, the DB connection pool, or some application logic.
AWS seems to be having networking issues in us-east-1 that are affecting database connections and requests to AWS hosted infrastructure. If you all are running DBs on AWS, that might be the source of inexplicable slowness.
The best way to test is to ssh into your app server, install mtr, and then run mtr <database-hostname> to see where the network actually takes you.
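Something along these lines (assuming a Debian/Ubuntu-based image; my-app is a placeholder for your app name):

```sh
# Open a shell on a running instance
fly ssh console -a my-app

# Inside the VM: install mtr (package name/manager depends on your base image)
apt-get update && apt-get install -y mtr

# Trace the network path to your database host
mtr <database-hostname>
```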
You might have better luck running your apps in ewr than iad if you’re using AWS services right now.
I see, that makes some sense, as our Compose DBs are hosted in us-east-1.
Incidentally, we closed the iad server today because we were seeing elevated failure rates on that node and removed it from the pool… and it does seem more stable now.
The AWS issues in Virginia wouldn’t have any impact in Chicago.
The first place to look when you have response time spikes is some kind of app performance monitoring. It’s worth installing Honeycomb or Datadog or something that can inspect DB queries to see if there’s anything that’s especially slow. I know you were getting pg_stats going, but instrumenting the app itself is often more helpful.
Tracing with Honeycomb will help you find all kinds of issues; it’s very much worth the effort to integrate.
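One lightweight way to get at the DB-query angle from inside the app, even before full tracing is wired up, is a telemetry handler that logs slow Ecto queries. A minimal sketch, assuming a standard Phoenix/Ecto setup where the repo emits [:my_app, :repo, :query] events (MyApp, the event prefix, and the 200ms threshold are all placeholders to adapt):

```elixir
defmodule MyApp.SlowQueryLogger do
  @moduledoc "Logs Ecto queries that take longer than a configurable threshold."
  require Logger

  # Attach once at application start, e.g. from MyApp.Application.start/2.
  def attach(threshold_ms \\ 200) do
    :telemetry.attach(
      "slow-query-logger",
      [:my_app, :repo, :query],          # default Ecto event prefix for MyApp.Repo
      &__MODULE__.handle_event/4,
      %{threshold_ms: threshold_ms}
    )
  end

  def handle_event(_event, measurements, metadata, %{threshold_ms: threshold_ms}) do
    total_ms = System.convert_time_unit(measurements.total_time, :native, :millisecond)

    if total_ms > threshold_ms do
      Logger.warning("Slow query (#{total_ms}ms from #{inspect(metadata.source)}): #{metadata.query}")
    end
  end
end
```

That won’t replace proper tracing, but it can quickly confirm whether the 1200ms refreshes line up with slow queries or whether the time is going somewhere else.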
Thanks @kurt - we will get some more tools added to track down the issues. We do have Sentry running performance monitoring on our GraphQL schema at the moment, but our biggest issue is the jump in response time from refresh to refresh: 100ms to 1200ms is huge, and there is nothing obvious in Sentry that would cause it.