How do I know how many connections are open?

I know that the [services.concurrency] section in fly.toml controls how many concurrent connections my app can accept, but how do I know how many I am actually using?

I see fly_app_tcp_connects_count in my exposed metrics, but that seems to count connection opens, not active connections. Or is this something I should be monitoring from within my (LiveView) application?

The context is I’m trying to debug an issue where some requests are timing out. I thought it had to do with deploys ("Error 1: Undocumented" after deploy & missing logs) but I’m still hitting these timeouts outside a deploy window. I suspect that I may be running out of concurrent connections. My fly.toml sets a soft limit of 100 and hard limit of 500 – how do I know if I’m regularly hitting 500 active connections?

As a side note, I see LiveBeats uses 2500. What kind of VM is it deployed onto?

The fly_app_concurrency metric is what you want; you can see it in our dashboard.
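
If you'd rather check it outside the dashboard, you can also query the Prometheus-compatible metrics API directly. A rough sketch, with the endpoint path, org slug, and label names as assumptions to verify against the Metrics on Fly docs:

# Max concurrency seen per instance over the last hour; compare against your hard_limit.
# Replace "personal" with your org slug and "my-app" with your app name.
curl -G "https://api.fly.io/prometheus/personal/api/v1/query" \
  -H "Authorization: Bearer $(fly auth token)" \
  --data-urlencode 'query=max_over_time(fly_app_concurrency{app="my-app"}[1h])'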

LiveBeats is on shared-CPU VMs with 2GB of RAM, I think.

You’ll actually see logs about the hard limit getting reached if that’s what’s happening, though.

Doh… well, those numbers were so low I assumed it couldn’t be the right metric.

You’re right it’s definitely not me hitting the concurrency limit.

If you have any insights as to what’s going on or how to debug further, it would be much appreciated. Twilio is telling me that it hit a timeout calling a webhook (with 5 retries!) at 2022-02-23 01:57:23 UTC, and I can’t find any record of this in my logs. They also had failures at 2022-02-22 23:08:29 UTC, 23:08:27 UTC, 23:08:26 UTC, 23:08:23 UTC, and 23:08:19 UTC. None of these overlap with deploys, and now that I know I wasn’t running out of concurrent connections, I’m out of ideas.

Do they give you any diagnostic info on those? And do you have any idea what region they’re connecting to?

One thing to check is app restarts. fly status will show you if your VMs have restarted due to either a crash or a health check failure.
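
Something along these lines (exact flags vary between flyctl versions, so check fly status --help):

# Show all allocations, including ones that have exited (crashed or replaced)
fly status --all -a your-app-name

# Inspect a specific VM's events, exit codes, and recent check results
fly vm status <vm-id> -a your-app-name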

All I’m getting from them is error 11200 and the information that they timed out after 15 seconds with 5 retries. It would have been to either yyz or ewr. Not sure how to tell which instance they were connected to.

I don’t see any restarts, but I’ve deployed since this incident - is there a way to see restarts in the past? I don’t see anything related to restarts in my activity dashboard or when searching logs.

We have been seeing reports of slow response times. It seems to be consistent for users, but only from certain geographies, so we suspect that, for example, iad is running fine but mia is slow.

We are trying the suggestion from here first: Slow response times? - #20 by kurt

Edited to add: restarting seems to help the situation, but then the slowness starts again the next day

We continue to see elevated failure rates and aren’t sure why users can’t connect to our fly.io instances.

Our fly.io logs show many lines like: [error]Error 2003: App connection closed before request/response completed

This error means that the connection between us and your app was closed before we could send a full request or receive a full response.

Can you show us your fly.toml and give us more details about your app?

We are seeing 10s+ response times (or requests that basically never get a response), but this resolves whenever we restart the app.

app = "falling-cherry-3608"

kill_signal = "SIGINT"
kill_timeout = 5

[env]
  PORT = "8080"

[experimental]
  allowed_public_ports = []
  auto_rollback = true

[[services]]
  http_checks = []
  internal_port = 8080
  protocol = "tcp"
  script_checks = []

  [services.concurrency]
    hard_limit = 2500
    soft_limit = 2000
    type = "requests"

  [[services.ports]]
    handlers = ["http"]
    port = 80

  [[services.ports]]
    handlers = ["tls", "http"]
    port = 443

  [[services.tcp_checks]]
    grace_period = "1s"
    interval = "15s"
    restart_limit = 6
    timeout = "2s"

Hmm, do you have a response limit? For example, does fly.io automatically cut off requests longer than 30s or something?

We don’t have a limit on response times exactly. If a connection doesn’t send or receive any bytes during a 60s window, we’ll close the connection. If it continuously sends or receives some data, we will not close the connection and the response will continue.

I’ve been looking at the metrics for the specific app you mentioned and it appears that the app itself is responding slowly.

Do you have metrics or logs that would give you latency numbers? We expose metrics to our users, and some of them, like fly_app_http_response_time_seconds, might help. You can set up a Grafana dashboard for this, as outlined in our docs. There’s also a pre-built Grafana dashboard with useful metrics.
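
For a quick look without a dashboard, here is a sketch of a query against the Prometheus-compatible metrics API (this assumes fly_app_http_response_time_seconds is exposed with _sum and _count series, and that the endpoint and auth shown here match the Metrics on Fly docs; double-check both, and substitute your own org slug):

# Average proxy-observed response time over the last 5 minutes for this app
curl -G "https://api.fly.io/prometheus/personal/api/v1/query" \
  -H "Authorization: Bearer $(fly auth token)" \
  --data-urlencode 'query=rate(fly_app_http_response_time_seconds_sum{app="falling-cherry-3608"}[5m]) / rate(fly_app_http_response_time_seconds_count{app="falling-cherry-3608"}[5m])'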

Specifically, I’m seeing the same graph for both our proxy response times and your app’s response time. This would mean our proxy isn’t responsible for the slowness.

Thanks for the clarification, we will take a look at a few other sources of latency.

We are also seeing this to be true; we have tried type = "requests" but the results seem to be the same.

Did you end up figuring this out?

It seems like fly.io is sometimes stuck talking to our database (on compose.com), but we're not sure at the moment whether that’s a fly.io issue, the DB connection pool, or some application logic.

AWS seems to be having networking issues in us-east-1 that are affecting database connections and requests to AWS hosted infrastructure. If you all are running DBs on AWS, that might be the source of inexplicable slowness.

The best way to test is to ssh into your app server, install mtr, and then run mtr <database-hostname> to see where the network actually takes you.
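
A rough sketch of those steps, assuming a Debian-based image (adjust the package manager otherwise):

# Open a shell inside the running VM
fly ssh console -a falling-cherry-3608

# Inside the VM: install mtr and trace the path to the database
apt-get update && apt-get install -y mtr
mtr --report <database-hostname>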

You might have better luck running your apps in ewr than iad if you’re using AWS services right now.

I see, that makes some sense, as our Compose DBs are hosted in us-east-1.

We accidentally closed the iad server today because we were seeing elevated failure rates on that node and removed it from the pool… and things do seem more stable.

Thanks Kurt, we are seeing these inconsistent response times with Fly-hosted apps and Fly Postgres instances that are all hosted in ord.

Is there anything that you can think of that might be causing this that we can look into?

Thanks!

The AWS issues in Virginia wouldn’t have any impact in Chicago.

The first place to look when you have response time spikes is some kind of app performance monitoring. It’s worth installing Honeycomb or Datadog or something that can inspect DB queries to see if there’s anything that’s especially slow. I know you were getting pg_stats going, but instrumenting the app itself is often more helpful.

Tracing with Honeycomb will help you find all kinds of issues, it’s very much worth the effort to integrate.


Thanks @kurt - we will get some more tools added to track down the issues. We do have Sentry running performance monitoring on our GraphQL schema at the moment, but our largest issue is the response time from refresh to refresh: 100ms to 1200ms is a huge jump, and there is nothing obvious in Sentry that would cause this.