I know that the [services.concurrency] section in fly.toml controls how many concurrent connections my app can accept, but how do I know how many I am actually using?
I see fly_app_tcp_connects_count in my exposed metrics, but that seems to count connection opens, not active connections. Or is this something I should be monitoring from within my (LiveView) application?
The context is I’m trying to debug an issue where some requests are timing out. I thought it had to do with deploys ("Error 1: Undocumented" after deploy & missing logs) but I’m still hitting these timeouts outside a deploy window. I suspect that I may be running out of concurrent connections. My fly.toml sets a soft limit of 100 and hard limit of 500 – how do I know if I’m regularly hitting 500 active connections?
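For context, the concurrency block in my fly.toml looks roughly like this (a sketch rather than a verbatim copy; the internal_port is just the usual Phoenix default and the rest of the service config is omitted):

```toml
[[services]]
  internal_port = 4000   # assumed Phoenix default, not my exact config
  protocol = "tcp"

  # As I understand it, the proxy deprioritizes an instance above the soft
  # limit and stops sending it new connections above the hard limit.
  [services.concurrency]
    type = "connections"
    soft_limit = 100
    hard_limit = 500
```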
As a side-note, I see livebeats uses 2500. What kind of VM is it deployed onto?
Doh… well, those numbers were so low I assumed it couldn’t be the right metric.
You’re right, it’s definitely not me hitting the concurrency limit.
If you have any insights as to what’s going on, or how to debug further, it would be much appreciated. Twilio is telling me that it hit a timeout calling a webhook (with 5 retries!) at 2022-02-23 01:57:23 UTC, and I can’t find any record of this in my logs. They also had failures at 2022-02-22 23:08:29, 23:08:27, 23:08:26, 23:08:23, and 23:08:19 UTC. None of these overlap with deploys, and now that I know I wasn’t running out of concurrent connections, I’m out of ideas.
All I’m getting from them is error 11200 and the information that they timed out after 15 seconds with 5 retries. The requests would have gone to either yyz or ewr; I’m not sure how to tell which instance they were connected to.
I don’t see any restarts, but I’ve deployed since this incident. Is there a way to see past restarts? I don’t see anything related to restarts in my activity dashboard or when searching the logs.
We have been seeing reports of slow response times. It seems to be consistent for affected users, but only from certain geographies, so we suspect that, for example, iad is running fine while mia is slow.
We don’t have a limit on response times exactly. If a connection doesn’t send or receive any bytes during a 60s window, we’ll close the connection. If it continuously sends or receives some data, we will not close the connection and the response will continue.
I’ve been looking at the metrics for the specific app you mentioned and it appears that the app itself is responding slowly.
Specifically, I’m seeing the same graph for both our proxy response times and your app’s response time. This would mean our proxy isn’t responsible for the slowness.
It seems like the app on Fly.io sometimes gets stuck talking to our database (on compose.com), but we’re not sure at the moment whether that’s a Fly.io issue, the DB connection pool, or some application logic.
AWS seems to be having networking issues in us-east-1 that are affecting database connections and requests to AWS hosted infrastructure. If you all are running DBs on AWS, that might be the source of inexplicable slowness.
The best way to test is to ssh into your app server, install mtr, and then run mtr <database-hostname> to see where the network actually takes you.
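Something along these lines (assuming a Debian/Ubuntu-based image; my-app is a placeholder for your app name):

```sh
# Open a shell on a running instance
fly ssh console -a my-app

# Inside the VM: install mtr (package name/manager depends on your base image)
apt-get update && apt-get install -y mtr

# Trace the network path to your database host
mtr <database-hostname>
```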
You might have better luck running your apps in ewr than iad if you’re using AWS services right now.
I see, that makes some sense, as our Compose DBs are hosted in us-east-1.
Incidentally, we closed the iad server today because we were seeing elevated failure rates on that node and removed it from the pool… and it does seem more stable now.
The AWS issues in Virginia wouldn’t have any impact in Chicago.
The first place to look when you have response time spikes is some kind of app performance monitoring. It’s worth installing Honeycomb or Datadog or something that can inspect DB queries to see if there’s anything that’s especially slow. I know you were getting pg_stats going, but instrumenting the app itself is often more helpful.
Tracing with Honeycomb will help you find all kinds of issues; it’s very much worth the effort to integrate.
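One lightweight way to get at the DB-query angle from inside the app, even before full tracing is wired up, is a telemetry handler that logs slow Ecto queries. A minimal sketch, assuming a standard Phoenix/Ecto setup where the repo emits [:my_app, :repo, :query] events (MyApp, the event prefix, and the 200ms threshold are all placeholders to adapt):

```elixir
defmodule MyApp.SlowQueryLogger do
  @moduledoc "Logs Ecto queries that take longer than a configurable threshold."
  require Logger

  # Attach once at application start, e.g. from MyApp.Application.start/2.
  def attach(threshold_ms \\ 200) do
    :telemetry.attach(
      "slow-query-logger",
      [:my_app, :repo, :query],          # default Ecto event prefix for MyApp.Repo
      &__MODULE__.handle_event/4,
      %{threshold_ms: threshold_ms}
    )
  end

  def handle_event(_event, measurements, metadata, %{threshold_ms: threshold_ms}) do
    total_ms = System.convert_time_unit(measurements.total_time, :native, :millisecond)

    if total_ms > threshold_ms do
      Logger.warning("Slow query (#{total_ms}ms from #{inspect(metadata.source)}): #{metadata.query}")
    end
  end
end
```

That won’t replace proper tracing, but it can quickly confirm whether the 1200ms refreshes line up with slow queries or whether the time is going somewhere else.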
Thanks @kurt - we will get some more tools added to track down the issues. We do have Sentry running performance monitoring on our GraphQL schema at the moment, but our biggest issue is the jump in response time from refresh to refresh: 100ms to 1200ms is huge, and there is nothing obvious in Sentry that would cause it.