Since day one, we have experienced very slow, basically unusable response times when our customers in Europe access our apps, which are all hosted in ORD.
For some background, these are pretty straightforward GraphQL APIs written with Node and Prisma.
When we hosted on Heroku, for instance, things stayed speedy with very little increase in response times. Is there anything we are doing wrong? It feels like there could be a routing issue.
Can you give us an example request? Also, do you know where in Europe?
One easy way to test this is to launch another app in Europe, SSH to it, and start using your own API. Europe <-> US requests should actually be faster than on Heroku because we do TLS termination in Europe.
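A minimal sketch of that test, assuming Node 18+ (for the global fetch and performance) and run from a machine in a European region, for example inside `fly ssh console` on the test app. The URL and request body you pass to probe are placeholders. Timing the DB-free health check next to a real GraphQL query should help separate network distance from time spent on the server.

```typescript
// Minimal latency probe, assuming Node 18+ (global fetch and performance).
// probe() is a hypothetical helper: pass it one of your own endpoints.

type Stats = { min: number; median: number; p95: number; max: number };

// Nearest-rank percentile over an already-sorted array (p in 0..100).
function percentile(sorted: number[], p: number): number {
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.min(sorted.length - 1, Math.max(0, rank - 1))];
}

function summarize(samplesMs: number[]): Stats {
  const sorted = [...samplesMs].sort((a, b) => a - b);
  return {
    min: sorted[0],
    median: percentile(sorted, 50),
    p95: percentile(sorted, 95),
    max: sorted[sorted.length - 1],
  };
}

// Times `runs` sequential calls to `request` and returns the raw samples in ms.
async function timeRequests(request: () => Promise<unknown>, runs: number): Promise<number[]> {
  const samples: number[] = [];
  for (let i = 0; i < runs; i++) {
    const start = performance.now();
    await request();
    samples.push(performance.now() - start);
  }
  return samples;
}

// Example usage (not executed here): GET for the health check, POST with a JSON body for GraphQL.
async function probe(url: string, body?: string): Promise<Stats> {
  const samples = await timeRequests(async () => {
    const res = await fetch(url, {
      method: body ? "POST" : "GET",
      headers: { "content-type": "application/json" },
      body,
    });
    await res.text(); // read the full body so transfer time is included
  }, 20);
  return summarize(samples);
}
```

Looking at the median and p95 rather than a single request makes it easier to tell a consistently slow path from occasional stalls.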
Thanks for such a quick reply! Let me gather a test request for you. Most of them are behind auth, so it might be best to DM it over to you.
Honestly, we see this issue almost everywhere outside the US; nowhere in particular, really anywhere overseas.
I will spin up a generic image in a few places over there and see if anything looks out of the norm.
Some other context: when we first started deploying on Fly a while back, we had our Postgres (PG) server in ORD and a few API servers spread across US regions, and we also saw really bad performance. My guess was that Prisma was making a ton of DB requests from each API server, so a TON of region-to-region round trips piled up.
My understanding is that, now that everything is in ORD, a user hitting our API from Europe should only add one trip to the API server per request and one back, with only a small amount of extra delay. Is that right?
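The reasoning above can be made concrete with a back-of-envelope model. This is a sketch under loud assumptions: every number below is illustrative rather than measured, and it assumes Prisma issues its queries one after another (the worst case). It shows why many small queries crossing regions hurt far more than a single long client round trip.

```typescript
// Back-of-envelope latency model. All numbers are hypothetical, for illustration only.
interface Topology {
  clientToApiRttMs: number; // one round trip between the user and the API server
  queriesPerRequest: number; // sequential DB queries issued for one GraphQL request
  apiToDbRttMs: number; // one round trip between the API server and Postgres
  computeMs: number; // time spent inside the Node process itself
}

// Assumes queries run strictly in sequence; parallel queries would overlap.
function estimateLatencyMs(t: Topology): number {
  return t.clientToApiRttMs + t.queriesPerRequest * t.apiToDbRttMs + t.computeMs;
}

// Earlier setup: API in another US region, Postgres in ORD, user in the US.
const splitRegions = estimateLatencyMs({
  clientToApiRttMs: 40,
  queriesPerRequest: 25,
  apiToDbRttMs: 40,
  computeMs: 10,
});

// Current setup: API and Postgres both in ORD, user in Europe.
const colocated = estimateLatencyMs({
  clientToApiRttMs: 100,
  queriesPerRequest: 25,
  apiToDbRttMs: 1,
  computeMs: 10,
});

console.log({ splitRegions, colocated }); // { splitRegions: 1050, colocated: 135 }
```

With these made-up inputs, the query count multiplies the API-to-DB round trip, so keeping the database next to the API matters more than the one long hop from Europe.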
Something odd: I was actually seeing some slow, hanging requests in the US as well, but after running an app restart, things are now running smoothly in the US too.
We do have a URL that we use as a health check. It does not touch the DB or anything else a typical request would, and it is also responding quickly now from an FRA instance via fly ssh (~100ms).
Is there anything at the Fly level that would cause something to be “fixed” by an app restart?
Not at the Fly level, no. Well, probably not. In theory there’s state in our proxies that could cause something like this, but we don’t see anything similar happening on other apps.
I have had this happen on Node apps before. If they’ve been running for days and grind to a halt, it might be a memory or event loop leak. If the health check itself slowed down, it’s probably something in the Node process. This could be exacerbated over long distances simply because packets take longer to make the round trip.
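One way to check that theory is to have the Node process log its own heap usage and event loop delay over time. A minimal sketch, assuming it is loaded once at app startup; the one-minute interval and the log format are arbitrary choices, not anything Fly or Prisma requires. It uses Node's built-in perf_hooks.monitorEventLoopDelay and process.memoryUsage.

```typescript
import { monitorEventLoopDelay } from "node:perf_hooks";

// Samples event loop delay; resolution is the sampling interval in ms.
const loopDelay = monitorEventLoopDelay({ resolution: 20 });
loopDelay.enable();

const MB = 1024 * 1024;

interface HealthSnapshot {
  uptimeSec: number;
  heapUsedMb: number;
  rssMb: number;
  loopDelayP99Ms: number;
  loopDelayMaxMs: number;
}

// Collects a snapshot, then resets the histogram so each reading covers one interval.
function snapshot(): HealthSnapshot {
  const mem = process.memoryUsage();
  const snap: HealthSnapshot = {
    uptimeSec: Math.round(process.uptime()),
    heapUsedMb: Math.round(mem.heapUsed / MB),
    rssMb: Math.round(mem.rss / MB),
    // Histogram values are reported in nanoseconds.
    loopDelayP99Ms: loopDelay.percentile(99) / 1e6,
    loopDelayMaxMs: loopDelay.max / 1e6,
  };
  loopDelay.reset();
  return snap;
}

// Log once a minute; unref() so this timer alone never keeps the process alive.
setInterval(() => console.log(JSON.stringify(snapshot())), 60_000).unref();
```

If heapUsedMb climbs steadily with uptime, a memory leak is the likely suspect; if loopDelayP99Ms rises, something is blocking the event loop. Either pattern would also explain why a restart temporarily “fixes” things.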
That makes a ton of sense, I appreciate the explanation.
I will let you know if we continue to see issues. Would you happen to have any suggestions for keeping the apps “fresh”, for example some kind of cron process that restarts them when no new deployments have gone out for a while?