Frankfurt, we have a problem.

Hello everyone,

A few days ago, deploying in FRA was nearly impossible, or at best very chaotic.

Yesterday, Postgres databases in FRA were down for at least an hour.

Is there something wrong with this region? Should I take the time to migrate to another region, perhaps AMS? These are the questions I asked myself yesterday evening when facing this issue.

My main SaaS product is used by restaurants to distribute their menu digitally. Yesterday, their customers were not able to see the menu… You can see why this is problematic.

Today, I’m trying to figure out what went wrong and how I should handle this in the future. Sure, Fly.io had difficulties. Such is life. Sh*t happens, and I don’t blame Fly.

But clearly, I forgot to have a contingency plan. That’s my fault.

Can someone give me guidance on what I should do? I’m not an infrastructure expert, but I’d really like to level up, and perhaps this thread could help others in the future.

What I have in mind currently is:

Have a replica of the database in another region? Currently I have a two-instance setup (master & slave), but I believe they are both in the same region, so it didn’t help when FRA was down.

Have a backup Node process in case it’s not the database but the datacenter running the Express (Remix) server that is down. How am I supposed to do that? Am I supposed to play with some load balancing/nginx to achieve that?

These are the two ideas I have, but I’m sure the question is bigger, and the answer more complex, than these two food-for-thought ideas.

I’d love it if someone could help me figure out what the best strategy is.


Hi,

Well … to take the FRA issue first: according to the status page, it was just one host that failed (Fly.io Status - FRA - application host failure).

I’d suggest splitting the problem in two: making the compute highly available (HA) is one problem, and making the database (state) HA is another.

Making the compute (app) globally/highly available is the easier part. The simplest option is to deploy to multiple regions, which Fly supports out of the box :slight_smile:. You get an anycast IP that points each user at the closest region. So … let’s say you deploy to lhr and fra: you then have two VMs in two regions. If fra is down, your customers in the UK (for example) would continue to be served by the other VM in lhr. So you would get redundancy with a single command (a larger region pool).

The question then is what would happen to your users in France, since anycast would point them to the VM in fra :thinking:. That depends on what Fly does with those requests, and that I don’t know. I’d assume a region is not taken out of service (with all its requests diverted elsewhere, in this case to lhr) unless something goes very wrong. If so, your users in France would see downtime until your app/VM was brought back, e.g. on another host (as would have happened yesterday once someone from Fly fixed the issue with the faulty host).

So … how would you solve that (so users in France would continue to be served despite a host in fra being down)? That’s where it gets trickier. If Fly’s routing keeps trying to send requests to a region/host that is down, you’d have to put your own logic in front … somehow. There are a variety of approaches. You could do it with DNS (for example, AWS Route53 has health checks). Or you could add a load balancer, as you say (managed, or self-hosted like HAProxy). Of course that moves the HA problem somewhere else: now your load balancer has to be up, or else nobody gets any response (whether in the UK, France, or anywhere else!). So where do you put that :thinking:?

I happened to be thinking about this yesterday, since I was looking at how other companies do it. I found one interesting approach used by Doppler. They happen to use Google Cloud for their servers, but the interesting part is that they don’t point their clients’ requests directly at the GCP load balancer. Instead they use a Cloudflare Worker, which checks that the region is up. If it is not, the Worker seamlessly fails over to another one. You can read more about that here: Secrets Manager (scroll up for their infrastructure diagram). I was wondering about something similar with Fly instead of GCP. So the Worker (which is free, at least below a certain usage level each month) accepts the request and decides where to send it, making sure the server is healthy before doing so. I haven’t tried it, but it could be worth playing with.
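To make the Worker idea a bit more concrete, here is a rough TypeScript sketch of the failover logic. I haven’t run this against Fly, and it is not Doppler’s actual code. The hostnames, the `/healthz` path, and every function name below are made up for illustration; the only assumption about your app is that each region answers a health request with a 2xx status when it is up.

```typescript
// Hypothetical per-region hostnames: replace with wherever your app is reachable.
const ORIGINS: string[] = [
  "https://menu-fra.example.com",
  "https://menu-lhr.example.com",
];

type HealthCheck = (origin: string) => Promise<boolean>;

// Put the preferred origin first and keep the others as ordered fallbacks.
function preferFirst(origins: string[], preferred: string): string[] {
  if (origins.indexOf(preferred) === -1) return [...origins];
  return [preferred, ...origins.filter((o) => o !== preferred)];
}

// Return the first origin whose health check passes, trying them in order.
// A check that throws (timeout, DNS failure, ...) counts as unhealthy.
async function pickHealthyOrigin(
  origins: string[],
  isHealthy: HealthCheck,
): Promise<string | null> {
  for (const origin of origins) {
    try {
      if (await isHealthy(origin)) return origin;
    } catch {
      // Treat errors as "down" and move on to the next origin.
    }
  }
  return null;
}

// HTTP health check. Assumes a GET /healthz endpoint (hypothetical path)
// that returns a 2xx status; gives up after timeoutMs milliseconds.
async function httpHealthCheck(origin: string, timeoutMs = 2000): Promise<boolean> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    const res = await fetch(`${origin}/healthz`, { signal: controller.signal });
    return res.ok;
  } finally {
    clearTimeout(timer);
  }
}

// Worker-style handler: forward the request to the first healthy region,
// or answer 503 if none of them responds.
async function handleRequest(request: Request): Promise<Response> {
  const origin = await pickHealthyOrigin(ORIGINS, httpHealthCheck);
  if (origin === null) {
    return new Response("No healthy region available", { status: 503 });
  }
  const incoming = new URL(request.url);
  const target = new URL(incoming.pathname + incoming.search, origin);
  // Copy method, headers and body from the incoming request onto the new URL.
  return fetch(new Request(target.toString(), request));
}
```

In a real Worker you would call `handleRequest` from the fetch handler. Checking health on every request adds a round trip, so caching the result for a few seconds is probably sensible; `preferFirst` is there in case you want to try the region nearest the user before the fallbacks.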

A globally consistent database (state) is a whole other question. One for another time :slight_smile:.

Thank you for taking the time to answer my question. I must say I’m facing new terminology, but at least I know what to look for!

Can’t wait for part two regarding the databases :wink:

I feel guidance or blog posts from Fly or any expert would be appreciated by the community. Learning how to build strong and resilient infrastructure would really help in the long run and create a win-win situation.
