I’ve been running my application on Fly.io for about a year now, and it’s been awesome!
Unfortunately, I lost a few clients over the past month, so I’m currently looking for ways to reduce my infrastructure costs by around 30%.
My app has close to zero usage during the night, so I implemented a scheduled GitHub Action that scales CPU and RAM up and down for both Nginx and the app itself, depending on the time of day.
I’m also running a legacy Postgres cluster with 3 replicas (each on shared-cpu-1x@512MB), and I’m planning to apply the following changes using the Fly API (a rough sketch of the nightly step follows the list):
During the day:
Scale up the primary to shared-cpu-2x@512MB
Keep the replicas at shared-cpu-1x@512MB
During the night:
Scale down the primary to shared-cpu-1x@512MB
Scale down the replicas to shared-cpu-1x@256MB
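For reference, the nightly step in the Action would look roughly like this (shown with flyctl rather than raw API calls). It’s only a sketch: the app name and machine IDs are placeholders, and I still need to verify the exact flags against `fly machine update --help`.

```bash
#!/usr/bin/env bash
# Nightly scale-down sketch, run from a scheduled GitHub Action step.
# App name and machine IDs below are placeholders; double-check the
# `fly machine update` flag names before relying on this.
set -euo pipefail

PG_APP="my-pg-cluster"                        # placeholder app name
PRIMARY="e2865xxxxxxxxx"                      # placeholder machine ID
REPLICAS=("148e4yyyyyyyyy" "9080ezzzzzzzzz")  # placeholder machine IDs

# Scale the replicas down to shared-cpu-1x@256MB first.
for id in "${REPLICAS[@]}"; do
  fly machine update "$id" --app "$PG_APP" \
    --vm-size shared-cpu-1x --vm-memory 256 --yes
done

# Then drop the primary back to shared-cpu-1x@512MB. This restarts the
# Machine, so I expect a brief interruption/failover here.
fly machine update "$PRIMARY" --app "$PG_APP" \
  --vm-size shared-cpu-1x --vm-memory 512 --yes
```

The daytime step would be the same commands with the sizes reversed.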
I’d appreciate some community input on a few points:
Does Postgres handle slower replicas gracefully?
Do I need to change any internal Postgres configurations to support this?
Are there any downsides to this approach?
I’m aware that in the event of a primary failure, performance could be degraded until the cluster is fully restored.
Are there any potential issues I might be overlooking?
Hm… My own intuition is that this is fairly risky, overall, although I completely understand the motivation…
Reducing RAM is one of the cases that the official docs highlight (with a marker and a call-out box on a bright orange background) as needing extra care.
I don’t know the Fly Postgres internals well enough to say whether you could convince it to use one set of Postgres-specific knobs for Machine A but a different set for Machines B and C. Last time I glanced through the source code, it looked like it was trying to enforce the same settings for all machines.
(And if you can’t convince it, then you’ll have extra RAM that you (mostly) can’t actually take advantage of.)
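For what it’s worth, you can at least see which memory-derived settings the cluster is actually running with before and after a resize. Nothing fancy, just psql via `fly postgres connect` (the app name is a placeholder):

```bash
# Open psql against the cluster; "my-pg-cluster" is a placeholder app name.
fly postgres connect --app my-pg-cluster

# ...then inside psql, check the memory-derived settings:
#   SHOW shared_buffers;
#   SHOW effective_cache_size;
```

If those stay tuned for the smaller Machines, the 512MB primary mostly can’t use its extra RAM; if they stay tuned for 512MB, the 256MB replicas risk running out of memory.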
Slower CPU (as opposed to less RAM) is maybe tolerable in some cases, but it still sits firmly in the “not recommended” bracket, as explained in the following quote, which you may have already seen…
On that same note, your nightly scaling of the primary may trigger a leadership change on its own, since that Machine inherently has to be stopped and restarted as part of the update. At a minimum, I think you’d need an extra step to transfer the role back once health checks have all passed again.
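If you do go ahead, I’d at least gate the primary resize on the health checks settling and then verify where the leader actually ended up. Very much a sketch, assuming the placeholder app name below, a crude grep against `fly status` output, and that you can reach each node with psql somehow:

```bash
# Crude wait: poll `fly status` until no checks report critical/warning.
# "my-pg-cluster" is a placeholder; adjust the grep to what your output shows.
PG_APP="my-pg-cluster"
while fly status --app "$PG_APP" | grep -Eq "critical|warning"; do
  sleep 10
done

# Then ask Postgres itself which node is the leader: pg_is_in_recovery()
# returns false only on the primary. Run it on each node however you
# normally reach them (e.g. `fly ssh console --machine <id>` plus psql;
# the exact invocation and auth depend on the image):
#   SELECT pg_is_in_recovery();
```

I’m not sure off-hand what the legacy image gives you for forcing the role back onto a specific Machine, so that last step may end up being manual.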
This is a lot of things that could go wrong some evening…
Just as another idea: could you swap to SQLite with LiteFS? For small projects it sounds ideal, and of course it doesn’t need a dedicated database server at all.