We fixed an issue where MPG backup jobs could get stuck retrying forever and sometimes needed a manual nudge to start behaving again.
What was going on?
As you may (or may not) know, our MPG clusters run on top of our custom Postgres Operator, which runs on top of our custom Kubernetes distribution that orchestrates resources using Fly’s infrastructure.
To keep clusters safe, all communication to and from MPG clusters requires valid certificates. For every service that interacts with a cluster (say, backups), we create a new certificate. These certificates are stored as Fly Secrets and exposed to Kubernetes via a custom Virtual Kubelet implementation.
These certificates are constantly rotated. To pick up the rotated certificates, machines need to be restarted.
The backup server runs a process that checks for file changes, but it did not pick up secret changes. As a result, the backup server was never restarted with the new certificates, and backup jobs retried indefinitely because they could not connect to the database.
How did we fix it?
We changed how we track certificate changes. Instead of relying on a separate process to tell us when certificates changed, we now track when the secret object is updated during reconciliation and record that in an annotation on the backup server StatefulSet. Whenever a secret changes, the annotation changes with it, which triggers an automatic version rollout.
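To make the idea concrete, here is a minimal sketch of the annotation pattern, not our actual operator code. It assumes the tracked value is a checksum of the secret's contents and that the annotation lives on the StatefulSet's pod template; the annotation key, the function names, and the plain-dict representation of the StatefulSet are all hypothetical and chosen only for illustration.

```python
import hashlib

# Hypothetical annotation key; the real key used by the operator is not stated.
CERT_ANNOTATION = "example.com/cert-checksum"


def secret_checksum(secret_data: dict) -> str:
    """Return a stable SHA-256 digest over the secret's keys and values.

    secret_data maps str keys to bytes values, like a Kubernetes Secret's data.
    """
    digest = hashlib.sha256()
    for key in sorted(secret_data):
        name = key.encode()
        value = secret_data[key]
        # Length-prefix each field so different key/value splits cannot collide.
        digest.update(len(name).to_bytes(8, "big"))
        digest.update(name)
        digest.update(len(value).to_bytes(8, "big"))
        digest.update(value)
    return digest.hexdigest()


def reconcile(statefulset: dict, secret_data: dict) -> bool:
    """Stamp the checksum on the pod template; return True if it changed."""
    annotations = (
        statefulset.setdefault("spec", {})
        .setdefault("template", {})
        .setdefault("metadata", {})
        .setdefault("annotations", {})
    )
    checksum = secret_checksum(secret_data)
    if annotations.get(CERT_ANNOTATION) == checksum:
        return False  # Secret unchanged: leave the StatefulSet alone.
    # In a real cluster, changing the pod template is what causes a
    # StatefulSet with the RollingUpdate strategy to replace its pods.
    annotations[CERT_ANNOTATION] = checksum
    return True
```

Calling `reconcile` repeatedly with the same secret leaves the StatefulSet untouched, so a rollout would only happen when the certificate actually changes. The appeal of this design is that the restart decision is made inside the reconciliation loop itself rather than depending on a separate watcher noticing the change.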
When will this be available?
We are gradually rolling out this fix over a week, starting last Wednesday. By the end of this week, all MPG clusters will be updated. If you are an MPG customer, you shouldn’t notice a thing (apart from faster backups).