In the last couple of days I’ve started receiving 503 errors when creating apps via the Machines API, and I believe they’re coming from the volumes endpoint:
https://api.machines.dev/v1/apps/<app>/volumes
I’ve been creating volumes successfully for months now and this has only just started affecting me. It seems to affect about 50% of requests, so I wondered if anyone else is experiencing this?
Hi, @tommyd. Thanks for calling this out and sorry for not replying more quickly. I’m an L4 working on core orchestration here.
What’s happening here is that our platform is doing a bad job communicating about a transient delay. We’re showing you 503 errors (see screenshot below), which you can’t do much with other than worry that the whole platform is broken, when really what we need to be communicating is “try this again in a minute”.
Behind the scenes, what’s happening here is that when we create volumes, our orchestration layer has to link up with our secret storage layer to obtain encryption keys. For many years now, that secret storage layer has been HashiCorp Vault, which operates out of a central Raft cluster in IAD, far from ARN.
We’re just wrapping up a long project to augment Vault with our own in-house secret store, Pet Sematary, to specifically eliminate transient delays like these. That’s rolling out now, and should smooth out issues like this.
Regardless of the error message we’re generating to customers, when these things happen, our infra ops team is notified, and we watch this stuff carefully. But we do need to do a better job helping you understand what’s happening.
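In the meantime, the “try this again in a minute” advice can be applied on the client side. What follows is only a rough sketch of retrying on 503 with exponential backoff and jitter, not official client code. The helper names (`retry_on_503`, `post_volume`), the bearer-token header, and the delay values are assumptions made for illustration; only the endpoint path comes from this thread.

```python
import json
import random
import time
import urllib.error
import urllib.request
from typing import Callable, Tuple

API_BASE = "https://api.machines.dev/v1"


def retry_on_503(
    send: Callable[[], Tuple[int, bytes]],
    max_attempts: int = 5,
    base_delay: float = 2.0,
    max_delay: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, bytes]:
    """Call send() until it returns a non-503 status or attempts run out."""
    status, body = send()
    for attempt in range(1, max_attempts):
        if status != 503:
            break
        # Exponential backoff (base, 2x base, 4x base, ...) capped at max_delay,
        # plus up to 50% random jitter so parallel clients do not retry in lockstep.
        delay = min(max_delay, base_delay * 2 ** (attempt - 1))
        sleep(delay + random.uniform(0, delay / 2))
        status, body = send()
    return status, body


def post_volume(app: str, token: str, payload: dict) -> Tuple[int, bytes]:
    """POST to the volumes endpoint once; return (status, body) without raising on HTTP errors."""
    req = urllib.request.Request(
        f"{API_BASE}/apps/{app}/volumes",
        data=json.dumps(payload).encode(),
        # Assumption: bearer-token auth with a JSON body.
        headers={"Authorization": f"Bearer {token}", "Content-Type": "application/json"},
        method="POST",
    )
    try:
        with urllib.request.urlopen(req, timeout=30) as resp:
            return resp.status, resp.read()
    except urllib.error.HTTPError as err:
        return err.code, err.read()
```

A call might look like `retry_on_503(lambda: post_volume(app, token, payload))`, where `payload` holds whatever fields you already send today. One caveat: a 503 does not guarantee the volume was not created, so it is worth listing the app’s volumes before retrying if duplicates would be a problem.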
Thanks for the reply. It seems to have got worse in the last 24 hours or so: almost all requests to create a volume are now failing with 503s. I look forward to your updates on a fix.