Still getting the error, three days in, and still no communication about it.
Can you please keep us in the loop?
We can understand if it’s complicated or whatever, but having no updates at all is very frustrating. It’s not the first time something like this has happened (no communication at all).
You’re right, but this only builds the image and pushes it to the registry.
It does NOT deploy it, and the problem remains the same if you try to deploy it with: flyctl deploy --image registry.fly.io/your-app-name:deployment-RANDOMKEY
But thanks for pointing out that the image builds correctly; it may help the Fly team solve this quicker.
Now as @AsymetricalData mentioned, the team’s communication is definitely lacking + there’s nothing on https://status.flyio.net/ as if the problem wasn’t even being taken care of or considered…
This is getting very frustrating, especially when I was making a whole tutorial around it…
I feel you @Archer and I won’t defend them on this one…
First off, we’re really sorry this persisted as long as it did, and we definitely could have done more to communicate what we were doing behind the scenes to try to both reproduce the issue and fix it.
This was a weird one to debug. In the past couple of weeks we’ve been testing out operating system updates on our host fleet. You’ve probably had to update a major OS release before and know that it can be a slog to go over every possible settings change and ensure things are performing as expected. We were pretty confident last week that we’d ironed out all the remaining bugs, so we moved a really tiny subset of traffic over to some newly rebuilt servers. It was a small enough amount that the error described here wasn’t showing up clearly in our aggregated logs. And the error was infrequent enough on a single host that we weren’t catching it in the host-level logs.
Way down deep in a template inside our config management system we have a conditional that looks for the server’s role in our fleet and uses that to make decisions about how to configure anycast IPs on that host. In order to test the new OS build we slightly changed the server role name. (You can probably see where this is going…) The couple of newly provisioned hosts ended up bringing up the anycast IP for the public fly.dev DNS service. You wouldn’t notice this from the server OS because DNS follows a different path there than it does from inside a Firecracker VM.
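To make the failure mode concrete, here is a minimal sketch of how a role conditional with a catch-all branch can misfire when a role gets renamed. The actual template isn’t shown in this thread, so the role names (`worker`, `dns`, `worker-newos`), the service label `public-dns`, and both functions are hypothetical and only illustrate the general pattern.

```python
# Hypothetical illustration only: the real template, role names, and
# service labels are not shown in this thread.

def anycast_services_fragile(role: str) -> list[str]:
    """Catch-all style: any role that is not exactly 'worker' is
    treated as a DNS host and announces the public DNS anycast IP."""
    if role == "worker":
        return []
    return ["public-dns"]


# Explicit mapping from known roles to the anycast services they announce.
KNOWN_ROLES: dict[str, list[str]] = {
    "worker": [],
    "dns": ["public-dns"],
}


def anycast_services_strict(role: str) -> list[str]:
    """Explicit style: an unknown role is an error, not a silent default."""
    if role not in KNOWN_ROLES:
        raise ValueError(f"unknown server role: {role!r}")
    return KNOWN_ROLES[role]


if __name__ == "__main__":
    renamed = "worker-newos"  # hypothetical renamed role for the OS test
    # The catch-all version quietly announces DNS anycast on a worker.
    print(anycast_services_fragile(renamed))
    try:
        anycast_services_strict(renamed)
    except ValueError as exc:
        # The strict version fails loudly at provisioning time instead.
        print(exc)
```

The point of the strict variant is that a renamed role fails at provisioning time rather than silently picking up a default.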
But, from within a VM you’ll end up triggering a recursive query for the registry service that is routed to the local server (instead of the global DNS service) which times out. The “server misbehaving” error was misleading and it required some sleuthing to determine exactly what in the path was failing and which timeout was being reached. Ultimately, patching the config management scripts and applying the changes to the server fixed the problem.
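For anyone wanting to check this kind of thing from inside a VM, a rough sketch of a probe that sends one DNS query straight to a given resolver and reports whether it answered or timed out. It uses only the Python standard library; the function names, the 2-second default timeout, and the choice of an A-record query are assumptions made for the example, not anything Fly uses.

```python
import socket
import struct


def build_query(hostname: str, query_id: int = 0x1234) -> bytes:
    """Encode a minimal DNS A-record query in RFC 1035 wire format."""
    # Header: ID, flags (recursion desired), 1 question, no other records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    labels = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in hostname.rstrip(".").split(".")
    )
    # Root label terminator, then QTYPE=A (1) and QCLASS=IN (1).
    return header + labels + b"\x00" + struct.pack("!HH", 1, 1)


def probe_resolver(server: str, hostname: str,
                   port: int = 53, timeout: float = 2.0) -> str:
    """Send one UDP query to `server` and classify the outcome."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(build_query(hostname), (server, port))
            reply, _ = sock.recvfrom(512)
        except socket.timeout:
            # No reply at all: the resolver (or something on the path) is silent.
            return "timeout"
        except OSError as exc:
            return f"error: {exc}"
    if len(reply) < 4:
        return "error: short reply"
    # The low 4 bits of the fourth header byte hold the response code.
    return f"answered (rcode={reply[3] & 0x0F})"
```

Calling `probe_resolver("<resolver-ip>", "registry.fly.io")` against each resolver in the path makes it easier to see which hop is silent, rather than relying on a generic error message.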
The main problem is not that the incident lasted so long, but the lack of communication (as always with Fly).
I think everyone would have appreciated just a message telling us that you were working on it.
Without communication, we didn’t even know if you were aware of the problem, or even if it was being resolved. Very interesting explanation, by the way.