Server misbehaving

AsymetricalData · May 29, 2023, 12:35pm

The error is still happening, three days in, and there has still been no communication about it.
Can you please keep us updated?

We can understand if it’s complicated or whatever, but having no updates at all is very frustrating. It’s not the first time something like this has happened (no communication at all).

Please keep us updated.

FrankSilver · May 29, 2023, 3:01pm

Hi @mfilej,

flyctl deploy --build-only

You’re right, but this only builds the image and pushes it to the registry.

It does NOT deploy it, and the problem remains the same if you then try to deploy it with:
flyctl deploy --image registry.fly.io/your-app-name:deployment-RANDOMKEY

But thanks for pointing out that the image builds correctly; it may help the Fly team solve this more quickly.

Now, as @AsymetricalData mentioned, the team’s communication is definitely lacking, and there’s nothing on https://status.flyio.net/, as if the problem weren’t even being looked at or considered…

This is getting very frustrating, especially when I was making a whole tutorial around it…
I feel you @Archer and I won’t defend them on this one…

rugwiro · May 29, 2023, 5:40pm

There is an ongoing issue with a DNS nameserver that is affecting some regions. The team is hard at work on a fix.

jssjr · May 29, 2023, 6:58pm

First off, we’re really sorry this persisted as long as it did. We definitely could have done more to communicate what we were doing behind the scenes to reproduce the issue and fix it.

This was a weird one to debug. In the past couple of weeks we’ve been testing out operating system updates on our host fleet. You’ve probably had to update to a major OS release before and know that it can be a slog to go over every possible settings change and ensure things are performing as expected. We were pretty confident last week that we’d ironed out all the remaining bugs, so we moved a really tiny subset of traffic over to some newly rebuilt servers. That subset was small enough that the error described here wasn’t showing up clearly in our aggregated logs, and the error was infrequent enough on a single host that we weren’t catching it in the host-level logs.

Way down deep in a template inside our config management system we have a conditional that looks for the server’s role in our fleet and uses that to make decisions about how to configure anycast IPs on that host. In order to test the new OS build we slightly changed the server role name. (You can probably see where this is going…) The couple of newly provisioned hosts ended up bringing up the anycast IP for the public fly.dev DNS service. You wouldn’t notice this from the server OS, because DNS on the host follows a different path than it does inside a Firecracker VM.
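For readers who want a concrete picture of this failure mode, here is a minimal sketch in Python. It is not Fly's actual template: the role names (`worker`, `dns`, `worker-newos`), the service label `public-dns`, and both function names are hypothetical, chosen only to show how a conditional keyed on an exact role name can change behavior when the role is renamed.

```python
from typing import Dict, List

DNS_ANYCAST = "public-dns"  # hypothetical label for the public DNS anycast IP


def anycast_services_fragile(role: str) -> List[str]:
    """Negative match: any role other than 'worker' is treated as a DNS host."""
    if role != "worker":
        return [DNS_ANYCAST]
    return []


# Explicit table of known roles (hypothetical names).
KNOWN_ROLES: Dict[str, List[str]] = {
    "worker": [],
    "dns": [DNS_ANYCAST],
}


def anycast_services_strict(role: str) -> List[str]:
    """Allowlist: a role that is not in the table is an error, not a default."""
    try:
        return list(KNOWN_ROLES[role])
    except KeyError:
        raise ValueError(f"unknown server role: {role!r}") from None


# A renamed role silently picks up the DNS anycast IP in the fragile version...
print(anycast_services_fragile("worker-newos"))

# ...while the strict version refuses to guess.
try:
    anycast_services_strict("worker-newos")
except ValueError as exc:
    print(exc)
```

The strict variant trades convenience for safety: a renamed role fails loudly at config time instead of silently falling into a default branch.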

From within a VM, however, you end up triggering a recursive query for the registry service that is routed to the local server (instead of the global DNS service), and that query times out. The “server misbehaving” error was misleading, and it required some sleuthing to determine exactly what in the path was failing and which timeout was being reached. Ultimately, patching the config management scripts and applying the changes to the server fixed the problem.
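When triaging a report like this, it can help to pull the pieces out of the error text: which name was looked up, and which resolver answered. The sketch below assumes the message has the shape users reported for this failure (`lookup registry.fly.io on [fdaa::3]:53: server misbehaving`); `LookupFailure` and `parse_lookup_failure` are hypothetical names, and the pattern is not guaranteed to cover every variant of the message.

```python
import re
from typing import NamedTuple, Optional


class LookupFailure(NamedTuple):
    host: str      # name that was being resolved
    resolver: str  # address of the DNS server that was queried
    port: int
    reason: str    # trailing error text, e.g. "server misbehaving"


# Matches "lookup <host> on <resolver>:<port>: <reason>", where the
# resolver may be a bracketed IPv6 address or a bare IPv4 address/hostname.
_LOOKUP_RE = re.compile(
    r"lookup (?P<host>\S+) on "
    r"(?P<resolver>\[[^\]]+\]|[^\s:]+):(?P<port>\d+): (?P<reason>.+)$"
)


def parse_lookup_failure(message: str) -> Optional[LookupFailure]:
    """Return the parsed lookup failure, or None if the message does not match."""
    match = _LOOKUP_RE.search(message)
    if match is None:
        return None
    return LookupFailure(
        host=match["host"],
        resolver=match["resolver"].strip("[]"),
        port=int(match["port"]),
        reason=match["reason"],
    )


msg = (
    'Get "https://registry.fly.io/v2/": dial tcp: '
    "lookup registry.fly.io on [fdaa::3]:53: server misbehaving"
)
print(parse_lookup_failure(msg))
```

The resolver address in the message tells you which DNS server returned the failure, which is a useful first clue when the error text itself says little about the underlying timeout.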

orthoplant · May 29, 2023, 7:04pm

It is fixed! Thank you so much for your hard work.

FrankSilver · May 29, 2023, 7:55pm

Hi @jssjr!

Thanks for the explanation and the hard work.
I can confirm it is now fixed!

This was driving me nuts!

AeonSake · May 29, 2023, 8:15pm

Deploying works again, thanks for the fix!

AsymetricalData · May 30, 2023, 9:28am

Thanks, it has worked great since your message.

The main problem is not that the incident lasted so long, but the lack of communication (as always with Fly).

I think everyone would have appreciated just a message telling us that you were working on it.
Without communication, we didn’t even know if you were aware of the problem, or even if it was being resolved. Very interesting explanation, by the way.

