fly machine "Machine not ready yet" loop

Hi, I was just working on a Firecracker-based system on AWS when I saw fly machines launch and started talking to @amos. They gave me some encouragement and I ported some code around, pushed a docker image, and tried to launch a machine using the info at Machines. The first image failed because I forgot the entrypoint, the second seemed to have some networking setup issues, and the nth config update is probably something else.

Point being, that I feel like I’m missing something because all of the logs for the issues I’ve been encountering end up not having very much information. Often just looping “Machine not ready yet” over and over. Is there somewhere other than /apps/<app>/monitoring that I should be looking for information about why the machine isn’t ready yet? or another way to debug? The “events” page also seems to be empty.

2022-05-25T08:05:23.294 proxy[3908051df6d875] sjc [info] Machine not ready yet (5.415558368s since start requested)

2022-05-25T08:05:28.419 proxy[3908051df6d875] sjc [info] Machine not ready yet (10.539849608s since start requested)

2022-05-25T08:05:35.454 proxy[3908051df6d875] sjc [info] Machine not ready yet (17.574908223s since start requested)

2022-05-25T08:05:40.547 proxy[3908051df6d875] sjc [info] Machine not ready yet (22.668740114s since start requested)

2022-05-25T08:05:47.531 proxy[3908051df6d875] sjc [info] Machine not ready yet (29.652164892s since start requested)

2022-05-25T08:05:52.670 proxy[3908051df6d875] sjc [info] Machine not ready yet (34.791697834s since start requested)

2022-05-25T08:05:59.737 proxy[3908051df6d875] sjc [info] Machine not ready yet (41.857842707s since start requested)

2022-05-25T08:06:04.863 proxy[3908051df6d875] sjc [info] Machine not ready yet (46.984295216s since start requested)

2022-05-25T08:06:11.929 proxy[3908051df6d875] sjc [info] Machine not ready yet (54.050366837s since start requested)

2022-05-25T08:06:17.054 proxy[3908051df6d875] sjc [info] Machine not ready yet (59.175743782s since start requested)

You can use fly logs -a appname. That should stream stdout from all machines in that app. We should document this! We’re just so used to using flyctl commands that it’s not obvious which ones work with machines.

Hey @chrisbiscardi ! So fly-proxy shows that when it wasn’t able to connect to the machine yet. There could be a couple reasons:

  • Your app is not starting at all? Does it print something to stdout/stderr when it starts listening on some port? (fly logs -a appname after fly m start ...) should show that if it does.
  • Your app is starting but it’s only listening on, say, localhost:8080, not 0.0.0.0:8080 (happened to me recently)
  • Your app is listening on 0.0.0.0 but not on the internal port fly-proxy thinks it is (it defaults to 8080, you can override it: App Configuration (fly.toml) · Fly Docs

The container is definitely running now, I forgot to listen on 0.0.0.0. I’m now more aware of what proxy/runner/etc log names stand for in fly land.

I saw some spam requests come in over-night (before the 0.0.0.0 fix), but I never see any of my own requests make it into the logs.

error.message="App is unhealthy" 2022-05-25T10:58:26Z proxy[3908051df6d875] cdg [error]error.code=2011 request.method="POST" request.url="/cgi-bin/index2.asp" request.id="01G3XDSQQZCRFBXM7W44P75CM1-cdg" response.status=502
error.message="App is unhealthy" 2022-05-25T12:22:33Z proxy[3908051df6d875] nrt [error]error.code=2011 request.method="POST" request.url="/cgi-bin/ViewLog.asp" request.id="01G3XJKRC3F5R5T6FZB04ESP1G-nrt" response.status=502
error.message="App is unhealthy" 2022-05-25T12:29:07Z proxy[3908051df6d875] nrt [error]error.code=2011 request.method="POST" request.url="/cgi-bin/ViewLog.asp" request.id="01G3XJZRW1ZFFPWPQDGQW3Y3W7-nrt" response.status=502
error.message="App is unhealthy" 2022-05-25T13:21:58Z proxy[3908051df6d875] nrt [error]error.code=2011 request.method="POST" request.url="/editBlackAndWhiteList" request.id="01G3XP0H08Q5KBDM3AAGHT0N5S-nrt" response.status=502
error.message="App is unhealthy" 2022-05-25T18:59:18Z proxy[3908051df6d875] nrt [error]error.code=2011 request.method="POST" request.url="/editBlackAndWhiteList" request.id="01G3Y9A7EQ375X20VQ4T035121-nrt" response.status=502

When I try to use the app subdomain (myapp.fly.dev) I get connection closed via error and
when I try to use the ipv4 address, I get Connection reset by peer. Neither request seems to make it into the proxy logs like the spam requests did.

flyctl also seems to be inserting newlines into my API requests at the end of the bearer token, causing subcommands to fail.

> flyctl config save -a myapp
Error Post "https://api.fly.io/graphql": net/http: invalid header field value "Bearer replaced-string\n" for key Authorization

I’m checking out the state of your machine and I see “requested”, not “started”. However the event log shows that the last status change resulted in “started”, so I’m not sure what’s up.

What does fly m status show for you?

I’d maybe try stopping/starting the machine again.

At least I can confirm the proxy doesn’t know about your app currently since it’s not aware of any service existing (due to the machine being in a state I don’t really understand right now). I’m asking other colleagues for help!

Btw, this is, very unfortunately, expected behavior for now, due to how state is propagated from nomad services / fly machines to the proxy. If you scale to zero somehow (by deleting all the machines - stopping them doesn’t count, they still stay in the state) then the proxy loses knowledge of what your app is supposed to serve.

I’ve opened internal issues around this because it frustrates me greatly and I’d like for us to show much MUCH better diagnostics than just resetting the TCP connection. In the meantime I’ll let a machines friend investigate what happened with your machine.

fly m status unfortunately has the same client bearer token issue with the newline.

The UI shows me started.

I’d maybe try stopping/starting the machine again.

I did this via curl and confirmed that the UI and the status endpoint returned stopped, then started again. Still no change in the connection reset behavior.

Btw, this is, very unfortunately , expected behavior for now,

Good to know, thanks for the heads up. I don’t think I ever caused it to scale to zero, I’ve only been stopping/starting/updating since I created it but I’ll keep that in mind. Maybe just trying to boot up a different machine will work.

In the meantime I’ll let a machines friend investigate what happened with your machine.

Thanks, I’d really like to get this working on fly as the AWS solution is going to be far more expensive and complicated to operate.

Where is the token coming from? Is it saved to the ~/.fly/config.yml file, or the FLY_API_TOKEN env variable?

We made a fix for this recently after some people complained about the error on GitHub Actions: trim whitespace when reading FLY_ACCESS_TOKEN by michaeldwan · Pull Request #892 · superfly/flyctl · GitHub.

I checked and it’s because fly auth token returns a token\n, and that’s how I was setting the environment variable.

Hi @chrisbiscardi, sorry for the troubles you’ve hit with the new Machines feature. @amos looped me in this morning to try and figure out why the proxy is not able to route requests to your machine and it looks like the latest configuration you sent to the API did not include a service definition which is why it went missing from the proxy. The first config defined internal port 8080 to be exposed via 443 and 80 using the http and https handlers but the most recent one was completely empty.

Are you using the CLI or API to update the machine?

Another troubleshooting step you can try is running fly proxy 8080:8080 carcinisation.internal -a carcinisation to try hitting your service over 127.0.0.1:8080 locally.

Hey JP,

I’m using the API because that’s the way the docs are written. If my last request wiped out the proxy networking settings, then the fly.io docs shouldn’t remove the networking config from the update example here when the create example has it – Machines · Fly Docs

I expected to be able to update say, an unrelated field like the image, without having to send all of the settings again as the ports are unlikely to change while the image could change quite often.

The API response for the update call also doesn’t include the services key, but does include the config.init, config.env, and other keys I didn’t change. What is the intended response for this API call? Should it show what changed or is it trying to send back all of the settings? Should I treat missing, null, and empty object values as “this got unset”?

I went back and double checked everything and added some extra logs, uploaded a new docker image with the double-checked settings and I sent the full settings object in the API request and it looks like everything is up and running now.

Thanks for taking a look and your time!

That’s a good point, we’ll work to get it updated soon.

The API is declarative because otherwise, we wouldn’t really have a way to know when a user actually wanted something to be unset or added/removed. Our CLI does a very naive merge but first calling the GET endpoint before applying any changes from the user via flags.

The response contains the config as it was applied to the machine. If something is missing, like the services key, then yes, it was unset (and yes, the null/empty objects indicate the same).

This is great to hear! Please do let us know if you hit any other issues.

1 Like