Creating more than one machine on an app fails

When I try to start more than one machine on an app, the new machine is created and then immediately stops.

runner[24d89100a53987] iad [info] Pulling container image
runner[24d89100a53987] iad [info] Unpacking image
runner[24d89100a53987] iad [info] Configuring firecracker

That’s all that gets logged; my application’s own logs never appear, so I assume it never actually attempts to start.

The first machine boots successfully, but even if I destroy it, any new machine I create stops immediately.

(These machines are created over the REST API, not GraphQL, this time.)
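For reference, the create call is shaped roughly like this (the app name, image, and API host below are placeholders, not my real values):

curl -s -X POST "http://127.0.0.1:4280/v1/apps/my-machines-app/machines" \
  -H "Authorization: Bearer ${FLY_API_TOKEN}" \
  -H "Content-Type: application/json" \
  -d '{"config": {"image": "docker.io/someuser/someimage:latest"}}'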

Machine 24d89100a53987 exited with code 139 a few seconds after it started. It might’ve exited before its logs made it out of the VM. We don’t currently show that exit code anywhere other than the API, but we should!
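For context, 139 is 128 + 11, meaning the process was killed by signal 11 (SIGSEGV). You can reproduce that exit status in any shell:

sh -c 'kill -SEGV $$'; echo $?   # prints 139 (128 + SIGSEGV)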

The others on that app look like they ran longer, and then were destroyed.

A few tries ago I saw it log “Failed to pull container image”; I wonder if that’s related.

It starts successfully only intermittently, yet this same image booted every time when I was using the GraphQL endpoint.

@kurt I was able to reproduce it

runner[591857eeb6d383] iad [info] Pulling container image
runner[591857eeb6d383] iad [info] Pulling image failed
runner[591857eeb6d383] iad [info] Pulling container image
runner[591857eeb6d383] iad [info] Unpacking image
runner[591857eeb6d383] iad [info] Configuring firecracker

I’m not sure if this is related, though.

That indicates a retry. That one also stopped when the app process exited with code 139. It looks like the app process is segfaulting.

I’m seeing this spamming the logs now:

runner[06e82d55cde987] iad [info] Pulling container image
runner[06e82d55cde987] iad [info] Unpacking image
runner[39080e11fd0587] iad [info] Pulling container image
runner[39080e11fd0587] iad [info] Unpacking image
runner[73287411ce1085] iad [info] Pulling container image
runner[73287411ce1085] iad [info] Unpacking image
runner[9e784e33f3d483] iad [info] Pulling container image
runner[9e784e33f3d483] iad [info] Unpacking image
runner[73d8d9ddb71589] iad [info] Pulling container image
runner[73d8d9ddb71589] iad [info] Unpacking image
runner[06e82d55cde987] iad [info] Pulling container image
runner[06e82d55cde987] iad [info] Unpacking image
runner[39080e11fd0587] iad [info] Pulling container image
runner[39080e11fd0587] iad [info] Unpacking image
runner[73287411ce1085] iad [info] Pulling container image
runner[73287411ce1085] iad [info] Unpacking image
runner[73d8d9ddb71589] iad [info] Pulling container image
runner[73d8d9ddb71589] iad [info] Unpacking image
runner[06e82d55cde987] iad [info] Pulling container image
runner[06e82d55cde987] iad [info] Unpacking image
runner[9e784e33f3d483] iad [info] Pulling container image
runner[9e784e33f3d483] iad [info] Unpacking image

06e82d55cde987 doesn’t even seem to… exist??

Are those timestamps identical? 06e82d55cde987 is a machine on a now-deleted app with the same name. This shouldn’t affect how machines run, but I’m curious how it got attached to a deleted app.

The repeated image pulls might be Docker Hub rate limiting us. Is this a public image you’re using, or did you push it to our repository? Our repository is going to be way more reliable.
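If you want to try that, pushing to our registry looks roughly like this (app name and tag are placeholders):

flyctl auth docker
docker tag someuser/someimage:latest registry.fly.io/my-machines-app:v1
docker push registry.fly.io/my-machines-app:v1

and then reference registry.fly.io/my-machines-app:v1 as the machine image.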

It’s meant to be a public image, so I’m hosting it on Docker Hub.
Is there a way to use the Fly registry to host images that anyone on Fly can pull?

Some logs with timestamps:

2022-07-05T21:13:10.181 runner[39080e11fd0587] iad [info] Pulling container image
2022-07-05T21:13:10.242 runner[39080e11fd0587] iad [info] Unpacking image
2022-07-05T21:13:11.024 runner[06e82d55cde987] iad [info] Pulling container image
2022-07-05T21:13:11.030 runner[73d8d9ddb71589] iad [info] Pulling container image
2022-07-05T21:13:11.114 runner[06e82d55cde987] iad [info] Unpacking image
2022-07-05T21:13:11.116 runner[73d8d9ddb71589] iad [info] Unpacking image
2022-07-05T21:13:11.943 runner[9e784e33f3d483] iad [info] Pulling container image
2022-07-05T21:13:12.028 runner[9e784e33f3d483] iad [info] Unpacking image
2022-07-05T21:13:12.365 runner[73287411ce1085] iad [info] Pulling container image
2022-07-05T21:13:12.390 runner[39080e11fd0587] iad [info] Pulling container image
2022-07-05T21:13:12.430 runner[73287411ce1085] iad [info] Unpacking image
2022-07-05T21:13:12.475 runner[39080e11fd0587] iad [info] Unpacking image
2022-07-05T21:13:12.714 runner[73d8d9ddb71589] iad [info] Pulling container image
2022-07-05T21:13:12.799 runner[73d8d9ddb71589] iad [info] Unpacking image
2022-07-05T21:13:13.771 runner[06e82d55cde987] iad [info] Pulling container image
2022-07-05T21:13:13.878 runner[06e82d55cde987] iad [info] Unpacking image
2022-07-05T21:13:14.451 runner[39080e11fd0587] iad [info] Pulling container image
2022-07-05T21:13:14.537 runner[39080e11fd0587] iad [info] Unpacking image
2022-07-05T21:13:14.973 runner[9e784e33f3d483] iad [info] Pulling container image
2022-07-05T21:13:15.032 runner[9e784e33f3d483] iad [info] Unpacking image
2022-07-05T21:13:15.316 runner[73287411ce1085] iad [info] Pulling container image
2022-07-05T21:13:15.336 runner[73d8d9ddb71589] iad [info] Pulling container image
2022-07-05T21:13:15.412 runner[73d8d9ddb71589] iad [info] Unpacking image
2022-07-05T21:13:15.426 runner[73287411ce1085] iad [info] Unpacking image
2022-07-05T21:13:16.053 runner[06e82d55cde987] iad [info] Pulling container image
2022-07-05T21:13:16.128 runner[06e82d55cde987] iad [info] Unpacking image
2022-07-05T21:13:16.923 runner[9e784e33f3d483] iad [info] Pulling container image
2022-07-05T21:13:17.237 runner[9e784e33f3d483] iad [info] Unpacking image
2022-07-05T21:13:17.364 runner[73287411ce1085] iad [info] Pulling container image
2022-07-05T21:13:17.442 runner[73287411ce1085] iad [info] Unpacking image
2022-07-05T21:13:17.451 runner[39080e11fd0587] iad [info] Pulling container image
2022-07-05T21:13:17.577 runner[39080e11fd0587] iad [info] Unpacking image
2022-07-05T21:13:18.557 runner[73d8d9ddb71589] iad [info] Pulling container image
2022-07-05T21:13:18.618 runner[9e784e33f3d483] iad [info] Pulling container image
2022-07-05T21:13:18.622 runner[73d8d9ddb71589] iad [info] Unpacking image
2022-07-05T21:13:18.680 runner[9e784e33f3d483] iad [info] Unpacking image
2022-07-05T21:13:19.115 app[24d896d1a76587] iad [info] 2022/07/05 21:13:19 v1.26.1-tb3687b707-gbcd70ce32 peers: 27568/42536
2022-07-05T21:13:19.115 app[24d896d1a76587] iad [info] 2022/07/05 21:13:19 [sspZy] - Receiving keepalive packet
2022-07-05T21:13:19.125 runner[06e82d55cde987] iad [info] Pulling container image
2022-07-05T21:13:19.191 runner[06e82d55cde987] iad [info] Unpacking image
2022-07-05T21:13:19.491 runner[73287411ce1085] iad [info] Pulling container image
2022-07-05T21:13:19.567 runner[73287411ce1085] iad [info] Unpacking image
2022-07-05T21:13:19.675 runner[39080e11fd0587] iad [info] Pulling container image
2022-07-05T21:13:19.752 runner[39080e11fd0587] iad [info] Unpacking image
2022-07-05T21:13:20.655 runner[9e784e33f3d483] iad [info] Pulling container image
2022-07-05T21:13:20.729 runner[9e784e33f3d483] iad [info] Unpacking image
2022-07-05T21:13:21.021 runner[73287411ce1085] iad [info] Pulling container image
2022-07-05T21:13:21.119 runner[73287411ce1085] iad [info] Unpacking image
2022-07-05T21:13:21.579 runner[73d8d9ddb71589] iad [info] Pulling container image
2022-07-05T21:13:21.774 runner[73d8d9ddb71589] iad [info] Unpacking image
2022-07-05T21:13:21.978 runner[06e82d55cde987] iad [info] Pulling container image
2022-07-05T21:13:22.061 runner[06e82d55cde987] iad [info] Unpacking image
2022-07-05T21:13:22.465 runner[39080e11fd0587] iad [info] Pulling container image
2022-07-05T21:13:22.532 runner[39080e11fd0587] iad [info] Unpacking image

Huh, it was stuck in a loop retrying the pull. I think it might’ve just worked, though?

Using Docker Hub is fine for something you need to share. It’s a problem if something goes awry and we have to pull a bunch, but I think you’ll notice that more during development than your users will.

How would I go about killing the machine on the deleted app? I thought they would be cleaned up automatically when the app was deleted.

It should have been; we’ll get it cleaned up and figure out what went wrong.

I changed my ENTRYPOINT to launch || echo $? && sleep 10, but nothing is logged and the machine just exits again :confused:

machine id: 73d8d9ddb70589

That also exited with code 139, which is interesting. ENTRYPOINT can’t evaluate shell expressions on its own; it has to run a binary. That makes me think the “segfault” might actually be an error finding the executable it’s meant to run.

It’s also possible it didn’t pick up the new ENTRYPOINT if you used the same image tag. Maybe give it another try with a unique image tag and a real binary for entrypoint?
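If it helps, you can also sanity-check the image locally before pushing it (image tag below is a placeholder):

docker run --rm someuser/someimage:unique-tag
echo $?   # if this is also 139 locally, the binary itself is crashing rather than anything on our side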

The full ENTRYPOINT was ENTRYPOINT [ "/bin/bash", "-c", "launch || echo $? && sleep 10" ]
I’ll try using a different tag.

New tag, same result:

2022-07-05T21:32:59.144 runner[06e82d55cd3987] iad [info] Pulling container image
2022-07-05T21:32:59.161 runner[06e82d55cd3987] iad [info] Pulling image failed
2022-07-05T21:32:59.223 runner[06e82d55cd3987] iad [info] Pulling container image
2022-07-05T21:32:59.342 runner[06e82d55cd3987] iad [info] Unpacking image
2022-07-05T21:32:59.554 runner[06e82d55cd3987] iad [info] Configuring firecracker

Are these safe to mess with? I think we may be able to get some more interesting logs out of them.

Yeah, defo. There’s nothing important running on these; they’re just test machines for a tool I’m writing.
What’s on your mind?

I have a feeling there’s a bug in how we parse / use ENTRYPOINTs. And I know for sure we’re not showing you all the logs you need.

What is the base image for this? I think it might have been failing because it was sometimes landing on hardware of ours with an older host kernel (5.6); it was likely working when it landed on 5.11. More context here: Segmentation fault no matter what I have in my Dockerfile

We’re probably a week away from getting the remaining host kernels upgraded. For now, you can try an older base image and see if it runs. We’re also going to get error logs delivered more reliably; with those, this would have taken about 10 minutes to diagnose. :slight_smile:
