Server down: "can't run hallpass"

Got a message from my website monitoring service that my sm-beta app went down. flyctl logs shows that the container keeps restarting but isn’t coming online. There’s a new “can’t run hallpass” error that I’ve never seen before. The server seems to be starting up OK, but the process gets killed anyway. Here’s some of the logs:

2021-02-04T02:01:12.391Z 8a683ee9 vin [info] Strum Machine v333.0 now serving beta.strummachine.com on 8a683ee9
2021-02-04T02:01:32.505Z 8a683ee9 vin [info] Health check status changed to 'critical'
2021-02-04T02:01:37.329Z 8a683ee9 vin [info] Shutting down virtual machine
2021-02-04T02:01:53.091Z 8a683ee9 vin [info] Sending signal SIGTERM to main child process w/ PID 504
2021-02-04T02:01:54.094Z 8a683ee9 vin [info] Reaped child process with pid: 504 and signal: SIGTERM
2021-02-04T02:01:56.074Z 8a683ee9 vin [info] Starting instance
2021-02-04T02:01:56.111Z 8a683ee9 vin [info] Configuring virtual machine
2021-02-04T02:01:56.113Z 8a683ee9 vin [info] Pulling container image
2021-02-04T02:01:56.310Z 8a683ee9 vin [info] Unpacking image
2021-02-04T02:01:56.326Z 8a683ee9 vin [info] Preparing kernel init
2021-02-04T02:01:56.648Z 8a683ee9 vin [info] Configuring firecracker
2021-02-04T02:01:56.670Z 8a683ee9 vin [info] Starting virtual machine
2021-02-04T02:01:56.909Z 8a683ee9 vin [info] Starting init (commit: 882f7ea)...
2021-02-04T02:01:56.943Z 8a683ee9 vin [info] Running: `docker-entrypoint.sh node main.js` as root
2021-02-04T02:01:56.956Z 8a683ee9 vin [info] can't run hallpass: an unhandled IO error occurred: 0
2021-02-04T02:01:58.187Z 8a683ee9 vin [info] Health check status changed to 'passing'
2021-02-04T02:02:02.925Z 8a683ee9 vin [info] Monti APM: Successfully connected
2021-02-04T02:02:06.660Z 8a683ee9 vin [info] Starting Meteor server on 8a683ee9
2021-02-04T02:02:07.378Z 8a683ee9 vin [info] Monti APM: completed instrumenting the app
2021-02-04T02:02:07.916Z 8a683ee9 vin [info] Strum Machine v333.0 now serving beta.strummachine.com on 8a683ee9
2021-02-04T02:02:11.905Z 8a683ee9 vin [info] Health check status changed to 'critical'
2021-02-04T02:02:32.703Z 8a683ee9 vin [info] Shutting down virtual machine
2021-02-04T02:02:32.727Z 8a683ee9 vin [info] Sending signal SIGTERM to main child process w/ PID 504
2021-02-04T02:02:33.732Z 8a683ee9 vin [info] Reaped child process with pid: 504 and signal: SIGTERM
2021-02-04T02:02:37.482Z 8a683ee9 vin [info] Health check status changed to 'passing'
2021-02-04T02:02:38.145Z 8a683ee9 vin [info] Starting instance
2021-02-04T02:02:38.184Z 8a683ee9 vin [info] Configuring virtual machine
2021-02-04T02:02:38.185Z 8a683ee9 vin [info] Pulling container image
2021-02-04T02:02:38.425Z 8a683ee9 vin [info] Unpacking image
2021-02-04T02:02:38.434Z 8a683ee9 vin [info] Preparing kernel init
2021-02-04T02:02:38.769Z 8a683ee9 vin [info] Configuring firecracker
2021-02-04T02:02:38.793Z 8a683ee9 vin [info] Starting virtual machine
2021-02-04T02:02:38.996Z 8a683ee9 vin [info] Starting init (commit: 882f7ea)...
2021-02-04T02:02:39.032Z 8a683ee9 vin [info] Running: `docker-entrypoint.sh node main.js` as root
2021-02-04T02:02:39.047Z 8a683ee9 vin [info] can't run hallpass: an unhandled IO error occurred: 0
2021-02-04T02:02:45.093Z 8a683ee9 vin [info] Monti APM: Successfully connected
2021-02-04T02:02:48.808Z 8a683ee9 vin [info] Starting Meteor server on 8a683ee9
2021-02-04T02:02:49.608Z 8a683ee9 vin [info] Monti APM: completed instrumenting the app
2021-02-04T02:02:50.208Z 8a683ee9 vin [info] Strum Machine v333.0 now serving beta.strummachine.com on 8a683ee9
2021-02-04T02:02:50.697Z 8a683ee9 vin [info] Health check status changed to 'critical'
2021-02-04T02:03:15.006Z 8a683ee9 vin [info] Shutting down virtual machine
2021-02-04T02:03:15.421Z 8a683ee9 vin [info] Sending signal SIGTERM to main child process w/ PID 504
2021-02-04T02:03:16.425Z 8a683ee9 vin [info] Reaped child process with pid: 504 and signal: SIGTERM
[...and so on...]

I haven’t changed anything or touched this deployment in days so I’m not sure why it suddenly decided to start doing this.

This is just the beta site so it’s not mission-critical but it’s very concerning… would be very bad if this had happened on the prod site!

I ran flyctl deploy -i registry.fly.io/sm-server:v333.0 to see if a simple re-deploy would fix it, but it got stuck at “v7 is being deployed” and nothing showed up in flyctl logs. So I ran flyctl scale count 2 to add another instance and a new instance started up in ewr and it did start up… after THREE server boot-ups:


2021-02-04T02:31:40.234Z 81e20d7f ewr [info] Starting instance
2021-02-04T02:31:40.253Z 81e20d7f ewr [info] Configuring virtual machine
2021-02-04T02:31:40.254Z 81e20d7f ewr [info] Pulling container image
2021-02-04T02:31:44.265Z 81e20d7f ewr [info] Unpacking image
2021-02-04T02:31:50.633Z 81e20d7f ewr [info] Preparing kernel init
2021-02-04T02:31:51.010Z 81e20d7f ewr [info] Configuring firecracker
2021-02-04T02:31:51.028Z 81e20d7f ewr [info] Starting virtual machine
2021-02-04T02:31:51.141Z 81e20d7f ewr [info] Starting init (commit: 882f7ea)...
2021-02-04T02:31:51.162Z 81e20d7f ewr [info] Running: `docker-entrypoint.sh node main.js` as root
2021-02-04T02:31:51.168Z 81e20d7f ewr [info] can't run hallpass: an unhandled IO error occurred: 0
2021-02-04T02:31:53.547Z 81e20d7f ewr [info] Monti APM: Successfully connected
2021-02-04T02:31:56.503Z 81e20d7f ewr [info] Starting Meteor server on 81e20d7f
2021-02-04T02:31:56.974Z 81e20d7f ewr [info] Monti APM: completed instrumenting the app
2021-02-04T02:31:57.303Z 81e20d7f ewr [info] Strum Machine v333.0 now serving beta.strummachine.com on 81e20d7f
2021-02-04T02:32:27.289Z 81e20d7f ewr [info] Shutting down virtual machine
2021-02-04T02:32:27.310Z 81e20d7f ewr [info] Sending signal SIGTERM to main child process w/ PID 504
2021-02-04T02:32:28.316Z 81e20d7f ewr [info] Reaped child process with pid: 504 and signal: SIGTERM
2021-02-04T02:32:30.125Z 81e20d7f ewr [info] Health check status changed to 'critical'
2021-02-04T02:32:30.253Z 81e20d7f ewr [info] Starting instance
2021-02-04T02:32:30.271Z 81e20d7f ewr [info] Pulling container image
2021-02-04T02:32:30.271Z 81e20d7f ewr [info] Configuring virtual machine
2021-02-04T02:32:30.488Z 81e20d7f ewr [info] Unpacking image
2021-02-04T02:32:30.491Z 81e20d7f ewr [info] Preparing kernel init
2021-02-04T02:32:30.845Z 81e20d7f ewr [info] Configuring firecracker
2021-02-04T02:32:30.861Z 81e20d7f ewr [info] Starting virtual machine
2021-02-04T02:32:30.976Z 81e20d7f ewr [info] Starting init (commit: 882f7ea)...
2021-02-04T02:32:30.995Z 81e20d7f ewr [info] Running: `docker-entrypoint.sh node main.js` as root
2021-02-04T02:32:31.002Z 81e20d7f ewr [info] can't run hallpass: an unhandled IO error occurred: 0
2021-02-04T02:32:33.104Z 81e20d7f ewr [info] Monti APM: Successfully connected
2021-02-04T02:32:36.235Z 81e20d7f ewr [info] Starting Meteor server on 81e20d7f
2021-02-04T02:32:36.711Z 81e20d7f ewr [info] Monti APM: completed instrumenting the app
2021-02-04T02:32:37.075Z 81e20d7f ewr [info] Strum Machine v333.0 now serving beta.strummachine.com on 81e20d7f
2021-02-04T02:33:06.892Z 81e20d7f ewr [info] Shutting down virtual machine
2021-02-04T02:33:06.910Z 81e20d7f ewr [info] Sending signal SIGTERM to main child process w/ PID 504
2021-02-04T02:33:07.912Z 81e20d7f ewr [info] Reaped child process with pid: 504 and signal: SIGTERM
2021-02-04T02:33:10.121Z 81e20d7f ewr [info] Starting instance
2021-02-04T02:33:10.138Z 81e20d7f ewr [info] Configuring virtual machine
2021-02-04T02:33:10.139Z 81e20d7f ewr [info] Pulling container image
2021-02-04T02:33:10.337Z 81e20d7f ewr [info] Unpacking image
2021-02-04T02:33:10.340Z 81e20d7f ewr [info] Preparing kernel init
2021-02-04T02:33:10.682Z 81e20d7f ewr [info] Configuring firecracker
2021-02-04T02:33:10.697Z 81e20d7f ewr [info] Starting virtual machine
2021-02-04T02:33:10.828Z 81e20d7f ewr [info] Starting init (commit: 882f7ea)...
2021-02-04T02:33:10.852Z 81e20d7f ewr [info] Running: `docker-entrypoint.sh node main.js` as root
2021-02-04T02:33:10.859Z 81e20d7f ewr [info] can't run hallpass: an unhandled IO error occurred: 0
2021-02-04T02:33:13.181Z 81e20d7f ewr [info] Monti APM: Successfully connected
2021-02-04T02:33:16.158Z 81e20d7f ewr [info] Starting Meteor server on 81e20d7f
2021-02-04T02:33:16.630Z 81e20d7f ewr [info] Monti APM: completed instrumenting the app
2021-02-04T02:33:16.965Z 81e20d7f ewr [info] Strum Machine v333.0 now serving beta.strummachine.com on 81e20d7f
2021-02-04T02:33:18.345Z 81e20d7f ewr [info] Health check status changed to 'passing'
2021-02-04T02:33:25.139Z 81e20d7f ewr [info] Health check status changed to 'warning'
2021-02-04T02:33:28.227Z 81e20d7f ewr [info] Health check status changed to 'passing'

Any idea what’s going on?

(Side note: any way to grab logs going back farther in time?)

On it right now. Thanks!

1 Like

I don’t know if this is related or not, but for exploration purposes I tried suspending and resuming my sm-sandbox test server, and then it got stuck so I bumped the count up to 2 instances. It won’t start. See the bottom of these logs, where the health check on both instances is ‘passing’ before it starts failing. I doubled the grace_period just to be sure that it was having enough time to start. This is the same image that is running just fine (for the time being) on sm-prod, and was OK on sm-beta until recently…

2021-02-04T03:24:59.343Z d88e554c sjc [info] Shutting down virtual machine
2021-02-04T03:24:59.419Z d88e554c sjc [info] Sending signal SIGTERM to main child process w/ PID 504
2021-02-04T03:25:00.421Z d88e554c sjc [info] Reaped child process with pid: 504 and signal: SIGTERM
2021-02-04T03:25:02.349Z d88e554c sjc [info] Starting instance
2021-02-04T03:25:02.369Z d88e554c sjc [info] Configuring virtual machine
2021-02-04T03:25:02.371Z d88e554c sjc [info] Pulling container image
2021-02-04T03:25:02.805Z d88e554c sjc [info] Unpacking image
2021-02-04T03:25:02.810Z d88e554c sjc [info] Preparing kernel init
2021-02-04T03:25:03.201Z d88e554c sjc [info] Configuring firecracker
2021-02-04T03:25:03.279Z d88e554c sjc [info] Starting virtual machine
2021-02-04T03:25:03.406Z d88e554c sjc [info] Starting init (commit: 882f7ea)...
2021-02-04T03:25:03.430Z d88e554c sjc [info] Running: `docker-entrypoint.sh node main.js` as root
2021-02-04T03:25:03.437Z d88e554c sjc [info] can't run hallpass: an unhandled IO error occurred: 0
2021-02-04T03:25:06.115Z d88e554c sjc [info] Monti APM: Successfully connected
2021-02-04T03:25:21.725Z d88e554c sjc [info] Shutting down virtual machine
2021-02-04T03:25:21.802Z d88e554c sjc [info] Sending signal SIGTERM to main child process w/ PID 504
2021-02-04T03:25:22.804Z d88e554c sjc [info] Reaped child process with pid: 504 and signal: SIGTERM
2021-02-04T03:25:25.308Z f66021b1 vin [info] Starting instance
2021-02-04T03:25:25.345Z f66021b1 vin [info] Configuring virtual machine
2021-02-04T03:25:25.346Z f66021b1 vin [info] Pulling container image
2021-02-04T03:25:25.566Z f66021b1 vin [info] Unpacking image
2021-02-04T03:25:25.575Z f66021b1 vin [info] Preparing kernel init
2021-02-04T03:25:25.877Z f66021b1 vin [info] Configuring firecracker
2021-02-04T03:25:26.038Z f66021b1 vin [info] Starting virtual machine
2021-02-04T03:25:26.198Z f66021b1 vin [info] Starting init (commit: 882f7ea)...
2021-02-04T03:25:26.228Z f66021b1 vin [info] Running: `docker-entrypoint.sh node main.js` as root
2021-02-04T03:25:26.239Z f66021b1 vin [info] can't run hallpass: an unhandled IO error occurred: 0
2021-02-04T03:25:26.470Z d88e554c sjc [info] Health check status changed to 'passing'
2021-02-04T03:25:28.445Z 25d4d56e ewr [info] Starting instance
2021-02-04T03:25:28.462Z 25d4d56e ewr [info] Configuring virtual machine
2021-02-04T03:25:28.463Z 25d4d56e ewr [info] Pulling container image
2021-02-04T03:25:28.595Z 25d4d56e ewr [info] Unpacking image
2021-02-04T03:25:28.598Z 25d4d56e ewr [info] Preparing kernel init
2021-02-04T03:25:28.955Z 25d4d56e ewr [info] Configuring firecracker
2021-02-04T03:25:28.969Z 25d4d56e ewr [info] Starting virtual machine
2021-02-04T03:25:29.098Z 25d4d56e ewr [info] Starting init (commit: 882f7ea)...
2021-02-04T03:25:29.099Z f66021b1 vin [info] Monti APM: Successfully connected
2021-02-04T03:25:29.119Z 25d4d56e ewr [info] Running: `docker-entrypoint.sh node main.js` as root
2021-02-04T03:25:29.126Z 25d4d56e ewr [info] can't run hallpass: an unhandled IO error occurred: 0
2021-02-04T03:25:31.256Z 25d4d56e ewr [info] Monti APM: Successfully connected
2021-02-04T03:25:55.266Z f66021b1 vin [info] Starting Meteor server on f66021b1
2021-02-04T03:25:56.106Z f66021b1 vin [info] Monti APM: completed instrumenting the app
2021-02-04T03:25:56.536Z f66021b1 vin [info] Strum Machine v333.0 now serving sandbox.strummachine.com on f66021b1
2021-02-04T03:25:58.296Z 25d4d56e ewr [info] Starting Meteor server on 25d4d56e
2021-02-04T03:25:59.004Z 25d4d56e ewr [info] Monti APM: completed instrumenting the app
2021-02-04T03:25:59.396Z 25d4d56e ewr [info] Strum Machine v333.0 now serving sandbox.strummachine.com on 25d4d56e
2021-02-04T03:26:04.042Z f66021b1 vin [info] Shutting down virtual machine
2021-02-04T03:26:04.064Z f66021b1 vin [info] Sending signal SIGTERM to main child process w/ PID 504
2021-02-04T03:26:05.074Z f66021b1 vin [info] Reaped child process with pid: 504 and signal: SIGTERM
2021-02-04T03:26:05.246Z 25d4d56e ewr [info] Shutting down virtual machine
2021-02-04T03:26:05.264Z 25d4d56e ewr [info] Sending signal SIGTERM to main child process w/ PID 504
2021-02-04T03:26:06.266Z 25d4d56e ewr [info] Reaped child process with pid: 504 and signal: SIGTERM
2021-02-04T03:26:07.419Z f66021b1 vin [info] Starting instance
2021-02-04T03:26:07.455Z f66021b1 vin [info] Configuring virtual machine
2021-02-04T03:26:07.457Z f66021b1 vin [info] Pulling container image
2021-02-04T03:26:07.612Z f66021b1 vin [info] Unpacking image
2021-02-04T03:26:07.626Z f66021b1 vin [info] Preparing kernel init
2021-02-04T03:26:07.867Z f66021b1 vin [info] Configuring firecracker
2021-02-04T03:26:07.892Z f66021b1 vin [info] Starting virtual machine
2021-02-04T03:26:08.043Z f66021b1 vin [info] Starting init (commit: 882f7ea)...
2021-02-04T03:26:08.079Z f66021b1 vin [info] Running: `docker-entrypoint.sh node main.js` as root
2021-02-04T03:26:08.088Z f66021b1 vin [info] can't run hallpass: an unhandled IO error occurred: 0
2021-02-04T03:26:08.345Z 25d4d56e ewr [info] Starting instance
2021-02-04T03:26:08.361Z 25d4d56e ewr [info] Configuring virtual machine
2021-02-04T03:26:08.362Z 25d4d56e ewr [info] Pulling container image
2021-02-04T03:26:08.569Z 25d4d56e ewr [info] Unpacking image
2021-02-04T03:26:08.574Z 25d4d56e ewr [info] Preparing kernel init
2021-02-04T03:26:08.920Z 25d4d56e ewr [info] Configuring firecracker
2021-02-04T03:26:08.939Z 25d4d56e ewr [info] Starting virtual machine
2021-02-04T03:26:09.056Z 25d4d56e ewr [info] Starting init (commit: 882f7ea)...
2021-02-04T03:26:09.076Z 25d4d56e ewr [info] Running: `docker-entrypoint.sh node main.js` as root
2021-02-04T03:26:09.083Z 25d4d56e ewr [info] can't run hallpass: an unhandled IO error occurred: 0
2021-02-04T03:26:11.291Z f66021b1 vin [info] Monti APM: Successfully connected
2021-02-04T03:26:11.394Z 25d4d56e ewr [info] Monti APM: Successfully connected
2021-02-04T03:26:16.603Z f66021b1 vin [info] Health check status changed to 'passing'
2021-02-04T03:26:17.740Z 25d4d56e ewr [info] Health check status changed to 'passing'
2021-02-04T03:26:24.640Z 25d4d56e ewr [info] Health check status changed to 'critical'
2021-02-04T03:26:26.599Z f66021b1 vin [info] Health check status changed to 'warning'
2021-02-04T03:26:39.801Z f66021b1 vin [info] Health check status changed to 'critical'
2021-02-04T03:26:44.549Z f66021b1 vin [info] Shutting down virtual machine
2021-02-04T03:26:44.576Z f66021b1 vin [info] Sending signal SIGTERM to main child process w/ PID 504
2021-02-04T03:26:44.850Z 25d4d56e ewr [info] Shutting down virtual machine
2021-02-04T03:26:44.869Z 25d4d56e ewr [info] Sending signal SIGTERM to main child process w/ PID 504
2021-02-04T03:26:45.580Z f66021b1 vin [info] Reaped child process with pid: 504 and signal: SIGTERM
2021-02-04T03:26:45.871Z 25d4d56e ewr [info] Reaped child process with pid: 504 and signal: SIGTERM

We’re still investigating this, so far it looks like two things happened:

  1. The host your app was on failed around when your app went down
  2. New VMs kept failing to come up
  3. The “hallpass” error is new, but unrelated

Most of these issues were in our VIN region, I’ve moved your app out of there and it seems to be working pretty well.

I’m digging through logs to see why it wasn’t coming up. You can see older logs per instance by running:

flyctl status --all
flyctl logs -i <id>

Out of curiosity, do your health checks trigger external network connections/requests?

Thanks for being so responsive!

The health check is just an HTTP to my /healthz url, which is a static server route that always returns 200 OK. No external connections; it shows a different message if it can’t connect to the database server, but it still returns 200 OK (and the health check doesn’t look at the contents of the response).

Just a quick update:

The hallpass thing shouldn’t have affected your app, beyond spitting some annoying logs. But just in case you’re interested:

hallpass is our new SSH server thing. It’s written in Go. The CI/CD build process for it missed a CGO_ENABLED=0 — in other words, it was allowed to build with CGO enabled, which, on Linux, for things that use the DNS, creates a dynamically linked executable. Linked, of course, to a library your instance didn’t have.

This should be fixed fleetwide. We’re still digging into what happened with VIN.

You’re right: it’s not OK to not be able to start new instances, or to have to try a whole bunch of times to get it working!