Monitoring page health checks

When we created the Monitoring page for your app, we wanted to give you a good place to reason about issues your app might be going through. But we realized one thing: if your machine started but its checks didn’t pass, you wouldn’t know something was wrong. You’d only find out if you spotted an issue in your app or ran fly checks list; the Monitoring page would just show a single :white_check_mark: on each machine.

As you can see in the picture above, these machines are started, but our checks detected that two of them are not working quite right.

We added a new collapsible section to machines with checks. And just in case you don’t know, machine apps and new Postgres apps already come with checks. The picture above is a Postgres app of mine, and here’s an app I’ve converted from Nomad to V2:

Did you know you can create custom checks, by the way? All you have to do is add a section to your fly.toml.

Let us know your thoughts on this and how we can make health checks better!


Thanks! This is a great improvement.

Do you know if there are any plans to support script_checks in Apps V2? I found this useful for monitoring the health of Sidekiq, which doesn’t provide an HTTP health endpoint, so I was sad to see that it stopped working when I migrated to V2.


Thanks for your feedback.

We are not planning to build support for script_checks in Apps V2.

May I ask what else you tried for Sidekiq checking? We’d love to understand what you need to monitor with Sidekiq.


This is good to know. Thanks for the clarity!

I haven’t yet had time to explore other options for Sidekiq health checks. I was previously using script_checks to run bundle exec sidekiqmon processes and inspect its output to determine whether Sidekiq was healthy. The output looks like this:

---- Processes (1) ----
9080e6e5a3e087:534:4a42e7f24e69 [mastodon]
  Started: 2023-04-14 18:47:28 +0000 (11 minutes ago)
  Threads: 5 (0 busy)
   Queues: default, ingress, mailers, pull, push, scheduler
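
A minimal sketch of that kind of output inspection, in Python. It assumes Sidekiq counts as healthy when the Processes header reports at least one process; the function names and the min_processes parameter are hypothetical, and a real check might also look at the Started or Threads lines.

```python
import re

# Matches the header line of the output above, e.g. "---- Processes (1) ----".
PROCESSES_HEADER = re.compile(r"^-+\s*Processes\s*\((\d+)\)\s*-+\s*$", re.MULTILINE)


def count_sidekiq_processes(output: str) -> int:
    """Return the process count from the header, or 0 if no header is found."""
    match = PROCESSES_HEADER.search(output)
    return int(match.group(1)) if match else 0


def sidekiq_is_healthy(output: str, min_processes: int = 1) -> bool:
    """Assumption: healthy means at least `min_processes` Sidekiq processes are listed."""
    return count_sidekiq_processes(output) >= min_processes
```

Feeding it the captured stdout of bundle exec sidekiqmon processes gives a boolean that a health endpoint could translate into a status code.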

Now that I know script_checks isn’t coming back, I may look into running a small HTTP server process that calls sidekiqmon under the hood.

In case it helps, here’s how we do the health check for Postgres: postgres-flex/fly.toml at 6fa0e25bfbcc02dc1bfc4376db54bf0c9ff317cf · fly-apps/postgres-flex · GitHub


I got KeyDB up and running via this repo: GitHub - fly-apps/keydb: KeyDB server on Fly. It works fine, and it’s also checking and connecting to peers, so I thought services.script_checks in fly.toml was working. Now I realize it’s probably hivemind that does the initial checks, so it’s probably not checking continuously, right?
The [checks] section from the Postgres example looks interesting, though I’d need an endpoint. The quickest workaround I can think of would be to provide an API on my web app, which uses KeyDB; this endpoint could check whether everything is fine. But that feels a little weird, with all the back and forth, and [[services.script_checks]] was a really nice way to do it.

What other not too complex options do I have for replacing script_checks?
I was just about to use them to have a simple S3 backup strategy :grimacing:

Edit: Looking into how to achieve the same from keydb/fly/start_keydb.sh at main · fly-apps/keydb (github.com) and Running Multiple Processes Inside A Fly.io App · Fly Docs

Update: haha, now I realize this is the reason it works: keydb/fly/detect_peers_periodically.sh at main · fly-apps/keydb (github.com). I’ll probably just do the same for now :fire:


What I still don’t fully understand is how health checks work in this case. I’ve already deployed a couple of erroring KeyDB instances, which fly deploy considered successful, but fly logs showed that they were restarting all the time due to misconfiguration :grimacing:

I initially had the following in my fly.toml. I can’t remember where it came from, maybe fly launch, as it’s not in the repo (keydb/fly.toml at main · fly-apps/keydb (github.com))?

[[services.script_checks]]
  interval = 5000
  timeout = 1000
  command = "/fly/check_ready.sh"
  restart_limit = 0

But this isn’t working in V2 anymore, right? So how would I do a health check without an HTTP-accessible endpoint?

As of now we don’t have support for script_checks in Apps V2, and it’s not on our roadmap so far, so the suggestion would be to create a simple HTTP server in any language you’d like.
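
To make that suggestion concrete, here is a rough sketch in Python using only the standard library. It is an illustration, not an official Fly.io helper: make_handler, serve, the 5 second timeout and port 8080 are all assumptions, and the idea is simply to run an existing check script on each request and map its exit code to 200 or 503.

```python
import subprocess
from http.server import BaseHTTPRequestHandler, HTTPServer


def make_handler(command, timeout_seconds=5.0):
    """Build a handler class that runs `command` (a list of args) on every GET."""

    class CheckHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            try:
                result = subprocess.run(
                    command, capture_output=True, timeout=timeout_seconds
                )
                healthy = result.returncode == 0
                body = result.stdout or result.stderr
            except (OSError, subprocess.TimeoutExpired) as exc:
                # A missing script or a hung check counts as unhealthy.
                healthy = False
                body = str(exc).encode()
            self.send_response(200 if healthy else 503)
            self.send_header("Content-Type", "text/plain")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, format, *args):
            pass  # suppress the default per-request log line

    return CheckHandler


def serve(command, port=8080):
    """Serve the check forever on all interfaces (port 8080 is an assumption)."""
    HTTPServer(("0.0.0.0", port), make_handler(command)).serve_forever()


# Example (hypothetical): serve(["/fly/check_ready.sh"], port=8080)
```

An HTTP check in fly.toml could then point at that port, with the server started alongside the main process.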

For our postgres we created this helper that makes things easier: