Monitoring page health checks

lubien · April 14, 2023, 6:09pm

When we created the Monitoring page for your app, we wanted to give you a good place to reason about possible issues your app might be going through. But we realized one thing: if your machine started but its checks didn’t pass, you wouldn’t know something terrible was happening. You’d only find out if you spotted an issue on your app or ran fly checks list, because the Monitoring page would just show a single status on each machine.

As you can see in the picture above, these machines are started, but our checks detected that two of them are not working quite right.

We added a new collapsible section to machines with checks. And just in case you don’t know, machine apps and new postgres apps come with that already. The picture above is a PG of mine and here’s an app I’ve converted from Nomad to V2:

By the way, did you know you can create custom checks? All you have to do is add a section to your fly.toml.

Let us know your thoughts on this and how we can make health checks better!

rgrove · April 14, 2023, 6:30pm

Thanks! This is a great improvement.

Do you know if there are any plans to support script_checks in Apps V2? I found this useful for monitoring the health of Sidekiq, which doesn’t provide an HTTP health endpoint, so I was sad to see that it stopped working when I migrated to V2.

lubien · April 14, 2023, 6:46pm

Thanks for your feedback.

We are not planning on building support for script_checks in Apps V2.

May I ask what else you tried for Sidekiq checking? We’d love to understand what you need to monitor with Sidekiq.

rgrove · April 14, 2023, 7:05pm

This is good to know. Thanks for the clarity!

I haven’t yet had time to explore other options for Sidekiq health checks. I was previously using script_checks to run bundle exec sidekiqmon processes and inspect its output to determine whether Sidekiq was healthy. The output looks like this:

---- Processes (1) ----
9080e6e5a3e087:534:4a42e7f24e69 [mastodon]
  Started: 2023-04-14 18:47:28 +0000 (11 minutes ago)
  Threads: 5 (0 busy)
   Queues: default, ingress, mailers, pull, push, scheduler

Now that I know script_checks isn’t coming back, I may look into running a small HTTP server process that calls sidekiqmon under the hood.
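
A minimal sketch of that idea, in Python for brevity. It assumes the header line of the sidekiqmon output always carries the process count, as in the sample above. The port, the 10-second timeout, and all function names are hypothetical, not anything Fly or Sidekiq provides.

```python
import re
import subprocess
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Tuple

# Assumption: the same command the old script check ran.
SIDEKIQMON_CMD = ["bundle", "exec", "sidekiqmon", "processes"]


def count_processes(output: str) -> int:
    """Return N from the '---- Processes (N) ----' header, or 0 if it is absent."""
    match = re.search(r"Processes \((\d+)\)", output)
    return int(match.group(1)) if match else 0


def check_sidekiq(run=subprocess.run) -> Tuple[int, str]:
    """Run sidekiqmon and map the result to an HTTP status code and body."""
    try:
        result = run(SIDEKIQMON_CMD, capture_output=True, text=True, timeout=10)
    except (OSError, subprocess.TimeoutExpired) as exc:
        return 503, f"sidekiqmon failed: {exc}\n"
    if result.returncode != 0:
        return 503, f"sidekiqmon exited with {result.returncode}\n"
    count = count_processes(result.stdout)
    if count > 0:
        return 200, f"{count} Sidekiq process(es) running\n"
    return 503, "no Sidekiq processes found\n"


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Every GET runs the check; 200 means healthy, 503 means unhealthy.
        status, body = check_sidekiq()
        payload = body.encode()
        self.send_response(status)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(port: int = 9394) -> None:
    """Start the health server; the port is a placeholder."""
    HTTPServer(("0.0.0.0", port), HealthHandler).serve_forever()
```

Calling serve() next to the Sidekiq process would expose the check over HTTP, so an HTTP health check could point at that port.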

lubien · April 14, 2023, 7:25pm

If it helps, here’s how we do health checks for postgres: postgres-flex/fly.toml at 6fa0e25bfbcc02dc1bfc4376db54bf0c9ff317cf · fly-apps/postgres-flex · GitHub

CanRau · July 31, 2023, 5:20pm

I got a KeyDB up and running via this repo: GitHub - fly-apps/keydb: KeyDB server on Fly. It works fine, and it’s also checking and connecting to peers, so I thought services.script_checks in fly.toml was working. But now I realize it’s probably hivemind that does the initial checks, so it’s probably not checking continuously, right?
[checks] from the postgres example looks interesting, though I’d need an endpoint. The quickest workaround I can think of would be to provide an API endpoint on my web app, which uses KeyDB, that checks whether everything is fine. But that feels a little weird with all the back and forth, and [[services.script_checks]] was a really nice way to do it.

What other not too complex options do I have for replacing script_checks?
I was just about to use them for a simple S3 backup strategy.

Edit: Looking into how to achieve the same from keydb/fly/start_keydb.sh at main · fly-apps/keydb (github.com) and Running Multiple Processes Inside A Fly.io App · Fly Docs

Update: haha, now I realize this is the reason it works: keydb/fly/detect_peers_periodically.sh at main · fly-apps/keydb (github.com). I’ll probably just do the same for now.

CanRau · August 2, 2023, 1:48am

What I still don’t fully understand is how health checks work in this case. I’ve already deployed a couple of erroring KeyDB instances that fly deploy considered successful, but fly logs showed that they were restarting all the time due to misconfiguration.

I initially had the following in my fly.toml. I can’t remember where it came from, maybe fly launch, as it’s not in the repo (keydb/fly.toml at main · fly-apps/keydb (github.com))?

[[services.script_checks]]
  interval = 5000
  timeout = 1000
  command = "/fly/check_ready.sh"
  restart_limit = 0

But this doesn’t work in V2 anymore, right? So how would I do a health check without an HTTP-accessible endpoint?

lubien · August 2, 2023, 8:15am

As of now we don’t have support for script_checks in Apps V2, and it’s not on our roadmap so far, so the suggestion would be to create a simple HTTP server in any language you’d like.
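
To make that concrete, here is a rough sketch (not an official Fly helper) of a tiny server that runs a check script and turns its exit code into an HTTP status. It reuses /fly/check_ready.sh from the post above and assumes the old timeout = 1000 meant milliseconds. The port and function names are made up for illustration.

```python
import subprocess
from http.server import BaseHTTPRequestHandler, HTTPServer

CHECK_COMMAND = ["/fly/check_ready.sh"]  # script from the old script check
TIMEOUT_SECONDS = 1.0  # assumption: timeout = 1000 was in milliseconds


def run_check(command=CHECK_COMMAND, timeout=TIMEOUT_SECONDS):
    """Return (http_status, body): 200 if the script exits 0, otherwise 503."""
    try:
        result = subprocess.run(
            command, capture_output=True, text=True, timeout=timeout
        )
    except subprocess.TimeoutExpired:
        return 503, "check timed out\n"
    except OSError as exc:
        # Raised when the script is missing or not executable.
        return 503, f"check could not run: {exc}\n"
    status = 200 if result.returncode == 0 else 503
    return status, result.stdout or f"exit code {result.returncode}\n"


class ScriptCheckHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Run the script on every request and report its result.
        status, body = run_check()
        payload = body.encode()
        self.send_response(status)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(port: int = 8081) -> None:
    """Start the server; 8081 is a placeholder port."""
    HTTPServer(("0.0.0.0", port), ScriptCheckHandler).serve_forever()
```

Run serve() as an extra process in the machine, then point an HTTP check at that port. Because the script runs on each request, the check stays continuous instead of only happening at boot.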

For our postgres we created this helper that makes things easier:
