Preview: health checks and alerting (deprecated)

kurt · January 14, 2021, 8:52pm

If you’re reading this now, you should know that many of these features never made it out of preview. Specifically, the Slack and PagerDuty handlers have been removed.

We built this on top of Sensu. Sensu was the wrong choice for a highly multitenant health checking system. We’d like to finish this feature someday, but for now it’s not something we are able to support.

We’re launching this next week. If you’re following along here, you get to try it first (and give us pricing feedback!).

You can now install health checks in Fly apps, and configure them to alert you via PagerDuty or Slack when they fail.

This means you can configure your apps to wake you up at all hours of the night when things go wrong. And then fix them so you can sleep the next night. Like if you’re building a PostgreSQL cluster, and somehow the cluster loses its leader, you might be greeted with:

A little DBA elbow grease and you can satisfy PagerDuty:

Configuring health checks

Health checks can be scripts, TCP connections, or HTTP tests. The fly.toml in our example PostgreSQL app looks like this:

# stop accepting new connections while existing sessions complete
kill_signal = "SIGTERM"
# allow 5 minutes to cleanly shutdown
kill_timeout = 300

[checks.master-elected]
type = "script"
interval = 5000
command = "/fly/master_elected.sh"
restart_limit = 0

This calls a script named /fly/master_elected.sh every 5 seconds. If that script exits with a 0, no alerts. If it exits with 1 or greater, alerts! Here’s what this one actually does:

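#!/bin/bash
set -e
# export the variables defined in /data/.env
export $(cat /data/.env | xargs)

status=$(stolonctl status)
# name of the current master keeper, or empty if none is elected
# (the space in "$( (" keeps this from being read as arithmetic expansion)
mk=$( (echo "$status" | grep "Master Keeper" | awk '{print $3}') || echo "")

# no master elected: print the full status and fail the check
if [ -z "$mk" ]; then
    echo "${status}"
    exit 2
fi

# third column of the master keeper's row, with the PostgreSQL port stripped
ip=$(echo "$status" | grep "^${mk}" | awk '{print $3}' | sed -e 's/:5432$//')

# this VM's fly-local-6pn address from /etc/hosts
self=$(grep fly-local-6pn /etc/hosts | cut -f 1)

if [ "$ip" == "$self" ]; then
    echo "Master: self"
    exit 0
fi

echo "Master: $ip"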
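
Because a script check only communicates through its exit status, it can be tried locally before it goes into fly.toml. The following is a minimal sketch, not part of flyctl: classify_check is a hypothetical helper name, and the "passing" / "failing" labels are illustrative, not what the checker itself reports.

```shell
#!/bin/bash
# Hypothetical helper (not part of flyctl): run a check command and report
# how an exit-status-based checker would read the result.
classify_check() {
    local code=0
    # Capture the exit status without tripping "set -e" in the caller.
    "$@" >/dev/null 2>&1 || code=$?
    if [ "$code" -eq 0 ]; then
        echo "passing"
    else
        echo "failing (exit $code)"
    fi
}

# Example usage against the script above:
# classify_check /fly/master_elected.sh
```

Any command works as the argument, so the same helper can wrap a one-liner while you are still deciding what the check should test.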

PagerDuty Handler

Configuring PagerDuty alerts is simple. Create a “Service” in PagerDuty, choose the “Sensu” integration type, and copy the key:

Then run flyctl checks handlers create --type pagerduty (you might need to flyctl version update first). It’ll prompt you for your organization and your PagerDuty integration key.

Next, go break stuff.

Slack Handler

You can also spam your favorite Slack channel when alert status changes:

Just create an incoming Slack webhook, and run flyctl checks handlers create --type slack. It’ll prompt you for the webhook URL, channel, username, and user icon URL. @hkfoster made us a delightful default icon, though, so I don’t know why you’d ever want your own …

Pricing for health checks

Do you all have a guess on how many checks you’d want to run per VM? We’re thinking about including three health checks per VM, and then charging $2/mo for up to 15. So if you configure 2 checks in your fly.toml, it’s free. If you configure 10 checks, we’d charge $2/mo for each VM you have running. These charges would be prorated to the life of the VM.
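
The arithmetic of that proposal can be sketched as below. This is only a reading of the numbers stated above: proposed_monthly_cost is a hypothetical function name, proration is ignored, and the proposal does not say what happens above 15 checks per VM, so that case is treated as an error here.

```shell
#!/bin/bash
# Monthly charge in dollars under the proposed pricing (illustrative only).
# Arguments: checks configured per VM, number of running VMs.
proposed_monthly_cost() {
    local checks_per_vm="$1" vms="$2"
    if [ "$checks_per_vm" -le 3 ]; then
        # Three checks per VM are included.
        echo 0
    elif [ "$checks_per_vm" -le 15 ]; then
        # Flat $2/mo per running VM for up to 15 checks.
        echo $(( 2 * vms ))
    else
        # Not covered by the proposal as written.
        echo "more than 15 checks per VM is not covered by the proposal" >&2
        return 1
    fi
}

# Example usage:
# proposed_monthly_cost 2 4    # 2 checks on each of 4 VMs
# proposed_monthly_cost 10 4   # 10 checks on each of 4 VMs
```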

danwetherald · January 15, 2021, 6:26am

This is AMAZING

emiliendevos · January 15, 2021, 3:45pm

Could you support Discord webhooks through Slack compatibility?

Discord allows you to use a webhook with Slack compatibility by adding /slack at the end of the URL.

It seems like it doesn’t work when adding this URL:

Error Invalid Slack webhook URL
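
For reference, here is a minimal sketch of building the Slack-compatible form of a Discord webhook URL. slack_compat_url is a hypothetical helper name, and the URL in the commented example uses placeholders, not a real webhook.

```shell
#!/bin/bash
# Hypothetical helper: append Discord's Slack-compatible suffix to a
# webhook URL, leaving URLs that already end in /slack unchanged.
slack_compat_url() {
    local url="${1%/}"   # drop a single trailing slash, if any
    case "$url" in
        */slack) printf '%s\n' "$url" ;;
        *)       printf '%s/slack\n' "$url" ;;
    esac
}

# Example (placeholder ID and token): post a Slack-style payload with curl.
# curl -sS -X POST -H 'Content-Type: application/json' \
#      -d '{"text": "health check test"}' \
#      "$(slack_compat_url "https://discord.com/api/webhooks/<id>/<token>")"
```

Posting by hand like this is one way to confirm the webhook itself works before pointing a handler at it.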

kurt · January 15, 2021, 3:48pm

That should already work! Let me do some testing, good catch.

kurt · January 15, 2021, 4:11pm

@emiliendevos give it a try now? I just relaxed the URL validation.

emiliendevos · January 15, 2021, 4:14pm

It still doesn’t work unfortunately.

Should I update flyctl?

My URL looks like this one: https://discord.com/api/webhooks/799672577307443270/oZ2Sg1_evyxLcXXfwv1yjh1itd3xZ-L8a5dvTTGKyEufCjzbGcXPH2cIX0LOEzmBpepU/slack

kurt · January 15, 2021, 4:16pm

Huh, I just used that exact URL and it worked fine. Will you try again in a few minutes? I wonder if you hit an old version of our API that hadn’t been drained yet.

Also that isn’t a real webhook URL is it? If it is we can edit it out of your post so no one else stumbles across it.

emiliendevos · January 15, 2021, 4:18pm

Don’t worry, it’s a deleted webhook that I created just for the demo.

I can try again in 10 minutes.

emiliendevos · January 15, 2021, 4:29pm

I can confirm that it’s working fine now after waiting 10 minutes.

rugwiro · February 19, 2021, 8:54am

I don’t understand the pricing on this one at all.

kurt · February 19, 2021, 3:29pm

Ahhh! Don’t worry about it, the first pricing idea we had was garbage. We’ll come up with something else.

jakob.murko · April 14, 2021, 10:15pm

Can you describe how “TCP connections, or HTTP tests” types of checks are supposed to be defined in fly.toml? Does it simply use what’s defined in [[services.tcp_checks]] and/or [[services.http_checks]] or do you need to specify a different type in [checks.my-check-name]?

Also tiny note, the Slack handler does not accept a user icon (anymore?).

kurt · April 15, 2021, 2:52am

There are two types of checks in fly.toml. The top-level ones (like checks.my-check-name) are used for alerting. The checks in [[services]] are used to control load balancing. If you want to make a top-level tcp or http check for alerting, you can change that type = "script" to type = "http" or type = "tcp" and specify port, path, etc.

I don’t love how we did services in the config, and we’d like to get rid of a bunch of nesting someday.

calpaliu · April 15, 2021, 4:02am

TLDR: free 3 health checks per VM

jakob.murko · April 15, 2021, 1:05pm

“and specify port, path, etc.” I don’t see this documented anywhere. Is the port the internal port? Does the path need to include the hostname, and if so, the internal or the external one?

Could you please provide some examples? Also, it would be ideal to have a full schema definition for fly.toml (currently it appears only as a definition field of type JSON in the GraphQL schema, i.e. definition: JSON in GraphQL Playground).

Also, I tried this:

[checks.up]
  type = "tcp"
  port = 4000

which passes parseConfig but fails with a 500 at the deployImage mutation. Please advise…

kurt · April 15, 2021, 1:15pm

Oh yes! Docs are a problem on this, it’s part of why we only soft launched these checks.

We’ll need to look into that error, it should be working.

michael · April 15, 2021, 8:12pm

@jakob.murko We only support top-level script checks, but validation wasn’t enforcing that. http & tcp checks work under services though.

DazWilkin · April 16, 2021, 2:17am

There’s likely a tension between your costs (resource consumption) and customer benefit (value of notifications|alerts).

It seems that you’d win by charging customers more for some function of the benefit, i.e. some function of notifications and the alerting services used, but this could be challenging for you to price.

Health checking seems little different from regular app functionality (i.e. pay for the resources consumed) until there’s an issue.

Perhaps a model based on paying (a fixed cost) for use of an alerting service then some amount by app (not VM) by (number of) notification channels (by number of alerts) per month?

uncvrd · November 2, 2021, 7:12am

I’m kind of curious about where we’re at with health checks for a VM, since it’s been a while now. I don’t see anything in the docs about it at the moment.

What are some suggestions for health check services in the meantime?

I saw there are healthchecks.io / gatus.io / statuscake.com / uptimerobot.com. I would love to know some favorites from the community here!

josharian · October 19, 2022, 1:18am

Any update on this? The commands in the top post to install a Slack alert appeared to work for me…but then nothing happened when I intentionally took down our service.

If it’s not working now, could you edit the top post to make that clear?

Could you share where this is on the roadmap, so people can decide how to invest their time?

Thanks!