Launch: health checks and alerting (help us with pricing model)

We’re launching this next week; if you’re following along here, you get to try it first (and give us pricing feedback!).

You can now install health checks in Fly apps, and configure them to alert you via PagerDuty or Slack when they fail.

This means you can configure your apps to wake you up at all hours of the night when things go wrong. And then fix them so you can sleep the next night. Like if you’re building a PostgreSQL cluster, and somehow the cluster loses its leader, you might be greeted with:

A little DBA elbow grease and you can satisfy PagerDuty:

Configuring health checks

Health checks can be scripts, TCP connections, or HTTP tests. The fly.toml in our example PostgreSQL app looks like this:

# stop accepting new connections while existing sessions complete
kill_signal = "SIGTERM"
# allow 5 minutes to cleanly shutdown
kill_timeout = 300

[checks.master-elected]
type = "script"
interval = 5000
command = "/fly/master_elected.sh"
restart_limit = 0

This calls a script named /fly/master_elected.sh every 5 seconds. If that script exits with a 0, no alerts. If it exits with 1 or greater, alerts! Here’s what this one actually does:

#!/bin/bash
set -e

# load the cluster connection settings (stolonctl reads these from the environment)
export $(cat /data/.env | xargs)

status=$(stolonctl status)

# extract the master keeper's ID; empty if no master has been elected
mk=$( (echo "$status" | grep "Master Keeper" | awk '{print $3}') || echo "" )

if [ -z "$mk" ]; then
    echo "${status}"
    exit 2
fi

# resolve the master keeper's IP, stripping the :5432 port suffix
ip=$(echo "$status" | grep "^${mk}" | awk '{print $3}' | sed -e 's/:5432$//')

# this VM's private (6PN) address, from /etc/hosts
self=$(grep fly-local-6pn /etc/hosts | cut -f 1)

if [ "$ip" == "$self" ]; then
    echo "Master: self"
    exit 0
fi

echo "Master: $ip"
exit 0

PagerDuty Handler

Configuring PagerDuty alerts is simple. Create a “Service” in PagerDuty, choose the “Sensu” integration type, and copy the key:

Then run flyctl handlers create --type pagerduty (you might need to flyctl version update first). It’ll prompt you for your organization and your PagerDuty integration key.

Next, go break stuff.

Slack Handler

You can also spam your favorite Slack channel when alert status changes:

Just create an incoming Slack webhook, and run flyctl checks handlers create --type slack. It’ll prompt you for the webhook URL, channel, username, and user icon URL. @kyle made us a delightful default icon, though, so I don’t know why you’d ever want your own …


Pricing for health checks

Do you all have a guess at how many checks you’d want to run per VM? We’re thinking about including three health checks per VM, then charging $2/mo for up to 15. So if you configure 2 checks in your fly.toml, it’s free. If you configure 10 checks, we’d charge $2/mo per VM you have running. These would be prorated to the life of the VM.


This is AMAZING :heart_eyes:


Could you support Discord webhooks through Slack compatibility?

Discord allows you to use a webhook with Slack compatibility by adding /slack to the end of the URL.
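As a sketch, that transformation is just a suffix on the webhook URL. The `<id>` and `<token>` below are placeholders, not a real webhook:

```shell
# Hypothetical Discord webhook URL; <id> and <token> are placeholders.
discord_url="https://discord.com/api/webhooks/<id>/<token>"

# Appending /slack gives Discord's Slack-compatible endpoint, which
# accepts Slack-format payloads such as {"text": "..."}.
slack_url="${discord_url}/slack"

echo "$slack_url"
```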

It seems like it doesn’t work when adding this URL:

Error Invalid Slack webhook URL

That should already work! Let me do some testing, good catch.

@emiliendevos give it a try now? I just relaxed the URL validation.

It still doesn’t work unfortunately.

Should I update flyctl?

My URL looks like this one: https://discord.com/api/webhooks/799672577307443270/oZ2Sg1_evyxLcXXfwv1yjh1itd3xZ-L8a5dvTTGKyEufCjzbGcXPH2cIX0LOEzmBpepU/slack

Huh, I just used that exact URL and it worked fine. Will you try again in a few minutes? I wonder if you hit an old version of our API that hadn’t been drained yet.

Also that isn’t a real webhook URL is it? If it is we can edit it out of your post so no one else stumbles across it.

Don’t worry, it’s a deleted webhook that I created just for the demo.

I can try again in 10 minutes.


I can confirm it’s working fine now after waiting 10 minutes.


I don’t understand the pricing on this one :ok_man:t5: at all.

Ahhh! Don’t worry about it, the first pricing idea we had was garbage. We’ll come up with something else. :slight_smile:


Can you describe how “TCP connections, or HTTP tests” types of checks are supposed to be defined in fly.toml? Does it simply use what’s defined in [[services.tcp_checks]] and/or [[services.http_checks]] or do you need to specify a different type in [checks.my-check-name]?

Also tiny note, the Slack handler does not accept a user icon (anymore?).

There are two types of checks in fly.toml: the top-level ones (like checks.my-check-name here) are used for alerting, while the checks in [[services]] are used to control load balancing. If you want to make a top-level tcp or http check for alerting, you can change that type = "script" to type = "http" or type = "tcp" and specify port, path, etc.

I don’t love how we did services in the config, and we’d like to get rid of a bunch of nesting someday.

TL;DR: 3 free health checks per VM

“and specify port, path, etc.” - I don’t see this documented anywhere? Is the port the internal port? Does the path need to include the hostname, and if so, internal or external?

Could you please provide some examples? Also, it would be ideal to have an entire schema definition for fly.toml (currently it is only defined as a definition field with type JSON in the GraphQL schema; definition: JSON when looking at GraphQL Playground).

Also, I did try this:

[checks.up]
  type = "tcp"
  port = 4000

which passes parseConfig but fails with a 500 at the deployImage mutation. Please advise…

Oh yes! Docs are a problem on this, it’s part of why we only soft launched these checks. :smiley:

We’ll need to look into that error, it should be working.

@jakob.murko We only support top-level script checks, but validation wasn’t enforcing that. http and tcp checks work under services, though.
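For anyone landing here, a service-level HTTP check might look something like the sketch below. Field names follow the fly.toml conventions used earlier in the thread (intervals in milliseconds, restart_limit); treat the port and path values as assumptions to adjust for your own app:

```toml
[[services]]
  internal_port = 8080
  protocol = "tcp"

  # service-level HTTP check: used for load balancing, and (per the above)
  # the supported place for http/tcp checks
  [[services.http_checks]]
    interval = 10000      # milliseconds, as in the script check example
    timeout = 2000
    method = "get"
    path = "/healthz"     # assumed health endpoint for your app
    restart_limit = 0
```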


There’s likely a tension between your costs (resource consumption) and customer benefit (the value of notifications/alerts).

It seems that you’d win by charging customers for some function of the benefit, i.e. some function of notifications and the alerting services used, but this could be challenging for you to price.

Health checking seems little different from regular app functionality (i.e. pay for the resources consumed) until there’s an issue.

Perhaps a model based on paying a fixed cost for use of an alerting service, then some amount per app (not per VM), by number of notification channels and number of alerts, per month?

I’m kind of curious where we’re at with health checks for a VM, since it’s been a while now? I don’t see anything in the docs regarding it at the moment.

What are some suggestions for health check services in the meantime?

I saw there’s healthchecks.io, gatus.io, statuscake.com, and uptimerobot.com; I’d love to hear some favorites from the community here!