Self-managed monitoring and alerting on Fly.io with Prometheus + Alertmanager

Alerts on Fly.io

While Fly.io currently does not have built-in alerting functionality (not entirely accurate: we do alert via e-mail when an application OOMs), the good news is that you can set up your own monitoring and alerting cluster! You can configure it to alert on almost any condition that can be monitored and expressed as a time-series metric (read: basically anything!), and send notifications to individuals or groups through a dozen mechanisms supported out of the box (Slack, e-mail, PagerDuty, push notifications, custom webhooks…).

What is this wizardry?

There are a lot of monitoring and alerting solutions you can self-host on Fly.io, with varying degrees of functionality, complexity, and user-friendliness. For this particular self-hosted alerting solution, though, I chose Prometheus with Alertmanager, for a few reasons:

  • Prometheus is one of the best-known monitoring systems / time series databases in the industry. It’s battle-tested, performant, well-documented, and has a large community around it to provide help if needed. A simple solution like the one I’ll present here can scale to handle large volumes of metrics and alerts with fault tolerance, by using Prometheus federation and multiple Alertmanager instances.
  • With the right set of exporters, Prometheus can ingest metrics about almost any aspect of your application stack. Also, most frameworks and many server-side applications provide ways of publishing Prometheus-compatible metrics directly without needing a specific exporter.
  • Prometheus configuration, while not extremely user-friendly, relies on declarative and relatively short YAML files. This makes it easy to store in source control. It can be handled with standard text manipulation tools, auto-generated, templated, and is easy to share with others to help with troubleshooting.
  • Crucially, a self-managed Prometheus instance can import data via federation from the global Fly.io Prometheus instance, which has general resource and performance metrics, and can in turn store custom metrics from your applications.
  • Alertmanager integrates naturally with Prometheus and has common configuration and operational semantics, and neatly separates the “how to handle alerts” functionality from the “when to fire an alert” concerns handled by Prometheus.
  • Despite the close integration with Prometheus, Alertmanager is relatively agnostic to the monitoring tool it handles alerts for; it can process alerts sent by other monitoring solutions, like Grafana or VictoriaMetrics, and has a simple API through which custom alerts can be triggered.

All that said, one important aspect here is this: at the center of it all, we have the global Fly.io Prometheus instance, which collects and exposes the essential metrics about your apps’ and machines’ health. Given Prometheus’s prevalence in the industry as a metrics-gathering solution, you can adapt almost any other monitoring and alerting product that understands Prometheus queries and/or federation, and build an alerting system that adapts to your needs.

Let’s get this thing working

First, let’s create a Fly.io app to host the monitoring cluster. I’m creating this in my personal organization, but you can choose any of your organizations to host it. Choose a unique name for your app! my-monitoring-cluster is just an example, and it’s already taken :wink:

fly apps create -o personal my-monitoring-cluster

Next, I’ll create a directory to store all the required files: mainly a handful of configuration and credential files, plus one script to deploy the cluster.

mkdir my-monitoring-cluster
cd my-monitoring-cluster

The first thing we need to do is write some configuration files. We need three of them across two services.

Prometheus configuration

The prometheus.yml file (its syntax is documented in the Prometheus configuration reference) tells Prometheus three main things:

  • How to federate with the Fly.io Prometheus and scrape data from it, including an authentication token.
  • Where to locate the rules file (containing the rules that fire alerts).
  • Where to send alerts fired by the rules file (Alertmanager).
  • Note: The Alertmanager hostname target looks cryptic! It’s using an internal DNS entry and leverages machine metadata, and I’ll explain it in more detail when we deploy the Alertmanager machine.
global:
  scrape_interval: 15s
  scrape_timeout: 10s
  evaluation_interval: 15s

# Where to send alerts
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            # Update the below with your app's name instead of MY-MONITORING-CLUSTER
            - alertmanager.role.kv._metadata.MY-MONITORING-CLUSTER.internal:9093

# Which alerts to fire, on which conditions
rule_files:
  - rules.yml

scrape_configs:
  # Prometheus scraping its own metrics
  - job_name: prometheus
    honor_timestamps: true
    track_timestamps_staleness: false
    static_configs:
      - targets:
          - localhost:9090

  # Where and how to get federated data from the Fly.io Prometheus
  - job_name: "flyprometheus"
    scrape_interval: 60s  # Go as high as you can so you don't kill the API!
    metrics_path: /prometheus/YOUR_ORG_NAME/federate  # Replace your org name here
    honor_labels: true
    params:
      "match[]":
        - '{__name__=~".+"}'
    scheme: https
    static_configs:
      - targets: ["api.fly.io"]
    authorization:
      type: FlyV1  # Works only with a read-only org token
      credentials_file: /etc/prometheus/fly_token

Next, rules.yml tells Prometheus when to fire alerts, and it follows the alerting rules syntax. Each rule has a Prometheus expression, and when that expression returns results (strictly, “whenever the alert expression results in one or more vector elements at a given point in time”) the alert becomes “active” for the label sets that produced those values.

In here I’ll set up a few alerts that fire when:

  • Load average is higher than 0.70 for more than 5 minutes.
  • Requests per second are higher than 20 for more than 1 minute, excluding an app called “super-high-load-app”: we know this one is happy with high loads.
  • All machines are down for more than 1 minute. Note: Readers who are well-versed in PromQL will probably take issue with this query, which is very awkward. I hear you: the problem is that when all machines in a Fly.io app are down, there are no metrics returned from them, and as you’re well aware, PromQL doesn’t do great with “nonexistent” metrics. This query suffices for demonstration purposes, but we discuss a more reliable alternative in the Caveat Emptor section later on.
groups:
- name: system-alerts
  rules:
  - alert: HighLoad
    expr: avg by (app) (fly_instance_load_average) > 0.70
    for: 5m
    labels:
      severity: notify
    annotations:
      summary: High load (over 0.7) on {{ $labels.app }}
- name: service-alerts
  rules:
  - alert: HighRPS
    expr: sum by (app) (rate(fly_app_http_responses_count{app!="super-high-load-app"}[1m])) > 20
    for: 1m
    labels:
      severity: page
    annotations:
      summary: More than 20 RPS on {{ $labels.app }}
  - alert: AllMachinesDown
    expr: group by (app) (present_over_time(fly_instance_up[20m])) unless group(fly_instance_up) by (app) == 1
    for: 1m
    labels:
      severity: page
    annotations:
      summary: No machines are up for {{ $labels.app }}

Active alerts will be tagged with the labels for which the condition is true, and the labels field in rules.yml will unconditionally add custom labels to the reported data. Alertmanager can use this information to route the alerts based on “severity”. An important thing to notice is that labels don’t have any intrinsic meaning; it’s up to Alertmanager to interpret these and decide what to do, and those behaviors can be entirely customized in the Alertmanager configuration file.


Alertmanager configuration

The alertmanager.yml file tells Alertmanager how to handle incoming alerts. As mentioned, Prometheus sends an alert event for every set of labels that activates an alert. Alertmanager can do three main things with these events, and it does all of this based purely on the labels: Alertmanager knows nothing about the semantics or meaning of alerts and their labels, which is sometimes hard to conceptualize but makes it possible to implement very complex behaviors.

  • Grouping: for example, if all our machines start getting high numbers of requests, Prometheus will fire alerts for every machine. Alertmanager can be set up to group them by application, so instead of N alerts about the same thing, a single alert summarizing the affected label sets (e.g. machines) is sent.
  • Inhibition: for example, if a single-machine database (bad idea!) goes down, an alert fires about that event, and all the app instances connecting to that database might also alert on being unable to reach it. The “database machine down” alert can inhibit the others, as it is both higher-priority and the likely cause, so only a single alert is sent.
  • Routing: based on labels, Alertmanager matches alerts against routing rules to decide where to send them. For example, alerts about database machines can go to the DBA team by e-mail on weekdays but to the on-call team via PagerDuty after hours, while alerts about networking infrastructure go to the NOC team, which is staffed 24/7 and uses a custom webhook to display alerts on an internet-connected e-ink display. That last one is entirely unlikely, but it highlights the possibilities Alertmanager offers.
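To make the routing idea concrete, here's a sketch of a routing tree like the hypothetical one described above; the team labels and receiver names are invented for illustration and are not part of our setup:

```yaml
route:
  receiver: default-team      # fallback if no route below matches
  routes:
    - match:
        team: database
      receiver: dba-email     # DBA team by e-mail
    - match:
        team: network
      receiver: noc-webhook   # NOC team via custom webhook
```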

Our very simple alertmanager.yml will:

  • Group alerts by app (all alerts raised by the same app will be sent in a single notification)
  • Send all alerts to “pushover test” which uses the Pushover service to pop notifications directly to the team’s phones. Note: Pushover offers a 30-day free trial, but if you prefer to use a different service or even just a simple webhook for testing purposes, feel free to swap the pushover receiver for any of the supported receiver services and configure accordingly.
  • If an alert fires with priority notify but there’s another alert with the same set of labels and priority page, the notify alert will not be sent. If you look at the rules.yml file above, you’ll notice the only rule with severity notify is the HighLoad alert. So if both that rule and the HighRPS alert fire, only the HighRPS alert will be sent, since the high request volume is likely the cause of the high load.
route:
  group_by: ['app']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 1h
  receiver: 'pushover test'

receivers:
  - name: "pushover test"
    pushover_configs:
      - user_key_file: /etc/alertmanager/pushover_user_key
        token_file: /etc/alertmanager/pushover_token

inhibit_rules:
  - source_match:
      severity: 'page'
    target_match:
      severity: 'notify'
    equal: ['app']

Credentials and access tokens

Once we have prepared these three configuration files (two for Prometheus, one for Alertmanager), we need to set up some credentials. Three extra files are needed. These are small text files containing authentication tokens:

  • fly_token contains the token used by Prometheus to pull data from the Fly.io Prometheus instance. Using a read-only token that’s restricted to a particular organization is safer; the token can be generated in the required format with this command: fly tokens create readonly personal | cut -f 2- -d" " > fly_token
  • pushover_user_key and pushover_token contain the user key and token, respectively, needed to send notifications via Pushover. These can be obtained by signing up at Pushover and registering an app. Alternatively, as mentioned above, one can configure a different notification integration; each will require a different set of configuration files.
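As an aside, the cut in the token command above is there because fly tokens create prints the token with a FlyV1 prefix, while Prometheus's FlyV1 authorization type supplies that prefix itself when building the header, so the file should contain only the bare token. A quick sketch with a fake token:

```shell
# `fly tokens create` prints "FlyV1 <token>"; keep only the token part
# (fake token shown here for demonstration)
echo "FlyV1 fm2_fake_token_for_demo" | cut -f 2- -d" "
# -> fm2_fake_token_for_demo
```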

Warning: Ensure the Pushover key files are NOT newline-terminated. Requests to the Pushover API will fail if the contents of the user key file or token file end with a newline character. This cost me 2 hours to figure out.
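One way to avoid the trailing-newline trap is to write the files with printf '%s' instead of echo; the key values below are placeholders for your real ones:

```shell
# printf '%s' writes its argument without a trailing newline, unlike echo
printf '%s' 'YOUR_PUSHOVER_USER_KEY' > pushover_user_key
printf '%s' 'YOUR_PUSHOVER_APP_TOKEN' > pushover_token

# Sanity check: the byte count must equal the key length exactly
# (an extra byte means a stray newline snuck in)
wc -c < pushover_user_key
# -> 22 (the length of the placeholder above)
```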

Deploying the alerting cluster

Once all the files are ready, we can deploy the cluster. The traditional way to do this would be to create a fly.toml file and use fly deploy; however, I chose to instead deploy individual machines driven by a bash script, because the alerting stack is very simple and doesn’t need to have state that persists over reboots. In detail:

  • A fly.toml file is not strictly needed since we’re not exposing this service publicly and don’t need health checks for it.
  • We don’t need a build step since we’re using off-the-shelf public images for Prometheus and Alertmanager.
  • Volume configuration is very simple; for alerting purposes we don’t need truly persistent storage, but since Prometheus needs to store files on disk, we will create a single volume to avoid filling the machine’s root filesystem. Prometheus doesn’t handle full disks very gracefully.
  • It is easier to deploy two different images under the same app this way; Prometheus and Alertmanager don’t map well to process groups since they are two distinct Docker images, and setting up one app for each seemed overkill here.
  • Scaling using fly machine clone will be easy. Both Prometheus and Alertmanager can be scaled horizontally and support clustering, to have redundancy; so if the machine where Alertmanager is running goes down, another one can be ready to receive and process alerts.
  • Using fly machine commands directly can very easily be translated to using the Machines API, meaning this can be scripted or integrated in code to automatically update the alerting configuration.
  • I like doing horrible one-liner things in bash.

Deploy Prometheus

# Replace my-monitoring-cluster with your actual app's name

# Create a 10GB volume (note this will cost about $1.50/month)
fly volumes create -a my-monitoring-cluster prometheus_data --size 10 --yes

# Start prometheus machine attaching the volume
fly machine run -a my-monitoring-cluster prom/prometheus \
    --file-local /etc/prometheus/rules.yml=rules.yml \
    --file-local /etc/prometheus/prometheus.yml=prometheus.yml \
    --file-local /etc/prometheus/fly_token=fly_token \
    --volume prometheus_data:/prometheus \
    --vm-memory=4096 --vm-cpus=2

The above command creates a machine under the pre-created app, using the official Prometheus image. It uses the --file-local mechanism to write the configuration files we prepared into the correct locations in the machine’s filesystem, mounts the 10GB volume we created earlier where Prometheus expects to store its data, and provisions slightly more memory, since the default Prometheus configuration is better tuned to a 4GB memory size. Warning: this machine configuration will cost about US$19/month. Be aware of this and scale accordingly if the cost goes beyond your budget; Prometheus can be configured to use less memory at the expense of more constant disk activity.

The output will be pretty terse, as we’re just creating the volume and starting a lone machine:

                  ID: vol_4xj8ejp3nxpm3me4
                Name: prometheus_data
                 App: my-monitoring-cluster
              Region: yey
                Zone: abcd
             Size GB: 10
           Encrypted: true
          Created at: 01 Jan 24 01:23 UTC
  Snapshot retention: 5
 Scheduled snapshots: true

Searching for image 'prom/prometheus' remotely...
image found: img_0lq747yz88546x35
Image size: 102 MB

Success! A Machine has been successfully launched in app my-monitoring-cluster
 Machine ID: 3d8d9345a49e18
 State: created

 Attempting to start machine...

==> Monitoring health checks
No health checks found

Machine started, you can connect via the following private ip

This Prometheus instance can be accessed using private networking, but since there’s generally no need to fiddle with Prometheus, I’ve found that running fly proxy 9090:9090 -a my-monitoring-cluster and then opening http://localhost:9090 suffices to play a bit with Prometheus and check the alerts configuration. If you do so now, there are two places of interest in Prometheus’s top menu bar:

  • The “Alerts” tab shows all defined alerts and whether they are inactive (healthy), pending (triggered, but not yet for long enough to fire), or firing (they’ve been sent to Alertmanager).

  • The “Status>Targets” page shows the targets Prometheus is getting data from. There should be two: prometheus (yes, Prometheus can get metrics about itself!) and flyprometheus, which is the remote instance we’re federating from. Both should show a state of “up” and they will indicate when the last scrape happened and how long it took.

With Prometheus running, let’s now deploy Alertmanager under the same app.

Deploy Alertmanager

# Replace my-monitoring-cluster with your actual app's name
fly machine run -a my-monitoring-cluster prom/alertmanager \
    --file-local /etc/alertmanager/alertmanager.yml=alertmanager.yml \
    --file-local /etc/alertmanager/pushover_user_key=pushover_user_key \
    --file-local /etc/alertmanager/pushover_token=pushover_token \
    --metadata role=alertmanager

I’m not showing the output here because it’s very similar to what the Prometheus deployment showed. The command is nearly identical to the Prometheus deploy above and uses the official Alertmanager image, although it maps its own set of three files and can get by with a smaller VM size, since Alertmanager needs far fewer resources than Prometheus.

A thing to note is the --metadata option: it assigns arbitrary key-value pairs to machines and allows us to refer to machines by their metadata attributes via Fly.io’s internal DNS. If you were wondering where the cryptic Alertmanager hostname in prometheus.yml came from, it’s constructed from the machine’s metadata.
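Concretely, the target we put in prometheus.yml is assembled as the metadata value, then the key, then kv._metadata, then the app name under .internal, plus Alertmanager's port:

```shell
# Hostname pattern: <metadata-value>.<metadata-key>.kv._metadata.<app-name>.internal
value=alertmanager           # from --metadata role=alertmanager
key=role
app=my-monitoring-cluster
echo "${value}.${key}.kv._metadata.${app}.internal:9093"
# -> alertmanager.role.kv._metadata.my-monitoring-cluster.internal:9093
```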

Testing alerts

It’s time to test the alerting service! Throw some requests at your app. I like to use hey, which is simple and gets the job done, but any load generator will do. Locust? Grafana k6? A forkbomb of curl invocations? Just make sure you don’t generate too many requests, or you might run into some rate-limiting protections. For the configuration we created, about 25 requests/second should do.

The below will fire 5 concurrent “clients”, each doing 5 requests/second, for a duration of 5 minutes. This is 25 requests/second and should be sufficient to trigger the alert.

hey -c 5 -q 5 -z 5m -disable-keepalive https://YOUR-APP-URL-HERE

While the load session is running, you can go to Prometheus’s “Alerts” tab and reload. Soon you’ll see the HighRPS alert turn yellow, which means the alert is “pending”: the condition that fires it has triggered, but recall that it won’t actually fire until it’s seen a sustained 20+ requests-per-second load for more than a minute (the for: parameter in the alert definition).

After a minute, the alert will turn red which means it’s firing; it’s now being sent to Alertmanager for processing.
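The pending-then-firing progression can be sketched with a toy loop. This is only an illustration of the for: semantics (here with 15-second evaluation steps and a 60-second for: window), not how Prometheus actually implements it:

```shell
# 1 = alert expression true at that evaluation, 0 = false
samples="1 1 1 1 1"
for_seconds=60
step=15

consecutive=0
state=inactive
for s in $samples; do
  if [ "$s" -eq 1 ]; then
    # condition held for another evaluation interval
    consecutive=$((consecutive + step))
    if [ "$consecutive" -ge "$for_seconds" ]; then
      state=firing
    else
      state=pending
    fi
  else
    # any false evaluation resets the clock
    consecutive=0
    state=inactive
  fi
done
echo "$state"
# -> firing (condition true for 75s, past the 60s "for:" window)
```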

Alertmanager will wait a further 30 seconds (the group_wait parameter) for any other alerts about the same group, so it can group and send them as one. After 30 seconds though, Alertmanager should emit the alert. In our case it will try to contact Pushover and send the alert, which looks like this:

And when expanded, it provides more detail about the alert; importantly, it shows which sets of labels triggered it.


You’re now set up to receive alerts when events happen. Alertmanager will send a “resolved” alert once Prometheus signals that the condition is no longer triggering; in our case, when the load goes under 20 requests per second. To test this, you can stop the hey process and wait a couple of minutes.

Caveat emptor

While simple to use, this solution is perhaps naïve in its configuration, and likely does not follow best practices for a Prometheus/Alertmanager deployment:

  • TSDB storage is done in a relatively small volume; Prometheus keeps data for 15 days and cycles storage after that, but if you have a large number of time series, you could run out of space. There’s a Prometheus metric you can monitor and alert on (prometheus_tsdb_storage_blocks_bytes); setting this up is left as the proverbial exercise to the reader.
  • This Prometheus deployment is envisioned as a driver for alerts, not for long-term monitoring and analysis. It’s best to treat it as ephemeral and rely on the global Fly.io Prometheus if you do need long-term metrics storage for other purposes.
  • I did not implement any kind of access control, because the services are only accessible from your organization’s private network or with the fly proxy command. Keep in mind that other apps in your organization are able to contact your Prometheus and Alertmanager. Both services can be further secured if required, though that’s out of scope for this quick guide and best left to their official and extensive documentation.
  • That awkward PromQL query for “All machines down” bears some explanation.
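Before moving on: the disk-usage alert from the first caveat could look roughly like the sketch below, added to rules.yml. The 8e9 threshold (about 8GB of our 10GB volume) is an arbitrary example; tune it to your volume size.

```yaml
- name: meta-alerts
  rules:
  - alert: TSDBAlmostFull
    # prometheus_tsdb_storage_blocks_bytes is Prometheus's own on-disk block usage
    expr: prometheus_tsdb_storage_blocks_bytes > 8e9
    for: 10m
    labels:
      severity: notify
```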

How to detect when all machines are down

The query we used to identify when all machines are down is:

  group by (app) (present_over_time(fly_instance_up[20m])) unless group(fly_instance_up) by (app) == 1

And as we mentioned above, this query is less than optimal, since it’s mostly working around the fact that when no values are emitted for a given metric, Prometheus doesn’t “see” it at all. It just so happens that when all machines in a Fly.io app are down, the app emits no metrics, so Prometheus treats the metric as “empty” at that point in time. This happens with the simple fly_instance_up metric, and also with other metrics we could use to infer the app is down, like fly_instance_net_sent_bytes.

The clunky solution presented above checks whether the app’s fly_instance_up metric was present at any point in the previous 20-minute window, then uses unless to drop any app whose fly_instance_up is non-empty right now. The result: if at least one machine was up within the last 20 minutes but all of them are down now, the expression returns 1; otherwise it returns no result at all. So the alert fires whenever the expression yields a value.
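The unless operator behaves like a set difference over label sets, which we can model with a tiny shell sketch (the app names are invented):

```shell
seen_recently="app-a app-b app-c"   # apps that emitted fly_instance_up in the last 20m
up_now="app-a app-c"                # apps emitting fly_instance_up right now

for app in $seen_recently; do
  case " $up_now " in
    *" $app "*) ;;                  # still reporting: removed by `unless`
    *) echo "$app" ;;               # seen recently but silent now: alert fires
  esac
done
# -> app-b
```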

A better solution, as recommended by Prometheus documentation, books, and tutorials, is to alert on a metric that is always present for all targets (read: apps) and has an integer value (say, the number of machines that are up); then it’s easy to alert on an exact number (alert if the value is zero).

Taking a step back and focusing on the symptom rather than the cause (and remembering that our users notice the symptoms first), we can instead monitor actual application status by querying the app via HTTP and alerting if it does not respond within a timeout or responds with an error code. In essence we’d be creating an uptime monitor like Uptime Kuma or Site24x7, but using Prometheus tooling and integration with our alerting solution.

One easy way to achieve this is to add another machine running prometheus-blackbox-exporter with its HTTP probe module, and add targets for our application’s public or internal HTTP services. The techniques we covered here can be used to craft a solution like the ones described in several posts around the Internet. The general pattern is the one we’ve been using:

  • Create a configuration file for prometheus-blackbox-exporter
  • Deploy a one-off machine with fly machine run adding the configuration file we just wrote with --file-local
  • Update your prometheus.yml to add a new scrape config for the blackbox exporter
  • Update rules.yml to fire alerts when the above config shows the service is down.
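As a rough sketch of the scrape-config step, assuming the exporter machine was deployed with --metadata role=blackbox (the module name, target URL, and port are illustrative assumptions, not part of the setup above):

```yaml
- job_name: "blackbox-http"
  metrics_path: /probe
  params:
    module: [http_2xx]               # HTTP probe module defined in the exporter's config
  static_configs:
    - targets:
        - https://my-app.example.com # the service to probe
  relabel_configs:
    # Standard blackbox-exporter relabeling: probe the target, not the exporter
    - source_labels: [__address__]
      target_label: __param_target
    - source_labels: [__param_target]
      target_label: instance
    - target_label: __address__
      replacement: blackbox.role.kv._metadata.MY-MONITORING-CLUSTER.internal:9115
```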

Where to from here?

With a Prometheus instance under your control and Alertmanager to process alerts, the sky’s the limit as to the integrations that can be set up:

  • prometheus-blackbox-exporter can be deployed in another machine under this app, and it can do things like checking arbitrary endpoints for health, for the presence or absence of specific content in responses, and more.
  • Robusta or prometheus-am-executor can be set up to take concrete actions when alerts fire; a contrived example would be restarting machines that fail their database connectivity checks, or adding machines when particular metrics surpass a threshold: getting too many requests? fire up more machines! Although if that’s your need, then metrics-based autoscaling might be a better fit.
  • One could conceivably set up a second alerting cluster on a different region and have it monitor the first alert cluster. Redundant high-availability alerting at its finest.
  • I did not explore Alertmanager’s GUI. It can be accessed using fly proxy 9093:9093 -a my-monitoring-cluster and provides some functions like alert visualization, filtering and silencing. Notably, most of Alertmanager’s functionality can be accessed via an API, so a custom on-call solution (similar to PagerDuty) can conceivably be constructed using these blocks as a backend.


We built a quick and dirty alerting cluster in minutes, using a handful of simple text files and some of Fly.io’s machine-level primitives. It has its limitations due to its simplicity, but it definitely works, and it also illustrates one important feature of the platform: the flexibility it gives users to deploy almost any service, and its reliance on industry-standard tooling like Prometheus, Grafana, NATS, Vector, and others. These features allow assembling custom solutions with relatively little effort, using existing components and prepackaged images, and leveraging the expertise of the respective communities.