Hey everyone,
I’ve been building on Fly.io for about 4 years now — started with BrickDrop (a Lego rental subscription service), and now I’m working on something new: FlareWarden.
FlareWarden detects your SSL, DNS, payment providers, CDNs, and 700+ other dependencies in under 2 minutes — then verifies from 6 continents so you stop finding out about outages from your customers.
It’s a solo project. Go monolith, server-side rendered with Data-Star for reactivity, SQLite everywhere via Turso, and Fly.io holding the whole thing together. I wanted to share what I’ve built because a lot of it would be way more complex without Fly.
This is a soft launch. Free and paid plans are available, early adopter pricing is live. I’d love your feedback at the end.
The Architecture in Brief
18 machines across Fly’s global network, organized into 3 logical regions (US-East, EU-West, AP-South). Each logical region maps to multiple Fly regions for redundancy. Single Go binary. fly deploy. That’s it.
Here’s how I’m using the platform.
Fly Replay
Users get assigned to a region at signup based on the fly-region header. After that, every authenticated request needs to hit the right regional database. One response header and Fly replays the request to the correct region for me:
```go
func (m *RegionEnforcementMiddleware) Enforce(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		cookie, err := r.Cookie("user_region")
		if err != nil || cookie.Value == "" {
			next.ServeHTTP(w, r)
			return
		}
		if cookie.Value != m.currentRegion {
			flyRegion := MapOurRegionToFlyRegion(cookie.Value)
			w.Header().Set("fly-replay", "region="+flyRegion)
			w.WriteHeader(http.StatusAccepted)
			return
		}
		next.ServeHTTP(w, r)
	})
}
```
Paired with fly.toml replay caching, the proxy remembers routing decisions for 24 hours so subsequent requests skip the middleware entirely:
```toml
[[http_service.replay_cache]]
  path_prefix = "/"
  type = "cookie"
  name = "user_region"
  ttl_seconds = 86400
```
This also handles Stripe webhooks — a webhook can land on any machine in any region, but the handler replays it to wherever that customer’s data lives. I didn’t have to build any webhook routing.
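The middleware above calls `MapOurRegionToFlyRegion`; here is a minimal sketch of what that lookup could be, assuming a static map of logical regions to primary Fly regions (the specific region codes are my assumptions, not FlareWarden's actual topology):

```go
// logicalToFly maps FlareWarden's logical regions to a primary Fly region
// that fly-replay can target. The specific codes here are assumptions.
var logicalToFly = map[string]string{
	"us-east":  "iad",
	"eu-west":  "ams",
	"ap-south": "sin",
}

// MapOurRegionToFlyRegion resolves a logical region (the cookie value) to a
// Fly region code, falling back to the app's primary region when unknown.
func MapOurRegionToFlyRegion(logical string) string {
	if fly, ok := logicalToFly[logical]; ok {
		return fly
	}
	return "ord" // primary_region fallback
}
```

A fallback matters here: a stale or malformed cookie should degrade to the primary region, not loop replays.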
HTTP/2 Backend + SSE
The dashboard updates in real-time using Server-Sent Events via Data-Star. Fly’s h2_backend = true was key here — HTTP/2 multiplexing lets SSE connections share a single TCP connection instead of eating one connection each.
```toml
[http_service.http_options]
  h2_backend = true # Enable HTTP/2 so SSE streams multiplex over one connection
```
SSE connections use a 30-second heartbeat to stay alive within Fly’s proxy timeout. The app handles up to 500 concurrent SSE connections across the fleet.
6PN (IPv6 Private Networking)
This is probably the most interesting part of the system. With 18 machines, I needed exactly one machine to run each check at any given time, without a central coordinator.
Every machine resolves its own identity via fly-local-6pn (Fly’s IPv6 private networking), announces itself to peers, and builds a topology map. Then a deterministic hash assigns each monitor check to exactly one machine within the correct GDPR region:
```go
func (ds *DeterministicScheduler) ShouldExecuteCheck(monitor *models.Monitor, monitorInternalRegion string, now time.Time) bool {
	timeSlot := canonicalTimeSlot(now, monitor.CheckInterval)
	topology := ds.topologyProvider.GetTopology()
	aliveMachines := topology.GetAliveMachinesInInternalRegion(monitorInternalRegion, 60*time.Second)
	if len(aliveMachines) == 0 {
		return false // no healthy peers in this region yet
	}
	hash := hashMonitorTimeSlot(monitor.ID, timeSlot)
	machineIndex := int(hash % uint64(len(aliveMachines)))
	responsibleMachine := aliveMachines[machineIndex]
	return responsibleMachine == ds.machineID
}
```
No Redis, no leader election, no external coordination. Machines find each other over Fly’s private network, hash the work, and agree on who does what. When a machine goes away, the topology updates and the rest pick up its checks on the next tick.
Fly Volumes
Each machine keeps Turso embedded replicas on a Fly Volume. Reads hit the local SQLite file on disk, writes go to Turso’s edge. Dashboard queries are basically free — they never leave the machine.
```toml
[[mounts]]
  source = "flare_warden_core_data"
  destination = "/data"
  initial_size = "1GB"
  auto_extend_size_threshold = 80
  auto_extend_size_increment = "1GB"
  auto_extend_size_limit = "10GB"
  snapshot_retention = 7
```
Auto-extend means I don’t get paged at 3am because a volume filled up. Seven-day snapshot retention is a nice safety net too.
Custom Domains via Fly Machines API
We have public status pages, and customers can bring their own domain. When they add one, we verify the CNAME and then call the Fly Machines API to provision a TLS certificate:
```go
const flyAPIBase = "https://api.machines.dev/v1"

func (s *FlyCertificateService) RequestCertificate(ctx context.Context, hostname string) (*FlyCertificate, error) {
	// POST to Fly Machines API to provision TLS cert for custom domain
	// ...
}

func (s *CustomDomainService) VerifyDomain(ctx context.Context, domain string) (*DomainVerificationResult, error) {
	// Verify CNAME points to {app}.fly.dev
	// Verify TXT record for domain ownership
	// Then provision certificate via Fly API
	// ...
}
```
Customer points a CNAME, we verify DNS ownership, hit the Fly API, and their status page has a valid cert. I didn’t have to think about Let’s Encrypt, renewal, or any of that — Fly just does it.
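The DNS side of that verification boils down to two pure checks. This is a sketch under my own assumptions (the TXT record format `flarewarden-verify=<token>` is invented for illustration); the real service would feed these helpers from `net.LookupCNAME` and `net.LookupTXT`:

```go
import "strings"

// cnameMatches reports whether a resolved CNAME target points at the app's
// fly.dev hostname, tolerating the trailing dot resolvers append.
func cnameMatches(resolved, app string) bool {
	return strings.TrimSuffix(strings.ToLower(resolved), ".") == app+".fly.dev"
}

// txtProvesOwnership scans TXT records for the expected verification token.
// The "flarewarden-verify=" prefix is a hypothetical record format.
func txtProvesOwnership(records []string, token string) bool {
	for _, r := range records {
		if strings.TrimSpace(r) == "flarewarden-verify="+token {
			return true
		}
	}
	return false
}
```

Keeping the matching logic separate from the lookups makes the ownership rules trivially unit-testable without touching real DNS.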
fly.toml
I think Fly doesn’t get enough credit for this. Rolling deploys, health checks, concurrency limits, TLS policy, volume mounts, metrics, security headers — it all fits in one readable file:
```toml
app = "flare-warden-core"
primary_region = "ord"

[deploy]
  strategy = "rolling"
  max_unavailable = 6 # Maintains 67% capacity with 18 machines

[http_service]
  auto_stop_machines = "off"
  min_machines_running = 4

[http_service.concurrency]
  type = "requests"
  hard_limit = 100
  soft_limit = 75

[http_service.tls_options]
  versions = ["TLSv1.2", "TLSv1.3"]

[metrics]
  port = 9091
  path = "/metrics"
```
Rolling deploys with max_unavailable = 6 mean I update a third of the fleet at a time, with health checks gating each batch. A bad deploy stops before reaching the other 12 machines. Sensitive stuff (Turso tokens, Stripe keys, encryption keys) goes in fly secrets; everything else lives in the toml. Clean split.
Prometheus Metrics + Grafana (Fly Managed)
We export 140+ Prometheus metrics on port 9091. Everything from check latency to SSE connection counts to webhook delivery rates. Fly’s managed Grafana picks them up with zero config on my end.
```go
var CheckLatency = promauto.NewHistogramVec(prometheus.HistogramOpts{
	Name:    "flarewarden_check_latency_seconds",
	Help:    "Check execution latency in seconds",
	Buckets: prometheus.DefBuckets,
}, []string{"check_type", "region"})
```
I also keep a pre-built Grafana dashboard JSON in the repo so I can redeploy it if needed.
Fly Deploy (with Depot)
Deploys are fast. Multi-stage Dockerfile: compile Go with CGO (for libsql), build Tailwind, copy to slim Debian. Depot’s remote builders keep the cache warm between deploys so most builds are just the final binary step.
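For a rough idea of the shape of that build, here is a sketch of such a multi-stage Dockerfile. Stage names, paths, and the Tailwind invocation are my assumptions, not FlareWarden's actual file:

```dockerfile
# Stage 1: compile the Go binary. CGO is required by the libsql driver
# used for Turso embedded replicas.
FROM golang:1.22 AS build
WORKDIR /src
COPY go.mod go.sum ./
RUN go mod download
COPY . .
RUN CGO_ENABLED=1 go build -o /out/flarewarden ./cmd/server

# Stage 2: build Tailwind assets.
FROM node:20 AS assets
WORKDIR /src
COPY . .
RUN npx tailwindcss -i ./assets/input.css -o /out/app.css --minify

# Stage 3: slim runtime image. ca-certificates so outbound HTTPS
# checks can verify peers.
FROM debian:bookworm-slim
RUN apt-get update && apt-get install -y --no-install-recommends ca-certificates \
    && rm -rf /var/lib/apt/lists/*
COPY --from=build /out/flarewarden /usr/local/bin/flarewarden
COPY --from=assets /out/app.css /app/static/app.css
CMD ["flarewarden"]
```

Because the Go module download and npm layers rarely change, a warm remote cache reduces most deploys to the final compile-and-copy steps.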
Partners Worth Mentioning
Tigris for object storage (service logos, data exports). S3-compatible, no egress fees, and we route uploads to the nearest Tigris region based on FLY_REGION so data stays local. Sentry for error tracking with GDPR-compliant PII sanitization before events leave the machine.
The SQLite Story (Turso + Fly)
I have to call this out. I might never have gone down the SQLite path if not for Fly’s blog post “I’m All-In on Server-Side SQLite”. I read that in mid-2022 and it changed how I think about databases. I’ve used SQLite on multiple projects since.
For FlareWarden, we run three tiers of Turso databases:

- Account databases — one per region (us-east, eu-west, ap-south) for GDPR data isolation
- System database — global, for cross-region coordination (machine heartbeats, topology)
- Monitor databases — one per monitor, created dynamically via the Turso API
```go
func (c *Client) CreateMonitorDatabase(ctx context.Context, monitorID, region string) (string, string, string, error) {
	dbName := fmt.Sprintf("mon-%s-%s", region, monitorID)
	groupName, ok := c.monitorGroupMap[region]
	// ...
}
```
A per-monitor database means real tenant isolation. Customer wants their data deleted? Drop the database. No complex purge queries across shared tables.
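The deletion path is a single call against Turso's platform API (DELETE on an organization's database). A minimal sketch of building that request, with the org and token plumbing as assumptions:

```go
import (
	"fmt"
	"net/http"
)

// deleteMonitorDatabaseRequest builds the Turso platform API call that drops
// one monitor's database. The endpoint shape follows Turso's documented
// DELETE /v1/organizations/{org}/databases/{name}.
func deleteMonitorDatabaseRequest(org, dbName, token string) (*http.Request, error) {
	url := fmt.Sprintf("https://api.turso.tech/v1/organizations/%s/databases/%s", org, dbName)
	req, err := http.NewRequest(http.MethodDelete, url, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Authorization", "Bearer "+token)
	return req, nil
}
```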
flyctl console
After a deploy, I use flyctl console to SSH into a live machine and run one-off scripts. Data migrations, incident simulation, SEO indexing. I build small CLI tools alongside the main binary (test-incident, indexnow) and just run them from there.
What FlareWarden Actually Does
You add a URL, and within 2 minutes FlareWarden has mapped your SSL certs, DNS records, and third-party dependencies. Then it watches all of it from multiple continents — if a check fails in one region, we verify from others before waking you up.
Feedback Welcome
This is early, and I want to hear from this community:
- Would you use this? What would make you switch from your current monitoring setup?
- What’s missing? Are there checks or integrations you’d expect that aren’t here?
- How’s the pitch? Does “dependency detection in under 2 minutes” resonate, or is there a better way to frame it?
- Design feedback? I’m a solo dev — the UI is all Tailwind + server-rendered HTML. Would love eyes on it.
Free plan available, no credit card required. Founding member pricing (40% lifetime discount) runs through June 2026.
Big thanks to @benbjohnson and @kurt and plenty of others here over the years. This platform lets a solo dev ship like they have an ops team behind them.