We recently did some work on our internal NATS cluster to make Fly log streams more reliable. Read on for the full story!
Background
The Fly Log Shipper is an app that reads in a stream of your organization’s app logs and ships them to another configurable destination. This is our recommended way to forward app logs to another provider, or even to a self-hosted log-storage app where you can customize your logging setup however you wish.
Under the hood, the log stream that powers the Fly Log Shipper is built on top of open-source NATS (the Neural Autonomic Transport System). We built a custom NATS proxy within the Machines API that sets up an org-scoped subscription using Fly.io credentials and passes all messages along to the connected client. You can connect any standard NATS client to our proxy to stream logs, even the NATS CLI:
nats sub --user $ORG --password "$(fly auth token)" --server=nats://_api.internal:4223 'logs.>'
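If you'd rather consume the stream from code than from the CLI, any standard NATS client library can connect to the same proxy. Here's a minimal sketch in Go using nats.go; the environment variable names are placeholders for your org slug and a token from fly auth token, just like in the command above:

package main

import (
	"fmt"
	"os"

	"github.com/nats-io/nats.go"
)

func main() {
	// Placeholder environment variables for the org slug and Fly auth token.
	org := os.Getenv("FLY_ORG")
	token := os.Getenv("FLY_AUTH_TOKEN")

	// Connect to the log-streaming proxy, the same endpoint the CLI example uses.
	nc, err := nats.Connect("nats://_api.internal:4223", nats.UserInfo(org, token))
	if err != nil {
		panic(err)
	}
	defer nc.Close()

	// Each message on logs.> is a log line from one of the org's apps.
	if _, err := nc.Subscribe("logs.>", func(m *nats.Msg) {
		fmt.Printf("%s: %s\n", m.Subject, m.Data)
	}); err != nil {
		panic(err)
	}

	select {} // keep streaming until interrupted
}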
We use this same log-streaming pipeline to stream ‘Live Logs’ in the dashboard and fly logs in flyctl.
Since the NATS cluster is fully distributed across all of our servers, logs stream directly from your app to your client without any extra hops through a centralized log cluster.
This has all worked very well for our platform, and we’ve hardly needed to revisit it since its initial release back in 2021. However, a couple of issues with the NATS cluster powering these streams recently demanded our attention.
1. Broken connections between regions
First, we had been receiving intermittent reports of logs missing or suddenly disappearing from NATS streams. Restarting the NATS server on the affected host usually unblocked a stalled stream, but this whack-a-mole approach was unsustainable and led to a general loss of trust in log delivery (we even posted about redundant logging strategies as a way to guard against log-shipper failures).
After correlating one of these intermittent failures with a period of elevated packet loss in a specific region, we tracked the problem down to an edge case in the handshake protocol and reported it to the upstream NATS server project. The NATS team fixed the issue in less than 24 hours.
2. System-event message storms
Now all we had to do was upgrade NATS to the latest release containing the fix, which led to the next issue. Over the past several months, we had been experiencing increasingly severe platform-wide ‘NATS brownouts’: sudden, unexplained spikes in NATS messaging activity, typically correlated with a regional network outage or other networking interruptions. Although these brownouts didn’t typically trigger outages directly, they amplified existing incidents, adding extra noise and network traffic that made it harder to identify a root cause.
We hoped that upgrading NATS might fix some of these issues, including a memory leak we had been working around by restarting NATS frequently. Our initial upgrade test worked perfectly in a single region, but when we rolled the upgrade out more widely, it triggered our biggest NATS brownout to date, generating 1.7tb/s (!) of traffic and causing an incident that overloaded a number of physical servers.
We quickly rolled back the upgrade and investigated what happened. We learned that NATS broadcasts internal system events across the entire network to help monitor the overall state of the system and support some advanced streaming features. One of these system events, STATSZ (stats summary), is broadcast to all known nodes every time a new remote node is discovered. When a node restarts and discovers a total network of n nodes, this results in up to 1 + 2 + 3 + … + n, or n*(n+1)/2, events. (In our tests, restarting a single NATS service generated ~3GB of data across ~150k messages!) Restarting too many NATS services at once triggered a flood of system events that saturated our network, which filled memory buffers and triggered out-of-memory errors, causing NATS services to restart, triggering another flood of system events…
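To make the quadratic growth concrete, here's a quick back-of-the-envelope calculation in Go; the cluster sizes below are made-up examples, not our actual node counts:

package main

import "fmt"

// worstCaseStatsz returns the maximum number of STATSZ broadcasts a single
// restarting node can trigger while discovering a cluster of n peers one at
// a time: 1 + 2 + ... + n = n*(n+1)/2.
func worstCaseStatsz(n int) int {
	return n * (n + 1) / 2
}

func main() {
	// Hypothetical cluster sizes, just to show the growth curve.
	for _, n := range []int{100, 500, 1000, 2000} {
		fmt.Printf("%5d nodes -> up to %8d STATSZ events per restart\n", n, worstCaseStatsz(n))
	}
}

Each restarting node pays this cost on its own, which is part of why restarting many NATS services at once flooded the network so quickly.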
Comparing notes with Synadia (the company behind NATS), we learned that our NATS cluster had far more nodes than a typical NATS installation. We investigated adjusting our topology to use leaf nodes to limit the active cluster to a few centralized nodes, but this would have introduced additional hops for each message, doubling the latency and network usage of log subscriptions, which would not be ideal.
Instead, we created a patch for Core NATS, which will be included in the next upstream release, limiting the rate of STATSZ events to one per second. This reduced the data from a single-node restart to ~12MB, which was completely acceptable. Patch in hand, we managed to upgrade the NATS cluster on our second attempt without any trouble, and happily watched our cluster load drop to much more comfortable levels.
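The patch itself lives inside the NATS server, but the underlying idea is just a simple per-node throttle: if a STATSZ broadcast already went out within the last second, skip the next one. Here's a standalone sketch of that kind of throttle in Go; the names and structure are ours for illustration and aren't the upstream implementation:

package main

import (
	"fmt"
	"sync"
	"time"
)

// statszThrottle allows at most one STATSZ broadcast per interval. It mirrors
// the idea of the rate limit described above; it is not the nats-server patch.
type statszThrottle struct {
	mu       sync.Mutex
	last     time.Time
	interval time.Duration
}

// allow reports whether a broadcast may go out at time now, recording it if so.
func (t *statszThrottle) allow(now time.Time) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	if now.Sub(t.last) < t.interval {
		return false // a broadcast already went out this interval; skip this one
	}
	t.last = now
	return true
}

func main() {
	th := &statszThrottle{interval: time.Second}

	// Simulate a burst of "node discovered" triggers arriving 100ms apart:
	// only one STATSZ per second actually gets broadcast.
	start := time.Now()
	sent := 0
	for i := 0; i < 30; i++ {
		if th.allow(start.Add(time.Duration(i) * 100 * time.Millisecond)) {
			sent++
		}
	}
	fmt.Printf("30 discovery triggers -> %d STATSZ broadcasts\n", sent) // prints 3
}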
There are plenty of further improvements we’re thinking about for our log shipper and for our internal use of NATS more broadly, but these fixes and upgrades leave our log-streaming service in a much more stable spot than it was a few months ago.
Big thanks to the Synadia team for providing valuable guidance in our NATS investigations and for building such a fantastic open-source project!