Fly Log Streams are now a lot sturdier

We recently did some work on our internal NATS cluster to make Fly log streams more reliable. Read on for the full story!

Background

The Fly Log Shipper is an app that reads a stream of your organization’s app logs and ships them to a destination you configure. This is our recommended way to forward app logs to another provider, or even to a self-hosted log-storage app where you can customize your logging setup however you wish.

Under the hood, the log stream that powers the Fly Log Shipper is built on top of open-source NATS (the Neural Autonomic Transport System). We built a custom NATS proxy within the Machines API that sets up an org-scoped subscription using Fly.io credentials and passes all messages along to the connected client. You can connect any standard NATS client to our proxy to stream logs, even the NATS CLI:

nats sub --user $ORG --password "$(fly auth token)" --server=nats://_api.internal:4223 'logs.>'
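
If you’d rather stream from code than from the CLI, the same subscription can be set up with the official nats.go client. This is a minimal sketch: the ORG and FLY_AUTH_TOKEN environment variable names are placeholders for the same org slug and fly auth token values used above, and it needs to run somewhere that can resolve _api.internal (for example over WireGuard or on a Fly Machine):

package main

import (
    "log"
    "os"

    "github.com/nats-io/nats.go"
)

func main() {
    // ORG and FLY_AUTH_TOKEN are placeholder env var names: the org slug and
    // the output of `fly auth token`, matching the CLI example above.
    nc, err := nats.Connect("nats://_api.internal:4223",
        nats.UserInfo(os.Getenv("ORG"), os.Getenv("FLY_AUTH_TOKEN")),
    )
    if err != nil {
        log.Fatal(err)
    }
    defer nc.Drain()

    // Subscribe to every log subject the proxy exposes for this organization.
    if _, err := nc.Subscribe("logs.>", func(m *nats.Msg) {
        log.Printf("%s: %s", m.Subject, m.Data)
    }); err != nil {
        log.Fatal(err)
    }

    select {} // stream until interrupted
}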

We use this same log-streaming pipeline to stream ‘Live Logs’ in the dashboard and fly logs in flyctl.

Since the NATS cluster is fully distributed across all of our servers, logs stream directly from your app to your client without any extra hops through a centralized log cluster.

This has all worked very well for our platform, and we’ve hardly needed to revisit it since its initial release back in 2021. However, a couple of issues with the NATS cluster powering these streams recently demanded our attention.

1. Broken connections between regions

First, we had been receiving intermittent reports of logs missing or suddenly disappearing from NATS streams. Restarting the NATS server on the affected host usually unblocked the stalled stream, but this whack-a-mole approach was unsustainable and led to a general loss of trust in log delivery (we even posted about redundant logging strategies as a way to guard against log-shipper failures).

After correlating one of these intermittent failures with a period of elevated packet loss in a specific region, we tracked down the root cause, an edge case in the handshake protocol, and reported it to the upstream NATS server project; the NATS team fixed the issue in less than 24 hours.

2. System-event message storms

Now all we had to do was upgrade NATS to the latest release containing the fix, which led to the next issue. Over the past several months, we had been experiencing increasingly severe platform-wide ‘NATS brownouts’: sudden, unexplained spikes in NATS messaging activity, typically correlated with a regional network outage or other networking interruptions. Although these brownouts didn’t usually trigger outages directly, they amplified existing incidents, adding extra noise and network traffic that made it harder to identify a root cause.

We hoped that upgrading NATS might fix some of these issues; the new release also included a fix for a memory leak we had been working around by restarting NATS frequently. Our initial upgrade test worked perfectly in a single region, but when we rolled it out more widely it triggered our biggest NATS brownout to date, generating 1.7 Tb/s (!) of traffic and causing an incident that overloaded a number of physical servers.

We quickly rolled back the upgrade and investigated what happened. We learned that NATS broadcasts internal system events across the entire network to help monitor the overall state of the system and support some advanced streaming features. One of these system events, STATSZ (stats summary), is broadcast to all known nodes every time a new remote node is discovered. When a node restarts and discovers a total network of n nodes, this results in up to 1+2+3+…+n, or n*(n+1)/2 events. (In our tests, restarting a single NATS service generated ~3GB of data across ~150k messages!) Restarting too many NATS services at once triggered a flood of system events that saturated our network, which filled memory buffers and triggered out-of-memory errors, causing NATS services to restart, triggering another flood of system events…
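
To make the quadratic growth concrete, here’s a rough back-of-the-envelope sketch. The node counts are hypothetical, and the per-message size is only inferred from the ~3GB / ~150k figures above:

package main

import "fmt"

func main() {
    // Rough per-message size inferred from the numbers above:
    // ~3GB across ~150k messages is roughly 20KB per STATSZ event.
    const bytesPerMsg = 20 * 1024

    // Hypothetical cluster sizes (the real node count isn't stated here).
    for _, n := range []int{100, 500, 1000} {
        events := n * (n + 1) / 2 // 1+2+...+n broadcasts for one node restart
        gb := float64(events) * bytesPerMsg / (1 << 30)
        fmt.Printf("n=%4d nodes: %7d events, ~%.1f GB per restart\n", n, events, gb)
    }
}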

Comparing notes with Synadia (the company behind NATS), we learned that our NATS cluster had far more nodes than a typical NATS installation. We investigated adjusting our topology to use leaf nodes, limiting the active cluster to a few centralized nodes, but this would have introduced additional hops for each message, doubling the latency and network usage of log subscriptions, which was not ideal.

Instead, we created a patch for Core NATS that limits the rate of STATSZ events to one per second; it will be included in the next upstream release. This reduced the data generated by a single-node restart to ~12MB, which was completely acceptable. Patch in hand, we successfully upgraded the NATS cluster on our second attempt without any trouble, and happily watched our cluster load drop to much more comfortable levels.
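
The general shape of that kind of limit is easy to sketch. The snippet below only illustrates the one-broadcast-per-interval idea; it is not the actual Core NATS patch:

package main

import (
    "fmt"
    "sync"
    "time"
)

// rateLimiter allows at most one broadcast per interval and drops the rest.
// This is an illustration of the idea only, not the NATS implementation.
type rateLimiter struct {
    mu       sync.Mutex
    last     time.Time
    interval time.Duration
}

func (l *rateLimiter) allow() bool {
    l.mu.Lock()
    defer l.mu.Unlock()
    if time.Since(l.last) < l.interval {
        return false // too soon since the last broadcast; skip this one
    }
    l.last = time.Now()
    return true
}

func main() {
    lim := &rateLimiter{interval: time.Second}
    sent := 0
    // Simulate a burst of node-discovery events arriving almost at once.
    for i := 0; i < 10000; i++ {
        if lim.allow() {
            sent++ // a real server would send a STATSZ broadcast here
        }
    }
    fmt.Println("broadcasts sent during the burst:", sent) // 1 instead of 10000
}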

There are plenty of further improvements we’re considering for our log shipper and for our internal use of NATS more broadly, but these fixes and upgrades leave our log-streaming service in a much more stable place than it was a few months ago.

Big thanks to the Synadia team for providing valuable guidance in our NATS investigations and for building such a fantastic open-source project!

Can this NATS infrastructure be used for things like KV? We were considering setting up a NATS cluster node on each machine as a cacheable KV store. Would this be unnecessary if there is access to a centralized NATS infrastructure on fly.io? Or is access and use limited to just streams for logging?

I noticed yesterday (2 July) that our app prereview-log-shipper stopped sending logs to Grafana Cloud at about 2024-06-25T15:39:02 UTC. I restarted both instances and it started working again.

I can see log entries on Fly’s Grafana (for all apps) for the time period, but there isn’t anything there to suggest a problem (other than the Network I/O graph showing next to no activity).

Is it likely to be related to the changes? I’m going to try and set up alerting on Grafana Cloud to let us know if it hasn’t received any logs recently.

For me, the log shipper stopped sending logs around 6/25 and redeploying it fixed it.
It was still doing its job on July 2nd.

Hi @thewilkybarkid and @khuez,

On 06-25, we shipped a change with an unintended side effect affecting NATS client authorization. Though we fixed the bug quickly, it’s possible that affected log shipper apps needed a restart in order to reconnect successfully. See Fly + Datadog vector error - #3 by Sam-Fly for an earlier mention of this issue. You can also find a writeup on our Infra Log.

Though this issue did impact the NATS proxy service used by the Fly Log Shipper, it was an isolated incident unrelated to the stability improvements to the underlying NATS cluster discussed here.

Today I went to check our logs in BetterStack to resolve a prod error and discovered we have absolutely no log data in there. Our logshipper is no longer shipping logs at all and hasn’t been for at least 7 days.

I have raised a support ticket, but I am not very happy. We don’t have a complex architecture yet on Fly, and just about everything has stopped working at various points, without any alerts or communication - almost always due to “improvements” that Fly have made. The only status alerts I have are from April, and they will not go away.

When you make major changes like this, do you not have a way of monitoring unusual behaviour afterwards?

I can’t believe for a second that we are the only ones who have suddenly lost all of our logs. Most of the alerts people have running off error logs will not trigger either because no logs are coming through. So they probably just haven’t noticed yet.

I’ll create an alert for no data coming through, but that does not help anyone get those logs back once they are gone.

TL;DR: for logs to get to BetterStack, a number of things must go right. I have reason to believe that the problem isn’t the delivery of messages to your log shipper instance.

This may not help, but I can attest to the fact that there is no global log outage. I have an app running in 11 regions, and I have things configured so that each instance sends a HEARTBEAT log entry; my logger app monitors for missing HEARTBEATs and issues a Sentry alert when any go missing.

This is possible because I wrote a custom log shipper. If you scroll back you will see a link to where I wrote up what led me to do so: Multiple Logs for Resiliency · The Fly Blog. Because I’m not using BetterStack, my code is probably of little use to you, but if it would help, let me know and I will share my code and data.
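
For anyone curious, a generic sketch of that heartbeat-monitoring pattern might look like the following; it’s an illustration only, not the poster’s actual code:

package main

import (
    "fmt"
    "sync"
    "time"
)

// heartbeatMonitor records the last HEARTBEAT seen per instance and flags
// any instance that has gone quiet for too long.
type heartbeatMonitor struct {
    mu       sync.Mutex
    lastSeen map[string]time.Time // instance or region -> last heartbeat time
}

func (m *heartbeatMonitor) record(instance string) {
    m.mu.Lock()
    defer m.mu.Unlock()
    m.lastSeen[instance] = time.Now()
}

func (m *heartbeatMonitor) alertStale(maxAge time.Duration) {
    m.mu.Lock()
    defer m.mu.Unlock()
    for id, t := range m.lastSeen {
        if time.Since(t) > maxAge {
            // The setup described above raises a Sentry alert at this point.
            fmt.Printf("ALERT: no HEARTBEAT from %s since %s\n", id, t.Format(time.RFC3339))
        }
    }
}

func main() {
    m := &heartbeatMonitor{lastSeen: map[string]time.Time{}}
    m.record("lhr") // call this whenever a HEARTBEAT log line arrives

    // Periodically check for instances that have stopped heartbeating.
    for range time.Tick(time.Minute) {
        m.alertStale(5 * time.Minute)
    }
}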

Thanks - the problem is indeed not with delivery; it is with the connections to the sink. Changes were made to the network topology, as seems to happen regularly. Things broke and our log shipper was left without any sink connections. A restart has fixed it, but those logs are gone. This is another network issue that has left part of our system in a dead state, unable to recover without manual intervention.

It is a (fairly) regular occurrence - across apps, databases and now logs - and we have seen it across multiple orgs and apps.

Problems happen, but when stuff like this happens, we need to be told.

You might be able to find your logs in your app’s dashboard under the “Log Search” tab.

Thanks, but no - all we have had since the 23rd of June is a daily collection event.

Prior to that date and since I restarted today, we have proper logs.

It isn’t such a major drama for us just now, but we are about to move everything over to Fly from AWS. Stuff randomly dying is very worrying. It happens regularly and makes me wonder whether I should make that leap after all.

Edit: for the record - part of this is on me, because I made some monitoring trade-offs due to what we are using Fly for right now, but the problem always seems to be that Fly make major changes to the network, break a load of stuff, and then don’t fix what is broken and don’t notify those affected.

Paying customers trusting Fly with their businesses do not expect the infrastructure to be treated like a proof of concept. If you break it, you fix it. If I break it, I fix it. Otherwise I might as well just spin up a bunch of Ubuntu servers and do it myself.

Hmm strange. My log shipper to Loki died on 6/25 and I restarted it on 7/2. But the logs are all there when I use the “Log Search” feature that was added a few months ago.

Hi @RKane,

That’s the log for the Fly Log Shipper app itself (which isn’t very chatty); you can find the logs for your other applications separately.

Let me clarify our current ‘responsibility model’ for these components. The NATS-based log-streaming service itself is part of the Fly.io platform and fully our responsibility; we actively monitor it and make sure it’s behaving properly. (If we break it, we fix it.)

The Fly Log Shipper is an open-source repository that helps you use the log-streaming service to ship logs; it’s our recommended solution for forwarding app logs to another provider, and we can help troubleshoot and debug related issues through our support channels. You deploy this app entirely on your own and can fork or customize it however you wish. However, the app instances you deploy are not actively managed or maintained by Fly.io, so things like version updates, monitoring and recovery are your responsibility. Though I would call it more mature than ‘proof of concept’, the Fly Log Shipper app is not a fully managed part of our ‘infrastructure’ at this point.

I understand what these logs are - thanks. They clearly show nothing happening during that period.

You guys need to stop with this “not managed”, worming-out-of-responsibility attitude. It seems to be the default response these days.

I made no changes to our production app
I made no changes to our logshipper deployment
I made no changes to BetterStack
All of our logs from applications on other platforms have been just fine.

You did this:

This one was on our side. We shipped a change to some auth pieces earlier today that had unintended side effects on Log Shipper apps (or any apps that connect to NATS with an OAuth token).

Our logs stopped working.

Putting this specific issue aside for a second, the bottom line is this:
When you make high-risk or breaking changes to any system components without informing the people affected by those changes, and do not have the observability coverage to understand when things have broken, you have failed Change Management 101. If there is no audit trail of changes to my system, I cannot take appropriate action in response to those changes.

Say I leave my (imaginary) collection of priceless vases in my lounge, in their display cabinet. I go away on holiday. You come round to my house uninvited and start playing football in my lounge.

When I come back and find my vases broken, you can’t say: “Yes, vases were broken, but we are only responsible for the football, and the football is fine”.

Let me be clear. NONE of the problems I have had with Fly have been due to updates, upgrades or maintenance on my part. They were ALL triggered by changes that Fly made to system components without my knowledge.

You are not just hosting To-Do apps for junior devs doing tutorials any more; you are supposed to be a serious player, taking money to provide a stable container hosting and deployment platform.

I have system components that I last deployed a decade ago that are still running fine and haven’t missed a beat. You know why? Because they are sitting on a locked-down server in a data center, and people aren’t rushing out high-risk changes every 5 minutes so they can run off and write a blog post about how super excited they are.

Still, I think this response might be the final red flag telling me to run before I get too locked into Fly.

I love what you guys are trying to do, and I may come back once you get your act together, but this recent attitude change is a big red flag for me.

As a CTO, my “responsibility model” is not to engage suppliers who cannot provide the service I am paying them for. When your changes stop my customers from using my product, we have a problem. That hasn’t happened yet, but only because we have deliberately not deployed business-critical apps to Fly yet.

Hey, this sounds frustrating indeed.

To be clear, the Grafana-based log search mentioned by @khuezy is not built on top of log-shipper or NATS. That system has been running solid and currently gives you 30 days of log retention. We do monitor and manage this service.

Could you check in there again? You can find your log entries by selecting the target app in the dropdown.

Also, we have an unmerged PR on log-shipper that can detect stalled subscriptions and restart Vector. If this seems like something useful, let me know.

In the interest of fairness, I just want to call out that I did not read/understand this message properly - my apologies.

You might be able to find your logs in your app’s dashboard under the “Log Search” tab.

I did not know about the beta release of this feature (app log retention), so I read ‘app’ as meaning the logshipper app.

It doesn’t change anything other than that I can now at least see application logs from that period. They are not completely lost. However, the broken component here is the logshipper, and everything downstream of it as a result.

My complaint is that Fly keep breaking stuff and then say “not our problem” when you question why the apps you haven’t touched no longer work.

Sorry Joshua, I didn’t see this before my latest reply, so please see there. But again - the issue here is that the logshipper (like many other things over the last year) was broken by rushing out “improvements”, breaking something which had been working perfectly well. Even then, it isn’t so much the breaking of said things (mistakes happen), it’s the “who cares? Not our problem” responses that seem to come back from support requests.

That attitude leads to a culture of not focusing on getting better at not breaking things.

Edit:

That system has been running solid and currently gives you 30 days of log retention. We do monitor and manage this service.

That is a new beta feature, so it doesn’t help the majority of people who have metrics, dashboards and alerting built on top of downstream services. Though it is great that you have introduced the feature (provided it doesn’t break everything else).

Hi @RKane,

I’m sorry this led to a frustrating experience for you. To recap: the current Fly Log Shipper code has an issue where it doesn’t automatically reconnect in some cases, so you may need to manually restart your log-shipper app after getting disconnected. This can happen due to network interruptions or other platform issues (like the one that occurred a couple of weeks ago). Joshua pointed out a potential fix that may be useful.
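
For reference, a plain NATS client can be told to keep retrying on its own. The sketch below uses the nats.go client purely as an illustration of those reconnect settings; the Fly Log Shipper itself drives Vector, so this is not a fix for the shipper (the env var names are placeholders, as in the earlier subscription example):

package main

import (
    "log"
    "os"
    "time"

    "github.com/nats-io/nats.go"
)

func main() {
    nc, err := nats.Connect("nats://_api.internal:4223",
        nats.UserInfo(os.Getenv("ORG"), os.Getenv("FLY_AUTH_TOKEN")),
        nats.RetryOnFailedConnect(true),   // keep retrying the initial connect
        nats.MaxReconnects(-1),            // never give up after a dropped connection
        nats.ReconnectWait(2*time.Second), // pause between reconnect attempts
        nats.DisconnectErrHandler(func(_ *nats.Conn, err error) {
            log.Printf("nats disconnected: %v", err)
        }),
        nats.ReconnectHandler(func(c *nats.Conn) {
            log.Printf("nats reconnected to %s", c.ConnectedUrl())
        }),
    )
    if err != nil {
        log.Fatal(err)
    }
    defer nc.Drain()

    if _, err := nc.Subscribe("logs.>", func(m *nats.Msg) {
        log.Printf("%s: %s", m.Subject, m.Data)
    }); err != nil {
        log.Fatal(err)
    }
    select {}
}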

I don’t mean to convey an ‘attitude’ or ‘culture’ change, only to clarify that Fly.io doesn’t currently offer a managed log-shipper service, only code that you deploy yourself and are responsible for managing. The lack of a managed log-shipper service is ‘our problem’, but I also want to set clear expectations about the current state of our platform: if your business-critical apps require a log-shipper service that ‘just works’ and that you don’t need to monitor (and occasionally restart) yourself, please hold off until we’ve built one.

Can you confirm whether something like the event on 6/25 seems to have occurred between 10 and 11am on 9/13? The log shipper for several projects (in several different organizations) stopped shipping logs around then. Re-deploying seems to fix the issue, but it makes me worry about continued interruptions like this.

We can of course add some heartbeat logging so that we notice when our log streams dry up, but given that the aim of the previous disruption was to make the logging pipeline more robust, is there some action that we need to take as log-shipper users to benefit from that? E.g. upgrade the version of Vector we’re using, or something similar?

There were some new changes pushed to the logs, e.g.:

2024-09-26T18:43:38.646261622 [<ID>:fc_api] The API server received a Put request on "/logger" with body "{\"log_path\":\"logs.fifo\",\"level\":\"info\"}".

This new log line appeared between the dates you mentioned: 9/17.

IMO that log line shouldn’t be logged at all; it’s just noise.

It looks to me like 9/13 was the date of the disruption, so this would have been several days after logs stopped flowing for us.