log-shippers died on Feb 24th

I have two projects in different orgs with fly-log-shipper deployed, which are configured to ship logs to backblaze.com.

Both log shippers stopped sending logs to backblaze on 24 Feb 2023.

Logging into the fly.io dashboard, I see that both instances are marked as “dead”.

  1. was there an incident on Feb 24th which might have caused this? I have checked https://status.flyio.net/ and found the incident “Machine API Unavailable on some hosts/regions”, but it’s unclear if this is related.

  2. is there a way to configure log-shipper to restart when it dies?

  3. is there any built-in alerting to detect dead services like this?

  4. does anyone have any tips for diagnosing an outage like this? The logs stored on fly (e.g. flyctl logs) do not go back far enough to cover the incident.

  3. is there any built-in alerting to detect dead services like this?

In case anyone’s interested, I’m just running flyctl apps list --json and then checking .Status of each app in my own monitoring code.
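A minimal sketch of that kind of check, in Python. Only the `flyctl apps list --json` command and the `.Status` field come from the post above; the `Name` field, the set of statuses treated as healthy, and the function names are assumptions for illustration.

```python
import json
import subprocess

# Assumption: statuses considered healthy. Adjust to what your apps actually report.
HEALTHY_STATUSES = {"running", "deployed"}


def find_unhealthy(apps_json, healthy=HEALTHY_STATUSES):
    """Return (name, status) pairs for apps whose Status is not in `healthy`."""
    unhealthy = []
    for app in json.loads(apps_json):
        status = app.get("Status", "unknown")
        if status not in healthy:
            # "Name" is an assumed field name; fall back to a placeholder if absent.
            unhealthy.append((app.get("Name", "<unnamed>"), status))
    return unhealthy


def check_fly_apps():
    """Run flyctl and return the unhealthy apps (requires flyctl to be logged in)."""
    result = subprocess.run(
        ["flyctl", "apps", "list", "--json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return find_unhealthy(result.stdout)
```

Calling `check_fly_apps()` from a cron job and sending a notification when the returned list is non-empty would give a basic “dead app” alert.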

That’s suspicious. I wonder if backblaze had an issue. The machine API incident wouldn’t have affected already running machines.

You can run fly machine status <id> to get a look at the events. You might find non-zero exit codes, which would point to Vector crashing (and suggest a backblaze issue).

One thing to check is the restart policy on the log shipper machines. You may want to set those to always.

You can run fly machine status <id> to get a look at the events. You might find non-zero exit codes, which would point to Vector crashing (and suggest a backblaze issue).

I don’t have any machines. My log shippers are deployed as apps. Could this be part of the issue?

I can’t find any documentation for restart policies with respect to Apps rather than Machines. The README at GitHub - superfly/fly-log-shipper: Ship logs from fly to other providers refers to deploying an app rather than a machine.

Can you share what you see when you run fly status?

App
  Name     = my-app-name          
  Owner    = my-org                      
  Version  = 9                            
  Status   = running                      
  Hostname = my-app-name.fly.dev  
  Platform = nomad                        

Instances
ID      	PROCESS	VERSION	REGION	DESIRED	STATUS 	HEALTH CHECKS	RESTARTS	CREATED              
abcd1234	app    	9      	xxx   	run    	running	             	0       	2023-04-03T01:23:45Z	

Weird. Your app has 0 restarts and it says it’s running but it’s “dead” on the dashboard?

Anyway, it seems you’re on the nomad platform. I’d suggest you migrate to the machines platform.
See this post on how to migrate: “fly migrate-to-v2 - Automatic migration to Apps V2”.

  2. is there a way to configure log-shipper to restart when it dies?

After the migration is done, follow this https://fly.io/docs/apps/migrate-to-v2/#make-sure-the-machines-restart-policy-suits-your-app to learn how to configure the restart policy for your machines.
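As a rough illustration of auditing this after migrating, the Python sketch below flags machines whose restart policy is not `always`. It assumes each machine is a dict shaped like the objects the Machines API returns, with the policy under `config.restart.policy`; the function name is hypothetical.

```python
def machines_without_always_restart(machines):
    """Return (machine_id, policy) for machines whose restart policy is not 'always'.

    Assumes each machine dict carries its policy at config.restart.policy,
    as in Machines API responses; a missing policy is reported as 'unset'.
    """
    flagged = []
    for machine in machines:
        config = machine.get("config") or {}
        restart = config.get("restart") or {}
        policy = restart.get("policy", "unset")
        if policy != "always":
            flagged.append((machine.get("id"), policy))
    return flagged
```

For a log shipper that should never stay down, any machine this returns is a candidate for changing the policy as described in the linked docs.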

  3. is there any built-in alerting to detect dead services like this?

Not that I’m aware of. But the machines platform has a public API that you can use to set up your own uptime checks. See “Working with the Machines API” in the Fly Docs.
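A home-grown uptime check along those lines might look like the Python sketch below. It assumes the Machines API is reachable at `https://api.machines.dev/v1`, that listing an app's machines returns a JSON array whose items have `id` and `state` fields, and that `started` is the state of a running machine. The function names and the choice of `urllib` are illustrative, not an official client.

```python
import json
import urllib.request

# Assumption: public Machines API base URL.
API_BASE = "https://api.machines.dev/v1"


def not_started(machines):
    """Return (id, state) for every machine that is not in the 'started' state."""
    return [
        (m.get("id"), m.get("state"))
        for m in machines
        if m.get("state") != "started"
    ]


def fetch_machines(app_name, token):
    """List the machines of one app; `token` is a Fly.io API token."""
    request = urllib.request.Request(
        f"{API_BASE}/apps/{app_name}/machines",
        headers={"Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)
```

Running `not_started(fetch_machines(app_name, token))` on a schedule and alerting on a non-empty result would catch a log shipper that has stopped, though it would not detect a machine that is running but no longer shipping logs.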

  4. does anyone have any tips for diagnosing an outage like this? The logs stored on fly (e.g. flyctl logs) do not go back far enough to cover the incident.

If you share the name of your log shipper app, maybe I can check and see if anything looks weird. I can’t promise I’ll find anything, but I can look.


Weird. Your app has 0 restarts and it says it’s running but it’s “dead” on the dashboard?

I restarted manually around the time that I started this thread.

If you share the name of your log shipper app, maybe I can check and see if anything looks weird. I can’t promise I’ll find anything, but I can look.

Thanks! I’ll DM you.

Or not - I can’t see a way to send a DM.
