I’ve got an instance of fly-log-shipper running in one of my orgs, for the purpose of sending logs to Datadog. It had been working nearly perfectly for a few weeks now, but a few days ago, one instance stopped appearing in it.
I have an
edge and a
prod instance, which are basically the same, barring a few scaling/sizing config changes, and some secrets differing in value. Stating the 21st, I stopped getting logs from edge. The only time log entries appear from edge is around a redeploy, and its just the deploy logs (VM shutting down, that stuff), but nothing actually from the application.
Prod has continued to log just fine during this time, running the same code.
You can see the logs fall off here, with that spike today being a deploy:
My vector config for the log-shipper is minimally changed; I just added a line to strip ansi escape codes
data_dir = "/var/lib/vector/"
enabled = true
address = "0.0.0.0:8686"
type = "internal_metrics"
type = "socket"
mode = "unix"
path = "/var/run/vector.sock"
type = "remap"
inputs = ["fly_socket"]
source = '''
. = parse_json!(.message)
.message = strip_ansi_escape_codes!(.message)
type = "prometheus_exporter" # required
inputs = ["fly_log_metrics"] # required
address = "0.0.0.0:9598" # required
default_namespace = "fly-logs" # optional, no default
type = "blackhole"
inputs = ["log_json"]
print_interval_secs = 100000
Thank you for the detailed report! We’ll do what we can to help you track this down. We have been investigating on our logging platform, and there currently seems to be a fairly steady log line rate from both -prod and -org apps. Are you still observing the problem?
We also noticed that the output from your -log-shipper app was relatively sparse. I’m not sure what would be causing this. As you pointed out, it looks like the edge app does generate logs that the log-shipper can see when it is redeployed. Checking logging output of the log-server itself for any unusual behavior might help us narrow things down further.
This doesn’t explain the cause of the gap in logs, but I can report that logs are now arriving from the -edge app
During initial investigation, I looked at the logs of the shipper, and didn’t see anything out of the ordinary.
Looking at my release history, we had several releases on the 26th, but none of which should be related to our logging output. We’ve had a dependency update to erlef/setup-beam in our github actions workflow almost every Monday, but I don’t think that would affect logging either.
Looking at our logs, it does seem like some logs are not making it to datadog, compared to when I view the pure output of
fly logs, and this is also happening for our prod instance
Background on our application:
It’s an elixir phoenix app, using Absinthe for GraphQL. Our logs are generated in JSON form by
logger_json, and then passed through the log shipper to datadog.
Our datadog configuration is fairly straightforwards, with minimal changes to the default pipeline. The only changes are the ones to the log shipper vector config, as outlined in the first post.
Could it be that the
parse_json! line is failing for some invalid nesting of JSON, and tossing the whole log, and we’re just seeing clusters of information getting tossed?