Logs just stopped flowing

Hello

After yesterday’s fix of #3068 everything worked fine, but now logging from my application has just stopped.
I use the log shipper to stream logs into Logflare, and the last message it shipped was at 2021-11-11T21:15:17Z. I see the same cutoff in the flyctl logs output.

I’m sure that my application is working because it answers over HTTP.

PS:
After a flyctl restart, logging resumed both in flyctl logs and in Logflare.
I’ll update this thread if the trouble returns.

Yes, it looks like the logs stop after about 4 hours.
The restart was at 2021-11-11T22:36:05Z and the last message I got was at 2021-11-12T02:27:05Z.

The application is still alive and answers over HTTP.

To confirm, are you seeing an absence of logs in logflare as well at the same time? I’m trying to make sure the problem is that logs are not coming out of Fly at all, as opposed to the fly logs connection being cut.

absence of logs in logflare as well at the same time

Yes, that’s correct.

To see what’s coming to the log-shipper, you can fly ssh console into it and run vector tap — that should provide some additional information.
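For reference, a rough sketch of those two steps, using the shipper app name that appears later in this thread; the second command is typed at the shell you get inside the VM:

❯ flyctl ssh console -a checker-logshipper    # open a shell inside the log shipper VM
vector tap                                    # inside that shell: print events as they pass through Vector

If vector tap shows events arriving but nothing reaches Logflare, the problem is on the sink side; if it shows nothing at all, the shipper isn’t receiving logs in the first place.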

Could you post the app ID as well? We can investigate on the logging system side.

It stopped again; this time the first message was at 2021-11-12T10:45:41Z and the last at 2021-11-12T15:18:41Z.

The main app ID is d31870f3.
The log shipper’s ID is be7fda82.
vector tap in the log shipper VM does not produce any output; it just hangs.
After deploying a new version of the app, logging resumed in flyctl logs, Logflare, and vector tap.

So I presume the problem is in your logging subsystem; it overflows after 4-4.5 hours, or something like that.

It probably doesn’t overflow; there’s no real log queue. But it’s possible the messages get lost in transit. It’s also possible the log VM gets wedged somehow.

Some things to try:

  1. Make sure you’re running the log collection VM in the same region as at least one app instance.
  2. Try configuring log collection with a different sink. If the VM that ships logs gets hung, the specific log destination might be what’s causing it. (A rough sketch of both checks follows this list.)
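A rough sketch of both checks, using the app names from this thread; the secret name for an alternate sink is only a placeholder here, since the exact variables are defined by the fly-log-shipper README:

❯ flyctl regions list -a checker-logshipper          # confirm the shipper runs where the app runs (ams)
❯ flyctl regions set ams -a checker-logshipper       # pin it to ams if it doesn't
❯ flyctl secrets set HTTP_SINK_URL=https://collector.example.com -a checker-logshipper   # placeholder secret: point Vector at a different destination

Setting a secret redeploys the shipper, so give it a few hours afterwards to see whether the new destination also goes quiet.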

It’s interesting that restarting the VM fixes it. That likely means our infrastructure is working properly and could mean Vector + Logflare are doing something strange here.

OK, but if the problem is Vector/Logflare related, why is flyctl logs stuck on the same log line as well?
After the last restart the app got ID 0877076f, and I got literally just a few log lines,
and that’s all.
The application is running and answers over HTTP.

2021-11-12T20:49:53.443 app[881d9f88] ams [info] Preparing to run: sh -c /entrypoint.sh as root
2021-11-12T20:49:53.450 app[881d9f88] ams [info] 2021/11/12 20:49:53 listening on [fdaa:0:393b:a7b:23c8:881d:9f88:2]:22 (DNS: [fdaa::3]:53)
2021-11-12T20:49:53.500 app[881d9f88] ams [info] Start dev
2021-11-12T20:49:53.500 app[881d9f88] ams [info] time="2021-11-12T20:49:53Z" level=info msg="initConfig: load config file"
2021-11-12T20:49:53.500 app[881d9f88] ams [info] time="2021-11-12T20:49:53Z" level=info msg="Start main loop"

... 5 lines omitted ...

2021-11-12T20:49:59.308 app[0877076f] ams [info] time="2021-11-12T20:49:59Z" level=warning msg="(ueXxa7f53) success, took 45 millisec"
2021-11-12T20:50:16.256 proxy[881d9f88] ams [info] Health check status changed 'warning' => 'passing'

and no more lines even after 2.5 hours. The application produces about 100 lines per minute.

The region is ams for both the app and the log shipper VMs:

❯ flyctl status checker-logshipper
App
  Name     = checker-logshipper
  Owner    = personal
  Version  = 2
  Status   = running
  Hostname = checker-logshipper.fly.dev

Instances
ID       PROCESS VERSION REGION DESIRED STATUS  HEALTH CHECKS RESTARTS CREATED
be7fda82 app     2       ams    run     running               1        2021-11-11T10:56:25Z


❯ flyctl status checker
App
  Name     = checker
  Owner    = personal
  Version  = 28
  Status   = running
  Hostname = checker.fly.dev

Deployment Status
  ID          = 6f0fc1b2-b4f4-5f1a-a4e5-8b92f003a6a5
  Version     = v28
  Status      = successful
  Description = Deployment completed successfully
  Instances   = 1 desired, 1 placed, 1 healthy, 0 unhealthy

Instances
ID       PROCESS VERSION REGION DESIRED STATUS  HEALTH CHECKS      RESTARTS CREATED
881d9f88 app     28      ams    run     running 1 total, 1 passing 0        2h31m ago

Oh, I missed this. flyctl logs actually queries our Elasticsearch cluster. This is entirely separate from the NATS setup that the Vector/Logflare pipeline uses.

If you restart just your app, do the logs start showing again (without restarting the Vector app)?
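One way to separate those cases is to restart only the application and leave the shipper VM alone, then watch both outputs; a sketch using the app names from this thread:

❯ flyctl restart checker       # restart only the app; leave checker-logshipper untouched
❯ flyctl logs -a checker       # then watch whether new lines resume here (and in Logflare)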

If both fly logs and NATS aren’t getting anything, there’s a chance the app isn’t actually emitting logs. What kind of app is this? Is it possible standard output is being buffered?

Do you see log output when it shuts down? If you see logs like these, the log shipping plumbing is working:

Shutting down virtual machine
Sending signal SIGINT to main child process w/ PID 510
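On the stdout-buffering question above, one way to rule it out is to force line buffering in the entrypoint. A minimal sketch, assuming the /entrypoint.sh seen in the logs and a placeholder binary path; this only helps runtimes that buffer stdout when it isn’t a TTY (Python, for example, would use python -u instead):

#!/bin/sh
# /entrypoint.sh (illustrative): exec the app with line-buffered stdout
# stdbuf only affects programs that use C stdio buffering
exec stdbuf -oL /app/checker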

Thanks for the clarification, that definitely looks like a problem on my application’s side.
I added some explicit logging, and it has been working for 10 hours now.
