After yesterday's fix of #3068, everything worked fine, but now logging for my application has just stopped.
I use logshipper to stream logs into Logflare, and the last message that got shipped was at "2021-11-11T21:15:17Z". I see the same in the flyctl logs output.
I'm sure my application is still working because it responds over HTTP.
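For reference, the shipper is a dedicated Fly app running Vector that subscribes to the org-wide log stream over NATS and forwards it to Logflare, following the fly-log-shipper repo. A rough sketch of how mine is wired up (the secret names are my recollection of the fly-log-shipper README, so treat them as assumptions and verify against the version you deployed):
❯ flyctl secrets set ORG=personal ACCESS_TOKEN="$(flyctl auth token)" -a checker-logshipper
❯ flyctl secrets set LOGFLARE_API_KEY=<api-key> LOGFLARE_SOURCE_TOKEN=<source-id> -a checker-logshipper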
PS:
After flyctl restart, logging resumed both in flyctl logs and in Logflare.
I'll update this thread in case of new trouble.
To confirm, are you seeing an absence of logs in Logflare as well at the same time? I'm trying to make sure the problem is that logs are not coming out of Fly at all, as opposed to the fly logs connection being cut.
It stopped again; now the first message is at 2021-11-12T10:45:41Z and the last at 2021-11-12T15:18:41Z.
The main app ID is d31870f3.
The logshipper's ID is be7fda82. Running vector tap in the logshipper VM does not produce any output; it just hangs.
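(How I ran the tap, in case it matters: vector tap is a stock Vector subcommand, but the '*' glob, meant to tap every component, is an assumption on my part, so check vector tap --help against the Vector version baked into the image.)
❯ flyctl ssh console -a checker-logshipper
# vector tap '*'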
After deploying a new version of the app, logging resumed in flyctl logs, Logflare, and vector tap.
So I presume the problem is in your logging subsystem; it overflows after 4-4.5 hours, or something like that.
It probably doesn't overflow; there's no real log queue. But it's possible the messages get lost in transit. It's also possible the log VM gets wedged somehow.
Some things to try:
Make sure you’re running the log collection VM in the same region as at least one app instance.
Try configuring log collection with a different sink? If the VM that ships logs gets hung, it might be something about the specific log destination causing it. (See the sketch just below.)
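Both checks are quick from flyctl. LOGTAIL_TOKEN is just one example of an alternate sink taken from the fly-log-shipper README, so verify the exact secret name there; note too that setting a secret redeploys the VM, so wait out the usual 4-5 hour window before judging the result:
❯ flyctl regions list -a checker-logshipper
❯ flyctl regions set ams -a checker-logshipper
❯ flyctl secrets set LOGTAIL_TOKEN=<token> -a checker-logshipper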
It’s interesting that restarting the VM fixes it. That likely means our infrastructure is working properly and could mean Vector + Logflare are doing something strange here.
OK, but if the problem is Vector/Logflare-related, why is flyctl logs stuck on the same log line as well?
After the last restart the app got ID 0877076f, and I got literally just a few log lines,
and that's all.
The application is running; it responds over HTTP.
2021-11-12T20:49:53.443 app[881d9f88] ams [info] Preparing to run: sh -c /entrypoint.sh as root
2021-11-12T20:49:53.450 app[881d9f88] ams [info] 2021/11/12 20:49:53 listening on [fdaa:0:393b:a7b:23c8:881d:9f88:2]:22 (DNS: [fdaa::3]:53)
2021-11-12T20:49:53.500 app[881d9f88] ams [info] Start dev
2021-11-12T20:49:53.500 app[881d9f88] ams [info] time="2021-11-12T20:49:53Z" level=info msg="initConfig: load config file"
2021-11-12T20:49:53.500 app[881d9f88] ams [info] time="2021-11-12T20:49:53Z" level=info msg="Start main loop"
... 5 lines omitted ...
2021-11-12T20:49:59.308 app[0877076f] ams [info] time="2021-11-12T20:49:59Z" level=warning msg="(ueXxa7f53) success, took 45 millisec"
2021-11-12T20:50:16.256 proxy[881d9f88] ams [info] Health check status changed 'warning' => 'passing'
and no more lines even after 2.5 hours. The application produces about 100 lines per minute.
❯ flyctl status checker-logshipper
App
Name = checker-logshipper
Owner = personal
Version = 2
Status = running
Hostname = checker-logshipper.fly.dev
Instances
ID        PROCESS  VERSION  REGION  DESIRED  STATUS   HEALTH CHECKS       RESTARTS  CREATED
be7fda82  app      2        ams     run      running                      1         2021-11-11T10:56:25Z
❯ flyctl status checker
App
Name = checker
Owner = personal
Version = 28
Status = running
Hostname = checker.fly.dev
Deployment Status
ID = 6f0fc1b2-b4f4-5f1a-a4e5-8b92f003a6a5
Version = v28
Status = successful
Description = Deployment completed successfully
Instances = 1 desired, 1 placed, 1 healthy, 0 unhealthy
Instances
ID        PROCESS  VERSION  REGION  DESIRED  STATUS   HEALTH CHECKS       RESTARTS  CREATED
881d9f88  app      28       ams     run      running  1 total, 1 passing  0         2h31m ago
Oh, I missed this. flyctl logs actually queries our Elasticsearch cluster. This is entirely separate from the NATS setup that the Vector/Logflare pipeline uses.
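If you want to take Vector out of the picture entirely, you can subscribe to the NATS log stream yourself from a VM inside your org's private network. A sketch with the nats CLI, with $FLY_ACCESS_TOKEN standing in for your auth token; the [fdaa::3]:4223 endpoint, the org-slug/token credentials, and the logs.> subject are my recollection of what fly-log-shipper connects with, so verify them against its vector.toml first:
# nats sub --server 'nats://[fdaa::3]:4223' --user personal --password "$FLY_ACCESS_TOKEN" 'logs.>'
If messages keep flowing there while Logflare is silent, the problem is in the Vector/Logflare leg; if NATS goes quiet at the same timestamp, it's upstream of both.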
If you restart just your app, do the logs start showing again (without restarting the Vector app)?
If both fly logs and NATS aren't getting anything, there's a chance the app isn't actually sending logs. What kind of app is this? Is it possible stdout is buffering?
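If buffering turns out to be the suspect, one blunt way to rule it out is forcing line-buffered output in the entrypoint. A sketch only: /entrypoint.sh is the path from your boot log, /app/checker is a hypothetical binary name, and stdbuf (GNU coreutils) only affects C-stdio buffering, so a Go binary writing straight to os.Stdout shouldn't need it:
# in /entrypoint.sh
exec stdbuf -oL -eL /app/checker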
Do you see log output when it shuts down? If you see logs like these, the log shipping plumbing is working:
Shutting down virtual machine
Sending signal SIGINT to main child process w/ PID 510
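You can trigger that on demand by stopping the VM while watching the stream (the VM ID is from your flyctl status output above; vm stop syntax can vary between flyctl versions):
❯ flyctl logs -a checker &
❯ flyctl vm stop 881d9f88 -a checker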