App was down for 20 minutes in the middle of the night and then restarted. How to investigate?

qqwy · January 6, 2022, 9:21am

Hi there! Yesterday (2022-01-05) between 23:13:56 and 23:34:24 (UTC+0), our app (currently running on a single node) did not respond to any uptime pings or other traffic, and afterwards it restarted.

On the metrics page of the dashboard, we see a clear dip in that time interval, where ‘VM service concurrency’ and ‘data transfer’ dropped to 0.

What is strange, however, is that this restart is not shown under fly status.

There is also no indication anywhere of why our Elixir app restarted. We would have expected to see some information about memory usage in our in-app logs. (We are not currently running a fly-log-shipper inside our cluster.)

Could you investigate what happened?

FrequentFlyer · January 6, 2022, 10:00am

Hi,

Where is/was your app running?
If it was LHR by any chance, it may have been this.

qqwy · January 6, 2022, 10:08am

It is normally supposed to run in ams (the sole datacenter in its main ‘region pool’), but fly status currently lists it as lhr(B) (lhr and fra are in its ‘backup region’ pool).

Thank you for clearing this up; I’ll immediately subscribe to the updates mailer.
