I’m working on implementing an app instance orchestrator. I started by taking a look at the fly-autoscaler implementation; it’s a great reference, but I need more context-aware logic for my case.
An issue I’m seeing is that when a machine is restarted there is a huge delay (single-digit minutes) before metrics are collected for that machine again. I’m wondering if there’s something I can do on my end, or perhaps some improvement on the Fly end with respect to scheduling metrics collection. Currently, to mitigate this, I have to isolate restarted machines until the app-instance orchestrator detects that metrics for the machine have started to be collected again (an uptime reset is observed).
Note in the image below that the app instance represented by the blue line has no metrics collected for an hour; it is stuck in a restart loop due to logic that simulates a memory leak. But also notice that metrics fail to come in for this instance for periods of up to six minutes before it is OOM-killed.
I’d considered spinning up my own Prometheus instance to collect metrics so I can avoid the latency in the Fly-provided metrics system (I understand you collect a lot of metrics), but I couldn’t find where to point Prometheus to scrape the metrics. Is this a private endpoint (i.e. fly-internal), or did I miss some documentation somewhere?
Yeah, I’m using the https://api.fly.io/prometheus/ endpoint to get system metrics; it’s delayed and has the same metrics gaps visualized in Grafana. I was trying to get the system metrics directly from whatever endpoint Fly is scraping, but I can’t see where those metrics are exposed. I do see that I can add instrumentation myself and have my app expose system metrics, but Fly already does that for us, so I thought it would be cool to scrape the metrics from wherever they’re exposed rather than add redundant instrumentation.
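For reference, this is the shape of the query I’m running against the managed endpoint. I’m assuming here that it speaks the standard Prometheus HTTP API (`/api/v1/query`); the org slug, metric name, and token env var are placeholders from my setup, not authoritative:

```go
package main

import (
	"fmt"
	"net/url"
)

// buildQueryURL assembles a Prometheus instant-query URL against the
// Fly-managed endpoint. orgSlug and the PromQL expression are caller-supplied.
func buildQueryURL(orgSlug, promQL string) string {
	return fmt.Sprintf("https://api.fly.io/prometheus/%s/api/v1/query?query=%s",
		orgSlug, url.QueryEscape(promQL))
}

func main() {
	// Example: per-instance memory for one app (metric name from my dashboards,
	// may differ in your org).
	q := buildQueryURL("personal", `fly_instance_memory_mem_total{app="mock-app"}`)
	fmt.Println(q)

	// The actual request carries the org's read token, e.g.:
	//   req.Header.Set("Authorization", "Bearer "+os.Getenv("FLY_PROMETHEUS_TOKEN"))
}
```

This works fine functionally; the problem is purely the collection latency and the gaps described above, which show up identically whether I query it myself or view it in Grafana.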
It seems to happen only on this app, which simulates memory leaks and regularly gets OOM-killed. The app is supposed to simulate detectable unhealthy instance conditions (e.g. a memory leak) in order to test the app-instance orchestrator, which should see this and take action. The mock app will run out of memory after about four minutes of uptime, but sometimes the metrics don’t arrive in time for the orchestrator to detect the simulated leak, and the Fly system kills the app instead (oom-killed).
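Since the simulated leak grows linearly, the orchestrator’s detection is essentially a slope check: extrapolate recent memory samples and act if projected usage crosses the limit before the next collection would land. A hedged sketch of that extrapolation (the two-sample window and the numbers are mine, for illustration only):

```go
package main

import "fmt"

// secondsToLimit linearly extrapolates memory growth from two samples
// (bytes at times t0 < t1, in seconds) and returns how long until
// limitBytes is reached. ok is false if memory is flat or shrinking.
func secondsToLimit(t0, t1, mem0, mem1, limitBytes float64) (sec float64, ok bool) {
	if t1 <= t0 || mem1 <= mem0 {
		return 0, false
	}
	rate := (mem1 - mem0) / (t1 - t0) // bytes per second
	return (limitBytes - mem1) / rate, true
}

func main() {
	// 1 MiB/s growth, 100 MiB used, 220 MiB limit -> 120s until OOM.
	sec, ok := secondsToLimit(0, 60, 40<<20, 100<<20, 220<<20)
	fmt.Println(sec, ok)
}
```

The race is that with a ~4-minute runway, a multi-minute gap in collection leaves no samples to extrapolate from, so the kernel OOM-kills the machine before the orchestrator ever gets a chance.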
@dusty There’s definitely something undesirable going on with metrics collection. I’m finding that sometimes metrics just aren’t collected for a machine for long stretches of time. I can give you the code for this app and the deployment config file if you ever want to try to repro; it’s about 200 lines of mostly boilerplate Go.
This is the memory usage of a mock app. I’ve slowed the app’s memory growth so that it takes about 20 minutes to hit the hard limit on memory. You can also see these gaps. Why are there almost no metrics for 30 minutes between 12:45 and 13:20? I am not interacting with this app during that period; it is only sitting there allocating memory in a loop:
func (eng *BehaviorSimEngine) Tick() {
	eng.IncreaseMemConsumption(1)
}

// IncreaseMemConsumption simulates a leak by permanently retaining
// amountMB of newly allocated memory on each call.
func (eng *BehaviorSimEngine) IncreaseMemConsumption(amountMB int) error {
	size := amountMB * 1024 * 1024
	block := make([]byte, size)
	// Touch one byte per page to force actual (resident) allocation.
	for i := 0; i < size; i += 4096 {
		block[i] = byte(i % 256)
	}
	eng.memBlocks = append(eng.memBlocks, block)
	return nil
}
I wondered if maybe this issue was related to the specific machines, so I destroyed them and created new machines for this app, but I still see the same missing-metrics issue.