Why is my Java machine stuck?

one of my vms is stuck: on restart it emits a few logs, then the app never comes up. i had to scale up a new instance just to keep traffic flowing.

this isn’t some exotic edge-case setup. it’s a jar that should just boot. instead it’s 50/50 roulette whether fly actually does the basics.

afaict the only way to get real support is to pay $29/mo just to report that my $4/mo app is bricked. the forums are nice, but they don’t fix prod.

pls advise how i’m supposed to run anything reliable here.

cmds:

haarolean@haarolean-laptop:[~]$ fly m restart --select -a kapybro
? Select machines: 328790ec713d28 empty-moon-9164 (started, region iad, process group 'app')
Restarting machine 328790ec713d28
  Waiting for 328790ec713d28 to become healthy (started, 0/1)
^CReleasing lease for machine 328790ec713d28...
haarolean@haarolean-laptop:[~]$ fly m restart --select -a kapybro
? Select machines: 328790ec713d28 empty-moon-9164 (started, region iad, process group 'app')
Restarting machine 328790ec713d28
  Waiting for 328790ec713d28 to become healthy (started, 0/1)
^CReleasing lease for machine 328790ec713d28...
haarolean@haarolean-laptop:[~]$  fly m stop --select -a kapybro
? Select machines: 328790ec713d28 empty-moon-9164 (started, region iad, process group 'app')
Sending kill signal to machine 328790ec713d28...
328790ec713d28 has been successfully stopped
haarolean@haarolean-laptop:[~]$  fly m stop --select -a kapybro
? Select machines:  [Use arrows to move, space to select, <right> to all, <left> to none, type to filter]
> [ ]  328790ec713d28 empty-moon-9164 (stopped, region iad, process group 'app')
haarolean@haarolean-laptop:[~]$ fly m restart --select -a kapybro
? Select machines: 328790ec713d28 empty-moon-9164 (stopped, region iad, process group 'app')
Restarting machine 328790ec713d28
  Waiting for 328790ec713d28 to become healthy (started, 0/1)
^CReleasing lease for machine 328790ec713d28...
haarolean@haarolean-laptop:[~]$  fly m scale -a kapybro

logs:


2025-12-10T20:30:44Z app[328790ec713d28] iad [info] INFO Sending signal SIGINT to main child process w/ PID 640
2025-12-10T20:30:49Z app[328790ec713d28] iad [info] INFO Sending signal SIGTERM to main child process w/ PID 640
2025-12-10T20:30:55Z app[328790ec713d28] iad [warn]Virtual machine exited abruptly
2025-12-10T20:30:56Z app[328790ec713d28] iad [info]2025-12-10T20:30:56.543026162 [01KC4YY8AC9PBCBPVS7S4GG5CT:main] Running Firecracker v1.12.1
2025-12-10T20:30:56Z app[328790ec713d28] iad [info]2025-12-10T20:30:56.543184099 [01KC4YY8AC9PBCBPVS7S4GG5CT:main] Listening on API socket ("/fc.sock").
2025-12-10T20:30:57Z app[328790ec713d28] iad [info] INFO Starting init (commit: 6f59af0a)...
2025-12-10T20:30:58Z app[328790ec713d28] iad [info] INFO Preparing to run: `java -jar appname.jar` as appname
2025-12-10T20:30:58Z app[328790ec713d28] iad [info] INFO [fly api proxy] listening at /.fly/api
2025-12-10T20:30:58Z runner[328790ec713d28] iad [info]Machine started in 2.479s
2025-12-10T20:31:06Z app[328790ec713d28] iad [info]2025/12/10 20:31:06 INFO SSH listening listen_address=[fdaa:9:7577:a7b:93:518c:a931:2]:22
2025-12-10T20:31:35Z app[328790ec713d28] iad [info] INFO Sending signal SIGINT to main child process w/ PID 642
2025-12-10T20:31:35Z app[328790ec713d28] iad [info] INFO Main child exited normally with code: 130
2025-12-10T20:31:36Z app[328790ec713d28] iad [info] INFO Starting clean up.
2025-12-10T20:31:36Z app[328790ec713d28] iad [info][   39.391238] reboot: Restarting system
2025-12-10T20:31:36Z app[328790ec713d28] iad [info]2025-12-10T20:31:36.845462413 [01KC4YY8AC9PBCBPVS7S4GG5CT:main] Running Firecracker v1.12.1
2025-12-10T20:31:36Z app[328790ec713d28] iad [info]2025-12-10T20:31:36.845622340 [01KC4YY8AC9PBCBPVS7S4GG5CT:main] Listening on API socket ("/fc.sock").
2025-12-10T20:31:38Z app[328790ec713d28] iad [info] INFO Starting init (commit: 6f59af0a)...
2025-12-10T20:31:38Z app[328790ec713d28] iad [info] INFO Preparing to run: `java -jar appname.jar` as appname
2025-12-10T20:31:38Z app[328790ec713d28] iad [info] INFO [fly api proxy] listening at /.fly/api
2025-12-10T20:31:39Z runner[328790ec713d28] iad [info]Machine started in 2.548s
2025-12-10T20:31:45Z app[328790ec713d28] iad [info]2025/12/10 20:31:45 INFO SSH listening listen_address=[fdaa:9:7577:a7b:93:518c:a931:2]:22
2025-12-10T20:32:15Z app[328790ec713d28] iad [info] INFO Sending signal SIGINT to main child process w/ PID 641
2025-12-10T20:32:16Z app[328790ec713d28] iad [info] INFO Main child exited normally with code: 130
2025-12-10T20:32:16Z app[328790ec713d28] iad [info] INFO Starting clean up.
2025-12-10T20:32:16Z app[328790ec713d28] iad [info][   39.488670] reboot: Restarting system
2025-12-10T20:32:29Z app[328790ec713d28] iad [info]2025-12-10T20:32:29.184869879 [01KC4YY8AC9PBCBPVS7S4GG5CT:main] Running Firecracker v1.12.1
2025-12-10T20:32:29Z app[328790ec713d28] iad [info]2025-12-10T20:32:29.185052266 [01KC4YY8AC9PBCBPVS7S4GG5CT:main] Listening on API socket ("/fc.sock").
2025-12-10T20:32:30Z app[328790ec713d28] iad [info] INFO Starting init (commit: 6f59af0a)...
2025-12-10T20:32:30Z app[328790ec713d28] iad [info] INFO Preparing to run: `java -jar appname.jar` as appname
2025-12-10T20:32:30Z app[328790ec713d28] iad [info] INFO [fly api proxy] listening at /.fly/api
2025-12-10T20:32:31Z runner[328790ec713d28] iad [info]Machine started in 1.973s
2025-12-10T20:32:38Z app[328790ec713d28] iad [info]2025/12/10 20:32:38 INFO SSH listening listen_address=[fdaa:9:7577:a7b:93:518c:a931:2]:22

It looks like one of the machines (328790ec713d28) is doing something very CPU-heavy, getting throttled, and then either timing out on its own or getting stopped by autostop. Java programs are known to do quite heavy initialization work on start, but since only one of your machines has this issue, it sounds to me like something else in the application logic. Unfortunately, that’s where our insight ends, since we can’t peek into your application’s source code.

If your app is doing a heavy initialization task on start that exceeds the CPU limits of a shared-1x machine (see CPU Performance · Fly Docs for details), please consider spreading that initialization out or scaling to more cores or to performance cores. If initialization takes a long time, autostop may also not be the right option, since we have no way to know that a started machine is still doing initialization work, and we would happily scale it down after the cooldown.
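If autostop turns out not to be the right fit while you tune startup, it can be disabled in fly.toml. A minimal sketch, assuming the app serves HTTP via an `[http_service]` section (the port and counts are illustrative, not taken from this app):

```toml
[http_service]
  internal_port = 8080        # whatever port the jar listens on
  auto_stop_machines = "off"  # never stop machines after the idle cooldown
  auto_start_machines = true
  min_machines_running = 1    # keep at least one machine up in the primary region
```

With `auto_stop_machines = "off"`, a machine that is mid-initialization can no longer be scaled down for looking idle.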


It looks like this is stuck in a loop; I wonder whether your app never becomes healthy, based on the health check you have configured. It may help for us to see your fly.toml and maybe your Dockerfile as well.

This occurred to me as well. Haarolean, can we see some CPU graphs from your Grafana instance?

  • my app can’t do anything before it starts
  • i’m sure your orchestration doesn’t even start it: there are no startup logs in this case

sure:

i did a few releases in a row, then this happened.

if that’s throttling, it’s a wild ux choice: the only machine refusing to start the app, with zero hint as to why

Yep, that’s throttling: the dashed red line is sky-high there. :grimacing:

(You’re supposed to keep CPU utilization mostly below the yellow, also-dashed horizontal line…)


There was talk back when CPU throttling was first introduced of maybe allowing people to buy a pool of credits to be used at startup, since needing a larger CPU just to handle the initialization phase was a particular complaint by JVM users, but that seems to have fallen by the wayside.


The machine isn’t refusing to start the app. The app is being initialized, but it hasn’t reached a point where it can emit logs; nothing else would explain it using so much CPU on start. If you’re sure your application logic hasn’t been reached, then maybe this is just the JVM initializing; in that case there may be JVM debug logs you could enable. It’s entirely possible that once your codebase grows past a certain size, the JVM uses more CPU on startup than shared-1x allows, and that’s why you’re seeing throttling.

The easiest way to confirm this is to temporarily scale up to a higher number of CPUs and observe CPU usage as the machine starts. If that confirms the hypothesis, you can look into ways to keep the JVM from using that much CPU on startup, so you can scale the machine back down again.
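On the JVM-debug-logs point: HotSpot’s unified logging flag (`-Xlog`) can show where startup time goes, e.g. launching with something like `java -Xlog:gc,class+load=info -jar appname.jar` (the flag is real; the chosen tags are just one reasonable pick). You can also have the app report its own startup cost as its very first line; a small sketch:

```java
import java.lang.management.ManagementFactory;

// Prints how long the JVM has been alive by the time application code is
// finally reached. On a throttled machine this number balloons even though
// the application logic itself hasn't changed.
public class StartupTimer {
    public static void main(String[] args) {
        long uptimeMs = ManagementFactory.getRuntimeMXBean().getUptime();
        System.out.println("JVM uptime at main(): " + uptimeMs + " ms");
        // ... hand off to the real application here ...
    }
}
```

Comparing that number on a shared-1x machine versus a temporarily scaled-up one would separate "JVM is slow to start" from "app logic is slow to start".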


By the way, that could also explain why this happens intermittently: every machine accumulates “CPU credits” whenever it isn’t using all of its allocated CPU. Those credits can be exhausted when the machine is restarted several times in quick succession, as was done here, because there isn’t enough time for the credits to recharge after heavy JVM initialization spends them. Maybe the quickest fix is to wait a while without attempting to start the machine, and let the credits recover.
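The credit mechanism behaves roughly like a token bucket, which makes the "works sometimes" pattern easy to reproduce in a toy model. All numbers below are illustrative, not Fly's actual accounting:

```java
// Toy token-bucket model of burstable CPU credits. A shared CPU earns a
// baseline fraction of a core every second and spends at the app's actual
// demand; a burst that outruns the banked credits gets throttled.
public class CreditModel {
    static double credits = 30.0;          // seconds of full-speed CPU banked
    static final double MAX = 30.0;        // cap on banked credits
    static final double BASELINE = 0.0625; // baseline earn rate (1/16 core/sec)

    // Simulate `seconds` of wall time at the given CPU demand (0..1 of a core).
    // Returns true if the burst stayed within credits, false if throttled.
    static boolean run(double seconds, double demand) {
        for (int s = 0; s < seconds; s++) {
            credits += BASELINE;                // always earn the baseline
            credits -= demand;                  // spend what the app uses
            if (credits < 0) { credits = 0; return false; } // throttled
            if (credits > MAX) credits = MAX;
        }
        return true;
    }

    public static void main(String[] args) {
        // Three quick restarts, each burning ~20s of near-full CPU at startup:
        for (int attempt = 1; attempt <= 3; attempt++) {
            boolean ok = run(20, 0.95);
            System.out.printf("restart %d: %s (credits left: %.1f)%n",
                    attempt, ok ? "fast" : "THROTTLED", credits);
        }
    }
}
```

In this toy run the first restart completes at full speed on banked credits, while the back-to-back restarts hit the throttle mid-initialization, which matches the intermittent behavior seen in this thread.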

Haarolean: I seem to recall on another thread a suggestion that JVM programs do some kind of bytecode cache warmup. I don’t use Java, but an AI chat gave me these ideas:

  • Use -Xms and -Xmx: set the initial and maximum heap sizes (often to the same value) to avoid repeated heap resizing during startup.
  • Warm up the JIT compiler.
  • G1 garbage collector: if you’re not already using it, switch to G1 for better management of heap space.
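Flags like these are easy to get wrong silently, so it can be worth having the app confirm what the JVM actually received. A small sketch (the launch command and sizes are illustrative placeholders, not recommendations):

```java
// Example launch: java -Xms512m -Xmx512m -XX:+UseG1GC -jar appname.jar
// Reports the resources the running JVM actually sees, so you can confirm
// that -Xmx took effect and how many CPUs the machine exposes.
public class JvmReport {
    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        System.out.println("max heap MiB: " + rt.maxMemory() / (1024 * 1024));
        System.out.println("cpus visible: " + rt.availableProcessors());
    }
}
```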
