Machine not starting automatically when receiving requests

My application was working yesterday (2024-10-16) and I made no changes to it, but it stopped working sometime after 3pm EST. I noticed at 11pm EST, when a script that backs up the data on the volume failed because the machine didn’t wake up when the script pinged it.

I have confirmed some things:

  • I can start the machine manually in the fly.io dashboard
  • I can confirm that the port it’s listening on is open
  • I can ping the dedicated ipv4 address my application uses
  • Even after starting the machine, traffic to the app does not reach the machine
  • A separate app I have hosted on fly.io in the same region (yyz) is still working; traffic sent to it causes firecracker to start the machine(s).
  • fly apps list shows the app status as deployed
  • While the machine is manually running, fly machine status shows
    ...
    State: started
    HostStatus: ok
    ...
    Event Logs
    STATE   EVENT   SOURCE  TIMESTAMP                       INFO 
    started start   flyd    2024-10-17T11:37:42.47-04:00 
    created launch  user    2024-10-17T11:37:33.933-04:00
    

Things I have tried already:

  • fly deploy the app with the same configuration as before
  • Release the dedicated ipv4 and request a new one
  • fly deploy the app with no [[services]] and deploy it again with the [[services]] added back
  • fly deploy the app changing its memory
  • fly deploy the app changing its cpu

I suspect this has something to do with the networking configuration fly uses for the component that listens on the app’s address and starts the firecracker VM/machine when traffic arrives. However, I do not know how to force a recreation of those networks like one would in kubernetes.

Was this working prior to 2024-10-16? I haven’t seen any proxy issues on my apps, so if it is a problem, it’s likely region specific. Where is your backup script running, and how is it communicating w/ your machine?

Yes, this had been working smoothly for over a month prior to 2024-10-16.

It isn’t just the backup that doesn’t work now; the whole app doesn’t work because the machines never wake up when traffic is sent to them. My users are unable to use the application right now.

The backup script first curls the app address to wake it up, waits for it to start, then establishes a connection to it using flyctl ssh issue and flyctl proxy before rsyncing the data I need off the volume. Since the machine never starts after the curl request is sent, the rest of the script fails.
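
The wake-up step of a script like this could be sketched in Python roughly as follows. This is only an illustration under assumptions, not the actual backup script: the function names, the retry count, the delay, and the example URL are all hypothetical, and it assumes that any HTTP response with a status below 500 means the machine is up.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def probe(url: str, timeout: float = 5.0) -> bool:
    """Return True if the app answers with a non-5xx HTTP response."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError as err:
        # Assumption: a 4xx still comes from the running app,
        # while a 5xx may mean nothing behind the address is serving.
        return err.code < 500
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, and similar.
        return False


def wait_until_awake(
    url: str,
    attempts: int = 10,
    delay: float = 3.0,
    check: Callable[[str], bool] = probe,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Send requests until the app responds or the attempts run out."""
    for attempt in range(attempts):
        if check(url):
            return True
        if attempt < attempts - 1:
            sleep(delay)  # give the machine time to start
    return False


if __name__ == "__main__":
    # Hypothetical address; replace with the real app URL.
    if not wait_until_awake("https://my-app.example"):
        raise SystemExit("machine did not wake up; skipping backup")
```

With a check like this in front, the script can stop with a clear message when the machine never wakes, instead of failing later in the flyctl and rsync steps.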

The region it is running in is yyz, but so is the other app I have that is still working.

There’s a new status update: https://status.flyio.net/incidents/ftd07gnytjl4
