Requests to a suspended machine are taking a long time

When my machine is suspended due to inactivity, the first request that triggers the wake-up takes a long time to return to the user (or never returns).

The logs actually show the machine waking up correctly:

2025-07-25 10:15:50.980 machine became reachable in 6.717712ms
2025-07-25 10:15:50.974 machine started in 223.935944ms
2025-07-25 10:15:50.972 Machine started in 218ms
2025-07-25 10:15:50.839 2025-07-24T22:15:50.839963167 [01K0Z6JEQ0E2TWFP5712F38WPH:fc_api] The API server received a Put request on "/logger" with body "{"log_path":"logs.fifo","level":"info"}".
2025-07-25 10:15:50.839 2025-07-24T22:15:50.839889975 [01K0Z6JEQ0E2TWFP5712F38WPH:fc_api] API server started.
2025-07-25 10:15:50.839 2025-07-24T22:15:50.839571880 [01K0Z6JEQ0E2TWFP5712F38WPH:main] Listening on API socket ("/fc.sock").
2025-07-25 10:15:50.839 2025-07-24T22:15:50.839442498 [01K0Z6JEQ0E2TWFP5712F38WPH:main] Running Firecracker v1.12.1
2025-07-25 10:15:50.750 Starting machine
2025-07-25 10:02:13.631 Virtual machine has been suspended

But a response takes over 30 seconds, so it appears to the user that it hangs. Refreshing the page resolves the issue as the machine is awake for the next request.

Here is the response time from Postman:

When I do it from the browser, it seems to never return (stays in the “pending” status indefinitely).

When I manually suspend the machine (as opposed to waiting for it to suspend itself), it seems to wake up and respond correctly.

I have also seen it wake up correctly, but the majority of the time, the response is never returned.

I have tried using auto_stop_machines = true rather than 'suspend', and that seems to reliably return from the stopped state. It would obviously be better to have the same behaviour from the suspended state.

Hey @paulactually

Could you set the flyio-debug: doit header on this request and post the fly-request-id value from the response here, please?

Hi @pavel.

I made a request with that header this morning. The fly-request-id was: 01K16V7AZ4ZJ1VRBT2NDS0AD7T-syd. There was also the flyio-debug header with the value:

{"n":"edge-cf-syd1-777c","nr":"syd","ra":"125.236.220.56","rf":"Verbatim","sr":"syd","sdc":"syd1","sid":"0801693a191618","st":0,"nrtt":1,"bn":"worker-cf-syd1-519a","mhn":null,"mrtt":null}

Here are the full request and response headers:

Also, the logs for that request:

2025-07-28 09:04:37.576 machine became reachable in 10.463414ms
2025-07-28 09:04:37.566 machine started in 212.047468ms
2025-07-28 09:04:37.564 Machine started in 206ms
2025-07-28 09:04:37.447 2025-07-27T21:04:37.447628939 [01K16SX00PT0HVST9PY8HNS3YG:fc_api] The API server received a Put request on "/logger" with body "{\"log_path\":\"logs.fifo\",\"level\":\"info\"}".
2025-07-28 09:04:37.445 2025-07-27T21:04:37.445903630 [01K16SX00PT0HVST9PY8HNS3YG:fc_api] API server started.
2025-07-28 09:04:37.445 2025-07-27T21:04:37.445614085 [01K16SX00PT0HVST9PY8HNS3YG:main] Listening on API socket ("/fc.sock").
2025-07-28 09:04:37.445 2025-07-27T21:04:37.445436032 [01K16SX00PT0HVST9PY8HNS3YG:main] Running Firecracker v1.12.1
2025-07-28 09:04:37.353 Starting machine
2025-07-28 08:48:09.628 Virtual machine has been suspended

Thanks. Paul.

Hmm, I don’t see anything wrong in our logs.

When you made the request, the proxy woke up the machine and established a new connection to it. It took the app ~30s to respond:

21:04:37.576626000: backhaul -> backend: Request { method: GET, ... }
21:05:08.654473000: backhaul <- backend: Response { status: 200, ... }

Does your app need to talk to some external resource (e.g. a database) to serve such a request? If so, it could be that some connections to the external resource in the pool are already dead (because the machine was suspended), but it takes a while for the TCP/IP stack and client libraries to realize that once the machine is resumed.

Could you add some logs to make it easier to understand where the app spends the time while serving the request?
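
One way to get that kind of log is to wrap each slow-looking step in a small timing helper. The sketch below is only an illustration: `timed` and `fakeFetch` are hypothetical names that do not appear in the app discussed here, and `fakeFetch` stands in for a real dependency such as a database call.

```typescript
// Hypothetical helper: runs one async step and logs how long it took,
// whether the step resolves or rejects.
async function timed<T>(label: string, step: () => Promise<T>): Promise<T> {
  const startedAt = Date.now();
  try {
    return await step();
  } finally {
    console.log(`[timing] ${label} took ${Date.now() - startedAt}ms`);
  }
}

// Stand-in for a slow dependency such as a database fetch (assumption).
function fakeFetch(delayMs: number): Promise<string> {
  return new Promise((resolve) => setTimeout(() => resolve("ok"), delayMs));
}
```

In a request handler this could be used as `await timed("db fetch", () => fakeFetch(50))`, replacing `fakeFetch` with the real call, so the logs show which step accounts for the delay after a resume.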

Hi Pavel,

I think you are correct. I added some logs, and the middleware runs reasonably fast, but the longest part is waiting for a database (mongodb) fetch.

Is there a way to make this faster?

Where is your MongoDB instance located, and is it a managed service e.g. with MongoDB Atlas? What language are you using for your web app? What library are you using for your Mongo connections? Per the advice from @pavel, is there a way in your connection library to force a new connection?

The MongoDB instance is hosted externally on Atlas. The web app is Node.js (Express). The connection library is the standard mongodb Node.js driver.

I guess I can alter the connection options (e.g. connectTimeoutMS). However, if I lower them too much I will get false positives during normal network issues, and I am not sure what impact this will have.
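
For reference, here is a sketch of the timeout-related options in question. The option names are ones the official mongodb Node.js driver accepts, but the values are illustrative assumptions, not tested recommendations, and the `TimeoutOptions` interface is declared locally only to keep the sketch dependency-free. The driver documents a default of 30000 ms for both serverSelectionTimeoutMS and connectTimeoutMS, which is close to the delay seen in this thread, though nothing here confirms that as the cause.

```typescript
// Subset of option names accepted by the official `mongodb` Node.js driver.
// Declared locally so the sketch has no dependencies; in an app the object
// would be passed as the second argument to `new MongoClient(uri, options)`.
interface TimeoutOptions {
  serverSelectionTimeoutMS: number;
  connectTimeoutMS: number;
  socketTimeoutMS: number;
  maxIdleTimeMS: number;
  heartbeatFrequencyMS: number;
}

// Illustrative values only (assumptions, not tested recommendations).
const wakeFriendlyOptions: TimeoutOptions = {
  serverSelectionTimeoutMS: 5000, // stop looking for a usable server after 5s
  connectTimeoutMS: 5000, // fail a new connection attempt after 5s
  socketTimeoutMS: 10000, // give up on a socket with no activity for 10s
  maxIdleTimeMS: 60000, // close pooled connections idle for more than 60s
  heartbeatFrequencyMS: 5000, // check server health every 5s
};
```

Lower values trade faster recovery after a resume for a higher chance of spurious failures on a slow network, so they are worth trying one at a time.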

Is there a recommended way to handle the reconnect after waking up from a suspended state?

I think this problem would benefit from some experimentation.

I would start by putting a Mongo test script on a machine, suspending the machine, waking it up, then running the script. It would be interesting to see whether the script returns a result immediately or has the same problem. If it works flawlessly, then Node may be caching connections, and thus it may be worth working out how to invalidate them on wake.
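
As a starting point for the "invalidate on wake" idea, here is a rough sketch of one way a Node process might notice it was paused: a timer that fires every second should never see a gap of many seconds between ticks. This assumes the wall clock jumps forward across a suspend/resume, which nothing in this thread confirms, and `looksLikeWake` and `watchForWake` are hypothetical names.

```typescript
// Returns true when the time between two timer ticks is much longer than the
// timer interval, which suggests the process was paused in between.
function looksLikeWake(
  previousTickMs: number,
  nowMs: number,
  intervalMs: number,
  toleranceMs: number
): boolean {
  return nowMs - previousTickMs > intervalMs + toleranceMs;
}

// Hypothetical watcher: calls onWake with the observed gap whenever one is
// detected. Returns a function that stops the watcher.
function watchForWake(
  onWake: (gapMs: number) => void,
  intervalMs: number = 1000,
  toleranceMs: number = 5000
): () => void {
  let previousTickMs = Date.now();
  const timer = setInterval(() => {
    const nowMs = Date.now();
    if (looksLikeWake(previousTickMs, nowMs, intervalMs, toleranceMs)) {
      onWake(nowMs - previousTickMs);
    }
    previousTickMs = nowMs;
  }, intervalMs);
  return () => clearInterval(timer);
}
```

The `onWake` callback is where the app could close its existing MongoClient and create a fresh one, so the first request after a resume does not wait on connections that died while the machine was suspended.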
