I’m new to Fly.io. I set up the open source newsletter app Listmonk first on Railway.app, and today on Fly.io.
Accessing the deployed Fly.io app is fine. But anything where the app (I’m guessing) needs to access its connected database, such as adjusting and saving settings, takes about 10 seconds to complete in the Fly.io version. On the Railway.app version, it takes 30 to 60 ms according to browser dev tools.
Thanks for the reply. I went through the flyctl launch process to set up the app using a Dockerfile, and added a Postgres instance on Fly’s free level.
I’m not sure how the database is set up—the app does that automatically on install.
So… ssh into the app while it’s running and do a database operation? I can try, though that’ll take some digging. I’m not a developer, was just hoping for more of an install and forget situation.
The DB and the app are running in the same region, which is
There are no suspect logs in the DB logs
There are no suspect logs in the app’s logs
I looked for a configuration option to enable more verbose logs on listmonk, but couldn’t find one. They’re just using the default log level on their logger.
All in all, it doesn’t appear that listmonk is easily observable.
From your original post, the only thing that stands out is:
Thanks. Yes, same settings on Railway. EDIT: You’re right, those Postgres settings are higher on the Railway app. I’ll try them in Fly.io! (The template file for the Railway setup is here.)
The only other thing that’s different is that the Dockerfile in the Railway setup runs a config.sh that populates a config.toml that listmonk uses for its own config. I used listmonk’s in-app env variables in my Dockerfile and skipped creating the config.toml file (as suggested by the app dev here).
@jerome Unfortunately, I spoke too soon. It may have been fast for a try or two, or I might have actually been testing on the browser tab with the Railway version. In any case, it’s still taking 10-11 seconds.
When I do save a setting, though, the app immediately responds with “Settings saved. Reloading app.” It takes 10 seconds for the page then to refresh and the spinner to stop. I don’t know if the message truly indicates the settings have been saved to the database, and that something in the app is taking too long to update—or if in fact the slowdown is with saving to the database.
Ok, I see what’s happening. I think your app is failing its health checks and keeps restarting. I’m not sure why yet though.
Do you know how much memory you have available on your Railway instance? I think the trial plan for Railway offers 512MB RAM. You only have 256MB on our basic instances.
2022/10/07 21:14:45 manager.go:494: error fetching campaigns: pq: cached plan must not change result type
2022/10/07 21:14:50 manager.go:494: error fetching campaigns: pq: cached plan must not change result type
2022/10/07 21:14:55 manager.go:494: error fetching campaigns: pq: cached plan must not change result type
2022/10/07 21:15:00 manager.go:494: error fetching campaigns: pq: cached plan must not change result type
2022/10/07 21:15:05 manager.go:494: error fetching campaigns: pq: cached plan must not change result type
2022/10/07 21:15:10 manager.go:494: error fetching campaigns: pq: cached plan must not change result type
2022/10/07 21:15:15 manager.go:494: error fetching campaigns: pq: cached plan must not change result type
2022/10/07 21:15:20 manager.go:494: error fetching campaigns: pq: cached plan must not change result type
2022/10/07 21:15:25 manager.go:494: error fetching campaigns: pq: cached plan must not change result type
Sending signal SIGINT to main child process w/ PID 520
Starting clean up.
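The timestamps in the log excerpt above look evenly spaced, which suggests a periodic job rather than something triggered by each settings save. A minimal Python sketch for checking that spacing is below. The function name `error_intervals` and the sample variable are made up for illustration, and the timestamp format is assumed from the lines shown above.

```python
from datetime import datetime
from typing import List

# Three lines copied from the log excerpt above, plus one non-error line.
SAMPLE_LOG = """\
2022/10/07 21:14:45 manager.go:494: error fetching campaigns: pq: cached plan must not change result type
2022/10/07 21:14:50 manager.go:494: error fetching campaigns: pq: cached plan must not change result type
2022/10/07 21:14:55 manager.go:494: error fetching campaigns: pq: cached plan must not change result type
Sending signal SIGINT to main child process w/ PID 520
"""


def error_intervals(log_text: str, needle: str = "error fetching campaigns") -> List[float]:
    """Return the gaps, in seconds, between consecutive log lines containing `needle`."""
    stamps = []
    for line in log_text.splitlines():
        # Expected shape (an assumption): "<date> <time> <rest of message>"
        parts = line.split(" ", 2)
        if len(parts) < 3 or needle not in parts[2]:
            continue
        try:
            stamps.append(datetime.strptime(parts[0] + " " + parts[1], "%Y/%m/%d %H:%M:%S"))
        except ValueError:
            continue  # line did not start with a timestamp in the assumed format
    return [(b - a).total_seconds() for a, b in zip(stamps, stamps[1:])]


if __name__ == "__main__":
    print(error_intervals(SAMPLE_LOG))
```

If every gap comes out the same, the errors are probably coming from a background loop in the app and are a separate question from the slow page refresh.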
It likely does. I’m not sure. I don’t see restarts of the whole VM, which is odd to me. Unless there’s a built-in restart mechanism that doesn’t involve shutting down the process.
What happens from our view point is:
Server is not accepting connections (because it’s shut down)
Our automated health check fails (because of the former)
Our proxy won’t send requests to your app if that happens, but it’ll retry for a while in case it becomes healthy.
After a few seconds, the app appears to be listening again on the correct port and accepts the connection (passes the health check) and our proxy sends a request to it (which appears to succeed)
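To see that sequence from the outside, one option is to poll the app’s port right after saving settings and measure how long it takes before connections are accepted again. This is only a rough Python sketch of the idea, not how Fly’s proxy or health checks are implemented. The function name `wait_until_listening` is made up, and the host and port in the usage section are placeholders.

```python
import socket
import time
from typing import Optional


def wait_until_listening(host: str, port: int, timeout: float = 15.0,
                         interval: float = 0.1) -> Optional[float]:
    """Retry a TCP connect until it succeeds.

    Returns the seconds elapsed before the first successful connect,
    or None if nothing accepted a connection within `timeout`.
    """
    start = time.monotonic()
    while time.monotonic() - start < timeout:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=1.0):
                return time.monotonic() - start
        except OSError:
            # Refused or timed out: the server is not accepting connections yet.
            time.sleep(interval)
    return None


if __name__ == "__main__":
    # Placeholder values; replace with the app's real host and port.
    elapsed = wait_until_listening("127.0.0.1", 9000)
    print("not listening within timeout" if elapsed is None else f"listening after {elapsed:.2f}s")
```

If the port is accepting connections again within a fraction of a second, the long wait is more likely in the proxy retry or the health check than in the app restart itself.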
Ok, thanks so much for this. Now as I try to troubleshoot, when I try to flyctl proxy my database, I get “host was not found in DNS”, even after restarting the database.
Ok, the restart was a red herring. According to the developer:
This is by design. listmonk has to reload settings and reinitialize a number of controllers and clients, which is best done with a restart.
The delay, it must be a Fly.io thing. Best to ask them. The restart otherwise is instant and seamless.
So any idea how I might discover what’s taking the app so long to restart?
Actually, it looks like the app is restarting very quickly (In the logs I posted above, the total time between reloading on signal and http server started is about 300ms). But in the Fly.io instance of the app, the page doesn’t complete its refresh for 10 seconds.
It looks like it actually has something to do with a health process in the app that is taking 10 seconds to respond, whereas in the Railway instance, the same process is responding nearly immediately with an error, instead. So… nothing to do with Fly.io!
How have you tested this? If you ssh into your instance, you should be able to curl the service.
Can you try removing all health checks from your fly.toml and see if that helps?
Hmm, does that mean every time you save the settings, it restarts the app? I can see that being simpler for listmonk! Normally this wouldn’t be a problem, but it looks like the way we do health checks (via Consul) breaks it. Turning them off might help.
Ok, it may still be a Fly.io issue after all. I think something about the Fly instance is hanging up listmonk’s call to its own api (which is what that health item is).
I said that the Railway instance was just immediately returning an error for that health process. But it turns out it initially returns an error and then a 200 immediately after, whereas the Fly.io instance keeps trying health over and over (but with no errors) for 10 seconds (see screen caps below).
When I save the listmonk settings with the Network tab open in Chrome dev tools, this is what I see. Sorted by response time, Fly instance:
Railway initially gets an error but then immediately after gets a response from health.
For some reason, I don’t have curl when I ssh: /bin/sh: curl: not found.
But health is just an api at <my app's url>/api/health. When I visited that address in my browser, it responded in 33ms.
No change when disabling health checks and redeploying.
So it seems like what I need to hunt down is this: On Fly, the listmonk app’s call to its own /api/health address is hanging, whereas it doesn’t have that problem on Railway.
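Since curl isn’t available in the container, one way to narrow this down is to time the health endpoint repeatedly from any machine that has Python, starting right after a settings save. The sketch below is only that, a sketch: `time_get` is a made-up helper name, and the URL in the usage section is a placeholder for the app’s real address followed by /api/health.

```python
import time
import urllib.error
import urllib.request
from typing import Optional, Tuple


def time_get(url: str, timeout: float = 15.0) -> Tuple[Optional[int], float]:
    """GET `url` once and return (HTTP status or None, seconds elapsed).

    The status is None when no HTTP response arrived at all
    (connection refused, DNS failure, or timeout).
    """
    start = time.monotonic()
    status: Optional[int]
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            resp.read()
            status = resp.status
    except urllib.error.HTTPError as err:
        status = err.code  # the server answered, but with an error status
    except OSError:
        status = None  # URLError and socket timeouts are both OSError subclasses
    return status, time.monotonic() - start


if __name__ == "__main__":
    # Placeholder URL; replace with the app's real address.
    url = "http://127.0.0.1:9000/api/health"
    for _ in range(20):
        status, elapsed = time_get(url)
        print(f"status={status} elapsed={elapsed * 1000:.0f}ms")
        time.sleep(0.5)
```

A quick error followed by a 200 would match what the Railway instance shows in dev tools. Requests that hang for several seconds without an error would match the Fly instance.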