Very slow response time

I’m new to Fly.io. I set up the open source newsletter app Listmonk first on Railway.app, and today on Fly.io.

Accessing the deployed Fly.io app is fine. But anything where the app (I’m guessing) needs to access its connected database, such as adjusting and saving settings, takes about 10 seconds to complete in the Fly.io version. On the Railway.app version, it takes between 30 and 60 ms according to browser dev tools.

Any ideas on troubleshooting this?

10 seconds is very slow.

How is your app set up with its database? Where is the database hosted?

To troubleshoot it, I’d check whether the operation is fast when run directly from your app instance, to figure out whether the problem is on our side.

I tried the live demo, but modifying settings is not allowed there.

Thanks for the reply. I went through the flyctl launch process to set up the app using a Dockerfile, and added a Postgres instance on Fly’s free level.

I’m not sure how the database is set up—the app does that automatically on install.

So… ssh into the app while it’s running and do a database operation? I can try, though that’ll take some digging. I’m not a developer, was just hoping for more of an install and forget situation.

@jerome If you want to try troubleshooting, the Dockerfile and flyctl install command I used to set it up are in this post

Looking at your setup more closely:

  • The DB and the app are running in the same region, which is 👍
  • There are no suspect logs in the DB logs
  • There are no suspect logs in the app’s logs

I looked for a configuration option to enable more verbose logs on listmonk, but couldn’t find one. They’re just using the default log level on their logger.

All in all, it doesn’t appear that listmonk is easily observable.

From your original post, the only thing that stands out is:

  --build-arg POSTGRES_MAX_OPEN=3 \
  --build-arg POSTGRES_MAX_IDLE=1

You probably want a bit more than that. Are you using the same settings on Railway?

Thanks. Yes, same settings on Railway. EDIT: You’re right, those Postgres settings are higher on the Railway app. I’ll try them in Fly.io! (The template file for the Railway setup is here.)

The only other thing that’s different is that the Dockerfile in the Railway setup runs a config.sh that populates a config.toml that listmonk uses for its own config. I used listmonk’s in-app env variables in my Dockerfile and skipped creating the config.toml file (as suggested by the app dev here).

That was it, @jerome. Thank you so much for spotting that!

Is performance comparable now?

@jerome Unfortunately, I spoke too soon. It may have been fast for a try or two, or I might have actually been testing on the browser tab with the Railway version. In any case, it’s still taking 10-11 seconds.

When I do save a setting, though, the app immediately responds with “Settings saved. Reloading app.” It then takes 10 seconds for the page to refresh and the spinner to stop. I don’t know whether the message truly indicates that the settings have been saved to the database and something in the app is taking too long to update, or whether the slowdown is in fact in saving to the database.

Any suggestions on sorting this out further?

Ok, I see what’s happening. I think your app is failing its health checks and keeps restarting. I’m not sure why yet though.

Do you know how much memory you have available on your Railway instance? I think the trial plan for Railway offers 512MB RAM. You only have 256MB on our basic instances.

Can you try fly scale memory 512 and report back?

No change in performance after fly scale memory 512 I’m afraid.

I see new logs.

2022/10/07 21:14:45 manager.go:494: error fetching campaigns: pq: cached plan must not change result type
2022/10/07 21:14:50 manager.go:494: error fetching campaigns: pq: cached plan must not change result type
2022/10/07 21:14:55 manager.go:494: error fetching campaigns: pq: cached plan must not change result type
2022/10/07 21:15:00 manager.go:494: error fetching campaigns: pq: cached plan must not change result type
2022/10/07 21:15:05 manager.go:494: error fetching campaigns: pq: cached plan must not change result type
2022/10/07 21:15:10 manager.go:494: error fetching campaigns: pq: cached plan must not change result type
2022/10/07 21:15:15 manager.go:494: error fetching campaigns: pq: cached plan must not change result type
2022/10/07 21:15:20 manager.go:494: error fetching campaigns: pq: cached plan must not change result type
2022/10/07 21:15:25 manager.go:494: error fetching campaigns: pq: cached plan must not change result type
Sending signal SIGINT to main child process w/ PID 520
Starting clean up.

I think it’s crashing in some shape or form.

Oh—I wasn’t aware you were watching the app activity that closely.

That was me, trying different settings and redeploying.

But these logs are without any redeploy. Just saving settings in the listmonk app and waiting for refresh:

2022-10-07T21:15:45.006 app[1d886281] lax [info] 2022/10/07 21:15:45 init.go:736: reloading on signal ...
2022-10-07T21:15:45.006 app[1d886281] lax [info] 2022/10/07 21:15:45 init.go:711: HTTP server shut down
2022-10-07T21:15:45.096 app[1d886281] lax [info] 2022/10/07 21:15:45 main.go:95: v2.2.0 (bbbf28c 2022-07-30T18:18:24Z)
2022-10-07T21:15:45.096 app[1d886281] lax [info] 2022/10/07 21:15:45 init.go:128: reading config: config.toml
2022-10-07T21:15:45.097 app[1d886281] lax [info] 2022/10/07 21:15:45 init.go:255: connecting to db: mb-newsletter-db.internal:5433/mb_newsletter
2022-10-07T21:15:45.296 app[1d886281] lax [info] 2022/10/07 21:15:45 init.go:566: media upload provider: filesystem
2022-10-07T21:15:45.303 app[1d886281] lax [info] 2022/10/07 21:15:45 init.go:490: loaded email (SMTP) messenger: username@smtp.yoursite.com
2022-10-07T21:15:45.304 app[1d886281] lax [info] ⇨ http server started on [::]:9000
2022-10-07T21:19:37.308 app[1d886281] lax [info] 2022/10/07 21:19:37 init.go:736: reloading on signal ...
2022-10-07T21:19:37.308 app[1d886281] lax [info] 2022/10/07 21:19:37 init.go:711: HTTP server shut down
2022-10-07T21:19:37.401 app[1d886281] lax [info] 2022/10/07 21:19:37 main.go:95: v2.2.0 (bbbf28c 2022-07-30T18:18:24Z)
2022-10-07T21:19:37.401 app[1d886281] lax [info] 2022/10/07 21:19:37 init.go:128: reading config: config.toml
2022-10-07T21:19:37.402 app[1d886281] lax [info] 2022/10/07 21:19:37 init.go:255: connecting to db: mb-newsletter-db.internal:5433/mb_newsletter
2022-10-07T21:19:37.606 app[1d886281] lax [info] 2022/10/07 21:19:37 init.go:566: media upload provider: filesystem
2022-10-07T21:19:37.613 app[1d886281] lax [info] 2022/10/07 21:19:37 init.go:490: loaded email (SMTP) messenger: username@smtp.yoursite.com
2022-10-07T21:19:37.615 app[1d886281] lax [info] ⇨ http server started on [::]:9000

HTTP server shut down doesn’t seem good. Does that count as the app restarting?

Apologies, I have a meeting coming up soon and may not be able to try anything new for an hour or so.

It likely does. I’m not sure. I don’t see restarts of the whole VM, which is odd to me. Unless there’s a built-in restart mechanism that doesn’t involve shutting down the process.

What happens from our view point is:

  • Server is not accepting connections (because it’s shut down)
  • Our automated health check fails (because of the former)
  • Our proxy won’t send requests to your app if that happens, but it’ll retry for a while in case it becomes healthy.
  • After a few seconds, the app appears to be listening again on the correct port and accepts the connection (passes the health check) and our proxy sends a request to it (which appears to succeed)
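The sequence in the list above can be sketched as a retry loop. This is only an illustration of the idea, not Fly.io’s actual proxy code: `wait_until_healthy`, `probe`, and the timeout and interval values are hypothetical names and numbers picked for the example.

```python
import time
from typing import Callable


def wait_until_healthy(
    probe: Callable[[], bool],
    timeout: float = 10.0,
    interval: float = 0.5,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call probe() until it returns True and return the number of attempts.

    Raises TimeoutError if the next retry would fall after the deadline.
    """
    deadline = clock() + timeout
    attempts = 0
    while True:
        attempts += 1
        if probe():
            # The app is accepting connections again; traffic can resume.
            return attempts
        if clock() + interval > deadline:
            raise TimeoutError(f"still unhealthy after {attempts} attempts")
        sleep(interval)
```

With a probe that fails while the server is shut down and succeeds once it is listening again, the caller waits for however long the app stays unhealthy plus up to one retry interval.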

No problem!

Ok, thanks so much for this. Now as I try to troubleshoot, when I try to flyctl proxy my database, I get “host was not found in DNS”, even after restarting the database.

Ok, the restart was a red herring. According to the developer:

This is by design. listmonk has to reload settings and reinitialize a number of controllers and clients, which is best done with a restart.

The delay, it must be a Fly.io thing. Best to ask them. The restart otherwise is instant and seamless.

So any idea how I might discover what’s taking the app so long to restart?

Actually, it looks like the app is restarting very quickly (in the logs I posted above, the total time between “reloading on signal” and “http server started” is about 300 ms). But in the Fly.io instance of the app, the page doesn’t complete its refresh for 10 seconds.
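For anyone who wants to check that figure, here is a small sketch that computes the restart time from two of the log lines posted earlier in the thread. The helper names `parse_timestamp` and `restart_ms` are made up for this example; the log lines themselves are copied from the first restart in the logs above.

```python
from datetime import datetime
from typing import Sequence

# Two lines copied from the Fly.io logs posted earlier in this thread.
LOG_LINES = [
    "2022-10-07T21:15:45.006 app[1d886281] lax [info] 2022/10/07 21:15:45 init.go:736: reloading on signal ...",
    "2022-10-07T21:15:45.304 app[1d886281] lax [info] ⇨ http server started on [::]:9000",
]


def parse_timestamp(line: str) -> datetime:
    """Parse the leading timestamp, which is the first whitespace-separated field."""
    return datetime.strptime(line.split()[0], "%Y-%m-%dT%H:%M:%S.%f")


def restart_ms(lines: Sequence[str]) -> int:
    """Milliseconds between 'reloading on signal' and 'http server started'."""
    start = next(parse_timestamp(l) for l in lines if "reloading on signal" in l)
    end = next(parse_timestamp(l) for l in lines if "http server started" in l)
    return round((end - start).total_seconds() * 1000)


print(restart_ms(LOG_LINES))  # prints 298
```

The second restart in the same logs (21:19:37.308 to 21:19:37.615) works out to a similar figure, so the reload itself is not where the 10 seconds go.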

It looks like it actually has something to do with a health process in the app that is taking 10 seconds to respond, whereas in the Railway instance, the same process is responding nearly immediately with an error, instead. So… nothing to do with Fly.io!

That’s odd!

How have you tested this? If you ssh into your instance, you should be able to curl the service.

Can you try removing all health checks from your fly.toml and see if that helps?

Hmm, does that mean every time you save the settings, it restarts the app? I can see that being simpler for listmonk! Normally this wouldn’t be a problem, but it looks like the way we do health checks (via Consul) breaks it. Turning them off might help.

Ok, it may still be a Fly.io issue after all. I think something about the Fly instance is hanging up listmonk’s call to its own api (which is what that health item is).

I said that the Railway instance was just immediately returning an error for that health process. But it turns out it initially returns an error and then a 200 immediately after, whereas the Fly.io instance keeps trying health over and over (but with no errors) for 10 seconds (see screen caps below).

This is what I see when I save the listmonk settings with Chrome dev tools > Network open. Fly instance, sorted by response time:

And Railway instance, sorted by timeline order:

Railway initially gets an error but then immediately after gets a response from health.

For some reason, I don’t have curl when I ssh: /bin/sh: curl: not found.

But health is just an api at <my app's url>/api/health. When I visited that address in my browser, it responded in 33ms.

No change when disabling health checks and redeploying.

So it seems like what I need to hunt down is this: On Fly, the listmonk app’s call to its own /api/health address is hanging, whereas it doesn’t have that problem on Railway.
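Since curl isn’t available inside the container, one way to collect more than a single browser measurement is a short script run from any machine that has Python. This is a rough sketch under assumptions: `time_get` and `sample` are hypothetical helper names, the URL is supplied on the command line, it assumes the endpoint answers without authentication, and it only measures the endpoint from wherever the script runs, not from inside the app.

```python
import sys
import time
import urllib.request


def time_get(url, timeout=15.0, opener=urllib.request.urlopen):
    """Return (HTTP status, elapsed milliseconds) for a single GET request."""
    start = time.monotonic()
    with opener(url, timeout=timeout) as resp:
        resp.read()  # drain the body so the timing covers the full response
        status = resp.status
    return status, (time.monotonic() - start) * 1000.0


def sample(url, count=20, pause=0.5):
    """Time `count` requests in a row and print one line per attempt."""
    for i in range(1, count + 1):
        try:
            status, ms = time_get(url)
            print(f"{i:2d}: HTTP {status} in {ms:.0f} ms")
        except OSError as exc:  # URLError, HTTPError and socket timeouts are OSErrors
            print(f"{i:2d}: failed: {exc}")
        time.sleep(pause)


# Only runs when a URL is passed as the first command-line argument.
if __name__ == "__main__" and len(sys.argv) > 1:
    sample(sys.argv[1])
```

Starting it just before saving settings in listmonk, with the app’s own /api/health address as the argument, should show whether requests fail, hang, or stay fast during the reload.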

Can you leave the health checks off for now? I’d like to test a few things.

Looks like I can hit /health (without the /api prefix) without a username / password to test. It’s fast for me right now.

I redeployed with health checks enabled about 10 minutes ago. Let me disable them again and deploy…

…ok done.