In the machine logs, once the second machine is auto-stopped, I see:
... reboot: Restarting system
... Health check on port 8108 has failed. Your app is not responding properly. Services exposed on ports [80, 443] will have intermittent failures until the health check passes.
Is this to be expected? I’m surprised that 1) the system restarts and 2) health checks continue after auto-stopping. Any explanation or insight would be appreciated!
Go back a bit further in your app logs to see what’s happening when the Machines stop. If the Machine is being stopped by the auto stop feature, then you’ll see a message about “downscaling” from 2 Machines to 1. If you don’t see that log, then something in your app is stopping the Machine.
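A quick way to check for that message is to filter the log stream. This is only a sketch: it assumes the proxy lines look like "Downscaling app … in region … from N machines to M machines", and `downscale_events` is a made-up helper name, not a flyctl feature.

```shell
# Print only the proxy's auto-stop ("downscaling") events from log text on stdin.
# The pattern is an assumption about how the proxy words these lines.
downscale_events() {
  grep -E 'Downscaling app [^ ]+ in region [^ ]+ from [0-9]+ machines to [0-9]+ machines'
}

# Example (assumes flyctl is installed and logged in; <app-name> is a placeholder):
#   fly logs -a <app-name> | downscale_events
```

If nothing is printed around the time a Machine stopped, the stop probably did not come from the auto stop feature.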
Also, if you have min_machines_running = 1, then the second Machine (or the Machine in the primary region, if you have Machines in different regions) should never be stopped by the auto stop feature.
The node might be failing because there aren’t enough nodes? Typesense recommends either single node or 3-5 nodes for HA. That could also be the issue with running 2 Machines in this case.
I think you’re right, typesense is expecting to connect the two machine instances as nodes.
My intent is not to run an HA cluster, but just to have a single instance of typesense running, with a backup machine available to start in a different region in case the primary machine goes offline. (It’s not a complex typesense database, just populated by a scraper for a small website that runs intermittently and on machine startup.)
I might have this wrong, but I thought that the two machines would be running separate instances of typesense, and since I didn’t specify any nodes in the typesense env, I assumed the two instances of typesense should have no “knowledge” of each other. So I’m surprised that the second one (and only the second one) is attempting to be a node in a cluster.
Although I can’t find anything about disabling the cluster function, there’s an option to “Force one of the nodes to become a single-node cluster by editing its nodes file to contain just its own IP address”.
I could create that nodes file in my Dockerfile, but I’d need to programmatically create a unique file in each machine that specifies that machine’s unique internal IP address as its only node. Is there a way to do that on deploy? Get the fly builder to return the internal IP address of the particular machine it’s building to populate typesense’s nodes file?
Yes, exactly, and populating the data takes only about 90 seconds in a GitHub Actions run.
I don’t think there was anything between them that one particular time. But now there is. Here’s the most recent log. It looks like typesense is restarting itself immediately, or was never killed:
2023-08-28T09:18:55.107 proxy <MACHINE_ID> <REGION> [info] Downscaling app typesense-nik9fiyd in region <REGION> from 1 machines to 0 machines. Automatically stopping machine MACHINE_ID
2023-08-28T09:18:55.112 app<MACHINE_ID> <REGION> [info] INFO Sending signal SIGINT to main child process w/ PID 255
2023-08-28T09:19:00.124 app<MACHINE_ID> <REGION> [info] INFO Sending signal SIGTERM to main child process w/ PID 255
2023-08-28T09:19:00.665 app<MACHINE_ID> <REGION> [info] INFO Main child exited with signal (with signal 'SIGTERM', core dumped? false)
2023-08-28T09:19:00.665 app<MACHINE_ID> <REGION> [info] INFO Starting clean up.
2023-08-28T09:19:00.666 app<MACHINE_ID> <REGION> [info] WARN hallpass exited, pid: 256, status: signal: 15 (SIGTERM)
2023-08-28T09:19:00.666 app<MACHINE_ID> <REGION> [info] 2023/08/28 09:19:00 listening on <IPV6_ADDRESS>:22 (DNS: [fdaa::3]:53)
2023-08-28T09:19:01.424 app<MACHINE_ID> <REGION> [info] I20230828 09:16:56.409983 348 raft_server.cpp:564] Term: 2, last_index index: 3, committed_index: 3, known_applied_index: 3, applying_index: 0, queued_writes: 0, pending_queue_size: 0, local_sequence: 8
2023-08-28T09:19:01.424 app<MACHINE_ID> <REGION> [info] I20230828 09:16:56.410104 363 raft_server.h:60] Peer refresh succeeded!
... <similar typesense messages> ...
2023-08-28T09:19:01.424 app<MACHINE_ID> <REGION> [info] I20230828 09:18:56.423753 363 raft_server.h:60] Peer refresh succeeded!
2023-08-28T09:19:01.424 app<MACHINE_ID> <REGION> [info] I20230828 09:19:00.660526 261 typesense_server_utils.cpp:53] Stopping Typesense server...
2023-08-28T09:19:01.424 app<MACHINE_ID> <REGION> [info] I20230828 09:19:01.424450 348 typesense_server_utils.cpp:314] Typesense peering service is going to quit.
2023-08-28T09:19:01.424 app<MACHINE_ID> <REGION> [info] I20230828 09:19:01.424505 348 raft_server.cpp:829] Set shutting_down = true
2023-08-28T09:19:01.661 app<MACHINE_ID> <REGION> [info] [ 375.713067] reboot: Restarting system
2023-08-28T09:19:01.678 app<MACHINE_ID> <REGION> [info] I20
2023-08-28T09:19:09.756 health<MACHINE_ID> <REGION> [error] Health check on port 8108 has failed. Your app is not responding properly. Services exposed on ports [80, 443] will have intermittent failures until the health check passes.
I’m using the “gross” bash method of running multiple processes in order to run typesense plus the startup scripts. Maybe that’s preventing properly killing typesense? Startup script looks like this:
set -m # turn on bash's job control
/opt/typesense-server --data-dir /data --api-key $<API_KEY> --enable-cors &
sleep 1 &&
<curl command to add API key to typesense>
<curl command to add API key to typesense>
<curl command to run github workflow to populate typesense db>
and fly.toml has kill_signal = "SIGINT" and kill_timeout = 5
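On whether the background-job approach could get in the way of a clean shutdown: below is a minimal sketch of one alternative, where the wrapper script forwards the stop signal to the server itself and waits for it to exit. `run_supervised` and `setup` are hypothetical names, and nothing here comes from the Fly or Typesense docs. It is just the usual trap-and-wait pattern.

```shell
# run_supervised SETUP_FN CMD [ARGS...]
# Starts CMD in the background, runs SETUP_FN once, then blocks until CMD exits.
# SIGINT/SIGTERM sent to this script are passed on to CMD as SIGTERM.
run_supervised() {
  local setup_fn=$1
  shift
  "$@" &                                   # start the server in the background
  local server_pid=$!
  # Forward stop signals to the server instead of leaving it running.
  trap "kill -TERM $server_pid 2>/dev/null" INT TERM
  "$setup_fn"                              # one-off setup (e.g. the curl calls)
  local status=0
  # wait returns early when a trapped signal arrives, so keep waiting
  # until the server process is really gone.
  while kill -0 "$server_pid" 2>/dev/null; do
    wait "$server_pid"
    status=$?
  done
  trap - INT TERM
  return "$status"
}

# Example for a start.sh (flags copied from the script above; setup is a placeholder):
#   setup() { sleep 1; <curl commands>; }
#   run_supervised setup /opt/typesense-server --data-dir /data --api-key ... --enable-cors
```

With this shape the script stays in the foreground as the main process and only exits once the server has exited.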
The logs seem to show that the Fly Proxy starts the downscaling, then there are a few more peer refreshes, and then typesense seems to stop and shut down gracefully?
So the unexpected behaviour is the empty/cut off “I20” log and the health check failure (which is a full 8 seconds later!).
If you test things out and find the second Machine stays stopped after that last health check log, and then successfully starts up when you destroy the first Machine, then you should be good. If not, we can dig deeper.
Last thing: You could try setting the kill_timeout to something just a little higher, in case the flyd shutdown is interrupting the graceful typesense shutdown.
The second machine doesn’t start up when the first machine is destroyed or stopped. Nor vice versa. So it looks like there’s a problem with typesense gracefully shutting down.
I went all the way up to 60 for kill_timeout but got the same results: health check failure 12 seconds later.
Incidentally, the typesense node-related logs are back.
Could the ungraceful shutdown be related to using my start.sh from the end of this post?
Or perhaps my idea for (hopefully) forcing each typesense instance to think of itself as a single node might work, if you have suggestions for getting the machine’s unique IP address into start.sh? (See below)
I checked internally about the health check that fails after the Machine is stopped. Since we don’t see any more health check failures after that one, and the Machine does stop, I think it’s okay. The health check and the Machine stop are handled by different parts of the system, so there can be a short delay between them. It makes sense that the last health check would fail, since the Machine is stopped!
re: backup Machine
I might have used some confusing wording about when the second Machine should start in this config. Let’s call your Machine in the primary region Machine 1 and the “backup” Machine in another region Machine 2.
Machine 1 will never be automatically stopped (because of min_machines_running = 1).
Machine 2 will start if you start it manually (after which it will “downscale” automatically).
If you just manually stop Machine 1, then on the next request, it will probably be Machine 1 that starts up again. If you actually destroy, not just stop, Machine 1, then Machine 2 will start when the next request is received by the app. Then we can confirm whether the app keeps working or not. (I know you tried destroying before, but I’m not sure if you tried connecting to the app?)
Is your app working as expected now despite the log messages?
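Since a stopped Machine only starts when a request arrives, one way to test this is to send requests and poll until the app answers. A rough sketch follows. It relies on Typesense's /health endpoint returning {"ok":true}; `wait_until_healthy` is a hypothetical helper, and the URL in the example is a placeholder.

```shell
# wait_until_healthy URL [TRIES] [DELAY_SECONDS]
# Requests URL/health until the body contains "ok":true or TRIES runs out.
wait_until_healthy() {
  local url=$1 tries=${2:-10} delay=${3:-3}
  local i=1 body
  while [ "$i" -le "$tries" ]; do
    # --max-time bounds each attempt; the first one may be slow while a Machine boots.
    body=$(curl --silent --max-time 10 "$url/health") || body=""
    case $body in
      *'"ok":true'*)
        echo "healthy after $i attempt(s)"
        return 0
        ;;
    esac
    i=$((i + 1))
    sleep "$delay"
  done
  echo "not healthy after $tries attempt(s)" >&2
  return 1
}

# Example (placeholder URL): wait_until_healthy "https://<app-name>.fly.dev" 10 3
```

Running it right after destroying Machine 1 should show whether Machine 2 gets started by the incoming requests.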
Yes! Thanks for your detailed reply. The info that the app needed to actually receive a request to start up was the missing piece I needed. In that case, everything seems to be working as expected. Thanks!
I wonder if Fly Proxy is load balancing and sending requests to your second Machine when users are closest to that region? I don’t think it would do that, but I’m not totally clear on how load balancing and auto start / stop interact or if one overrides the other. I will look into it though!
Thanks, I’ll look into that. In the meantime, I continued seeing issues on Machine 2 with typesense attempting to be part of an HA node cluster. I’m considering that as a possible cause of all the restarts.
So to test that theory, I want to configure each machine as a single-node cluster so each instance (really just Machine 2, which is the one with the problem) is satisfied that it’s its own node.
To do that, I need to enter a single IP address in a file referenced by the entrypoint command. The IP entry in /etc/hosts under # Private address for this instance seems to work fine. (Using fly-global-services or fly-local-6pn doesn’t work).
That private address is unique to each machine, which is what I want. Is there a way to access that address other than doing a grep of /etc/hosts to retrieve it?
I figured it out. I don’t have to do a kludgy grep of /etc/hosts, I can simply cat /etc/hostname within the VM to get the machine’s unique private hostname (which is the machine’s id).
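For anyone trying the same thing, here is a sketch of how start.sh could build that file at boot. It assumes the Typesense nodes file format of host:peering_port:api_port with the default ports 8107 and 8108, and `write_single_node_file` plus the paths in the example are made up for illustration.

```shell
# write_single_node_file HOSTNAME_FILE NODES_FILE [PEERING_PORT] [API_PORT]
# Writes a one-entry nodes file that points at this machine only.
write_single_node_file() {
  local hostname_file=$1 nodes_file=$2
  local peering_port=${3:-8107} api_port=${4:-8108}
  local host
  # Strip the trailing newline (and any stray whitespace) from the hostname.
  host=$(tr -d '[:space:]' < "$hostname_file") || return 1
  [ -n "$host" ] || { echo "empty hostname in $hostname_file" >&2; return 1; }
  printf '%s:%s:%s\n' "$host" "$peering_port" "$api_port" > "$nodes_file"
}

# Example for start.sh (paths are illustrative):
#   write_single_node_file /etc/hostname /data/nodes
#   /opt/typesense-server --data-dir /data --nodes /data/nodes <other flags>
```

Doing this at startup rather than in the Dockerfile matters because the hostname is only known once the machine is running.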
But setting up each machine as a single-node cluster didn’t help with the restart issues.
On top of that, it seems that having two machines that aren’t connected as an HA cluster causes intermittent problems where the typesense search database sometimes isn’t populated after a re-deploy. Perhaps it has something to do with routing issues that are beyond me to figure out.
At this point, I’m going to abandon the cost-saving plan of having a cloned machine on standby, and instead either risk using a single machine and occasional outages, or run 3 machines and do a legit HA cluster.
If you are interested in seeing how I ended up setting it up, let me know. It’s a hacked-together solution based on the Fly NATS cluster. I don’t know Go, so needless to say it could probably be done better, haha.