I am currently running an Elixir Phoenix application in production on Fly, and I have started encountering issues: first database outages (likely caused by the Alpine base image I used), and now, after an update, the proxy not being able to route requests to the instances.
Can anyone from Fly staff kindly help me diagnose this issue? I am willing to accept a reasonable amount of downtime since I run the app as a single instance, but a 10-minute outage every other day seems more than a little excessive.
The errors:
error.message="could not find an instance to route to" 2022-12-21T13:59:12Z proxy fra [error]request.method="POST" request.url="<HOST>/api" request.id="01GMTFKP7F5QACA4XH7KE7K8NK-fra" response.status=503
It is also worth noting that the application's memory (RAM) usage did not show any irregularity prior to the outage.
This usually happens when an instance is unhealthy. If you run fly status --all you can see a list of the VMs it might have been trying to use. fly vm status <id> may show health checks failing.
It’s also possible the process was just failing to accept connections from our proxy. Our proxy temporarily marks instances “down” when it can’t connect to them.
Running a single instance increases the risk of downtime, both from changes in our infrastructure and from app issues themselves. Outages every day are most likely an app issue; running 2+ instances will let our proxy route around an unhealthy VM.
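To make the routing behaviour described above concrete, here is a toy model in Python. It is only a sketch and not Fly's actual proxy code: the ToyProxy class, the cooldown value, and the VM names are all hypothetical. It shows why a single instance that refuses connections leaves nothing to route to, while a second instance lets requests through.

```python
import time


class ToyProxy:
    """Toy model of a proxy that temporarily marks unreachable instances down."""

    def __init__(self, instances, cooldown_seconds=5.0, clock=time.monotonic):
        self.instances = list(instances)
        self.cooldown_seconds = cooldown_seconds  # hypothetical value
        self.clock = clock
        self.down_until = {}  # instance id -> time at which it may be retried

    def mark_down(self, instance):
        # Skip this instance until the cooldown has elapsed.
        self.down_until[instance] = self.clock() + self.cooldown_seconds

    def candidates(self):
        # Instances that are not currently marked down.
        now = self.clock()
        return [i for i in self.instances if self.down_until.get(i, 0.0) <= now]

    def route(self, can_connect):
        """Return the first instance that accepts a connection, else None (a 503)."""
        for instance in self.candidates():
            if can_connect(instance):
                return instance
            self.mark_down(instance)
        return None


# One VM that refuses connections: nothing to route to.
print(ToyProxy(["vm-a"]).route(lambda vm: False))  # None

# Two VMs, one healthy: the request is routed around the broken one.
print(ToyProxy(["vm-a", "vm-b"]).route(lambda vm: vm == "vm-b"))  # vm-b
```

In this model the None result corresponds to the "could not find an instance to route to" 503 from the logs above.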
As I said before, I do not expect 2+ nines of uptime on a single-instance deployment, and unfortunately I cannot increase the number of instances without first modifying the application's source code (which is planned but will take some time).
That I believe, but the question is why. My first reaction was to try to open an SSH console into the application, and that was not possible. Similarly, the application stopped producing any logs. Furthermore, when I used flyctl to restart the application, nothing happened, which leads me to believe that the issue is not with the application but rather with the infrastructure.
Yes, I am sure it will show the health check failing; however, the reason why it is failing is what interests me, and I do not have access to any information that would help me there.
I am unfortunately going to be forced to migrate away from Fly and revisit it in the future when the platform is more mature.
Hey Michael – if you couldn’t SSH in and logs stopped, it’s actually a sign the app process failed. You should definitely dig through fly status --all and look for instances with restarts.
Hello Kurt,
you keep pointing me towards that command, which is nice to have in case of future crashes but does nothing to help me prevent the situation in the future.
I have to strongly disagree with this statement. I am assuming that by the app process you mean my (Elixir) application and not the Firecracker process. This is definitely wrong, as it would not explain the lack of logs; on the contrary, it should cause the micro-VM to go into a crash loop and produce a large amount of logs and stack traces, which has not happened.
This is further backed by the fact that your logs normally show a startup message with a runner label, such as:
2022-12-22T21:43:48Z runner[46eb4xxx] fra [info]Starting instance
which shows that there should be at least some log messages even if the application process were failing completely silently (which, as far as I know, is not possible in this case).
My personal bet is that you have some networking issues, judging by the no-route error, and I recommend you investigate further.