I’m facing a strange issue with my LiveView app: I just did a fresh staging deploy and immediately afterwards got the “Error 1 Undocumented” error. I’ve seen this attributed to hitting a concurrency limit (e.g. proxy logs with error.message="Undocumented"), but the metrics on the Fly dashboard say I have 1 connection to staging.
Here are my staging logs:
2021-11-23T16:47:29.645 proxy[ab37eb8b] chi [error] Error 1: Undocumented
2021-11-23T16:47:30.837 app[72d6c75a] ord [info] 16:47:30.836 request_id=Fro6yVG39f9AWpsAAANh [info] GET /campaigns
2021-11-23T16:47:30.910 app[72d6c75a] ord [info] 16:47:30.909 request_id=Fro6yVG39f9AWpsAAANh [info] Sent 200 in 73ms
2021-11-23T16:47:38.655 app[72d6c75a] ord [info] 16:47:38.655 [info] CONNECTED TO Phoenix.LiveView.Socket in 36µs
2021-11-23T16:47:38.655 app[72d6c75a] ord [info] Transport: :websocket
2021-11-23T16:47:38.655 app[72d6c75a] ord [info] Serializer: Phoenix.Socket.V2.JSONSerializer
2021-11-23T16:47:38.655 app[72d6c75a] ord [info] Parameters: %{"_csrf_token" => "DwkwHy4_KhpDOxUGFDEeDxYOaFtVUzZPzKzxatRMtHFPPVFYzyZ02jix", "_mounts" => "0", "vsn" => "2.0.0"}
I think this is related to the issue I had in production this morning, where my app was down after that error:
2021-11-23T14:27:24.637 proxy[2daf3929] chi [error] error.code=1 error.message="Undocumented" request.method="GET" request.url="/live/websocket?_csrf_token=AUVZZzk0NQIkWzY8LytTLxcmOTkXc3Mtxr53ZgZjMjDSfMaxzwjRmG8N&_mounts=6&vsn=2.0.0" request.id="01FN6K2GN5H3NY674P1BAXJCFE" response.status=502
2021-11-23T14:32:45.052 proxy chi [error] error.code=1002 error.message="No suitable (healthy) instance found to handle request" request.method="GET" request.url="/live/websocket?_csrf_token=ZwE4CwFhExsnYUdsUmpVZUAABi5YQTBG6lyCJ8PHASr4f9x39FdDk3Ev&_mounts=40&vsn=2.0.0" request.id="01FN6KBFS9GB4EGKHK9H2JV5CS" response.status=502
2021-11-23T14:32:45.095 proxy chi [error] error.code=1002 error.message="No suitable (healthy) instance found to handle request" request.method="GET" request.url="/live/websocket?_csrf_token=P3UdGAobIj0PYEEcewoEYyo8GS5XXy8GFBqLiHMUfQ3s2l64GmJE-kde&_mounts=7&vsn=2.0.0" request.id="01FN6KBFTDTJ1TE0ZFPJD0CFEX" response.status=502
Production is back up and working fine; there hasn’t been another deploy or any restarts, but I no longer get any logs from the instance (2daf3929), even though the app is responding to traffic.
This all happened after what I thought was a trivial change and deploy last night. Was there downtime or an incident on Fly’s end, or am I hitting a bug in my own code or config?
What does fly status show for the app? It’s useful to know how many instances are running and where they are.
In this particular case, it seems possible that the connection was cut (error.code=1 is the umbrella/catch-all error) as the app went down or was terminated for whatever reason (fly checks list should tell you which health checks are failing). There were then no healthy instances available for the next couple of requests (error.code=1002).
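For reference, the two diagnostic commands mentioned here look like this (the app name is a placeholder, not from this thread):

```shell
# Show how many instances are running, in which regions, and their health.
fly status --app my-staging-app

# List the health checks and their pass/fail state per instance.
fly checks list --app my-staging-app
```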
Instances
ID PROCESS VERSION REGION DESIRED STATUS HEALTH CHECKS RESTARTS CREATED
2daf3929 app 122 yyz run running 1 total, 1 passing 0 15h15m ago
The uptime corresponds with last night’s deploy, and there are zero restarts. The app is up (e.g. when I open it in a browser), but I’m no longer getting any recent logs from flyctl logs, and my staging deploy test indicates I should be seeing normal logs.
I’ve deployed 3 Phoenix apps to Fly, and they all hit the “undocumented” error constantly. Today I figured out how to reproduce the issue consistently. I believe it affects any LiveView websocket, but I’ve been reproducing it with Live Dashboard. These steps reproduce the problem:
1. Generate an empty Phoenix app using mix phx.new.
2. The generated router.ex mounts Live Dashboard only in the development environment. Update this file so the dashboard is mounted in all environments.
3. Deploy to Fly.
4. Visit the Live Dashboard in Google Chrome (probably any browser will do). Leave this browser window open during the next step.
5. Restart the Fly instance using fly restart or by deploying.
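The router change described above is roughly the following (a sketch assuming the default phx.new layout of the time; MyAppWeb is a placeholder module name):

```elixir
# router.ex: the generator wraps this scope in a dev-only guard
# (e.g. `if Mix.env() == :dev do ... end`). Removing that guard
# mounts Live Dashboard in all environments, including production.
import Phoenix.LiveDashboard.Router

scope "/" do
  pipe_through :browser
  live_dashboard "/dashboard", metrics: MyAppWeb.Telemetry
end
```

Note that exposing the dashboard in production like this is only for reproducing the bug; you’d normally keep it behind authentication.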
An error like this will show up in the logs:
2021-12-16T21:48:31.046 runner[9c376926] iad [info] Shutting down virtual machine
2021-12-16T21:48:31.179 app[9c376926] iad [info] Sending signal SIGTERM to main child process w/ PID 510
2021-12-16T21:48:31.179 app[9c376926] iad [info] 21:48:31.178 [notice] SIGTERM received - shutting down
2021-12-16T21:48:37.123 proxy[9c376926] iad [error] error.code=2000 error.message="problem connecting to app instance"
2021-12-16T21:48:37.160 proxy[9c376926] lax [error] error.code=1 error.message="Undocumented" request.method="GET" request.url="/live/websocket?_csrf_token=KiZXUS8fBgsUICoFISohDyARehoQQxAjgI09XQKXuiPVBeUYJc2HQ9us&&vsn=2.0.0" request.id="01FQ2KFV6HG9D1DD4168H6TGCG" response.status=502
2021-12-16T21:48:37.185 app[9c376926] iad [info] Reaped child process with pid: 570 and signal: SIGUSR1, core dumped? false
2021-12-16T21:48:38.080 proxy[9c376926] iad [error] error.code=2000 error.message="problem connecting to app instance"
2021-12-16T21:48:38.117 proxy[9c376926] lax [error] error.code=1 error.message="Undocumented" request.method="GET" request.url="/live/websocket?_csrf_token=KiZXUS8fBgsUICoFISohDyARehoQQxAjgI09XQKXuiPVBeUYJc2HQ9us&&vsn=2.0.0" request.id="01FQ2KFW4CZCHR1G0QP6QNMPH9" response.status=502
2021-12-16T21:48:38.187 app[9c376926] iad [info] Main child exited normally with code: 0
2021-12-16T21:48:38.188 app[9c376926] iad [info] Starting clean up.
2021-12-16T21:48:39.013 proxy[9c376926] iad [error] error.code=2 error.message="Internal problem"
2021-12-16T21:48:39.048 proxy[9c376926] lax [error] error.code=1 error.message="Undocumented" request.method="GET" request.url="/live/websocket?_csrf_token=KiZXUS8fBgsUICoFISohDyARehoQQxAjgI09XQKXuiPVBeUYJc2HQ9us&&vsn=2.0.0" request.id="01FQ2KFX1KJ3GWDT2TFYT6NBPZ" response.status=502
2021-12-16T21:48:39.994 proxy[9c376926] iad [error] error.code=2 error.message="Internal problem"
2021-12-16T21:48:40.027 proxy[9c376926] lax [error] error.code=1 error.message="Undocumented" request.method="GET" request.url="/live/websocket?_csrf_token=KiZXUS8fBgsUICoFISohDyARehoQQxAjgI09XQKXuiPVBeUYJc2HQ9us&&vsn=2.0.0" request.id="01FQ2KFY098B6Z2JH55M41GF2H" response.status=502
2021-12-16T21:48:41.027 proxy[9c376926] iad [error] error.code=2 error.message="Internal problem"
2021-12-16T21:48:41.065 proxy[9c376926] lax [error] error.code=1 error.message="Undocumented" request.method="GET" request.url="/live/websocket?_csrf_token=KiZXUS8fBgsUICoFISohDyARehoQQxAjgI09XQKXuiPVBeUYJc2HQ9us&&vsn=2.0.0" request.id="01FQ2KFZ0GMT6S1T06Z9A3KGRJ" response.status=502
2021-12-16T21:48:42.108 proxy[9c376926] iad [error] error.code=2 error.message="Internal problem"
2021-12-16T21:48:42.141 proxy[9c376926] lax [error] error.code=1 error.message="Undocumented" request.method="GET" request.url="/live/websocket?_csrf_token=KiZXUS8fBgsUICoFISohDyARehoQQxAjgI09XQKXuiPVBeUYJc2HQ9us&&vsn=2.0.0" request.id="01FQ2KG0295BR1APN5M6FBC2J0" response.status=502
2021-12-16T21:48:43.439 proxy[9c376926] iad [error] error.code=2 error.message="Internal problem"
2021-12-16T21:48:43.474 proxy[9c376926] lax [error] error.code=1 error.message="Undocumented" request.method="GET" request.url="/live/websocket?_csrf_token=KiZXUS8fBgsUICoFISohDyARehoQQxAjgI09XQKXuiPVBeUYJc2HQ9us&&vsn=2.0.0" request.id="01FQ2KG1C0KXEP5AV5XK519WWJ" response.status=502
2021-12-16T21:48:43.678 runner[9c376926] iad [info] Starting instance
2021-12-16T21:48:43.719 runner[9c376926] iad [info] Configuring virtual machine
2021-12-16T21:48:43.720 runner[9c376926] iad [info] Pulling container image
2021-12-16T21:48:43.924 runner[9c376926] iad [info] Unpacking image
2021-12-16T21:48:43.929 runner[9c376926] iad [info] Preparing kernel init
2021-12-16T21:48:44.432 runner[9c376926] iad [info] Configuring firecracker
2021-12-16T21:48:44.467 runner[9c376926] iad [info] Starting virtual machine
When the server restarts, the LiveView websocket disconnects. LiveView automatically tries to reconnect, but for some reason that request gets routed to lax and fails with the “undocumented” error. I don’t have any instances running in lax, so it makes sense that the request fails. But why is Fly routing requests to a region where I don’t have instances? This is very concerning.
When you deploy, our edges are sometimes about 2 minutes slow to detect the new VMs (and to notice that the old one is gone). For apps running a single instance, this causes errors during deploys.
The messages you’re seeing are from the lax edge and the iad worker. They come in pairs: a connection hits lax, lax forwards it to where the now-terminated VM was (iad, in this case), and then both ends of the proxy emit an error.
We’re working on this; for now, the workaround is to scale to ~3 VMs.
Thanks! I’m glad to hear that you’re working on the problem.
Heads up that I’ve tried scaling to 3 instances (fly scale count 3), waiting until all health checks passed, waiting another 5 minutes, and then deploying. I still saw the bug. I repeated this with 6 instances and saw the bug again.
Is there a way to configure the rolling restart to wait longer between restarts? Or is there any other way to work around this issue in the meantime? Thanks.
Those errors may not actually be user-facing. When there is more than one VM available, we retry against another one. You’ll see an error in the region the VM is running in, and then the edge proxy will connect to a different VM.
If you look at the metrics on fly.io/apps/&lt;name&gt;/metrics and don’t see 502s, everything is working as it should.
You can’t control the delay between rolling restarts directly. You can trap the SIGTERM and wait up to 5 minutes before exiting, though, which might help.
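Trapping the shutdown signal and delaying exit can be sketched like this in Elixir, by replacing OTP’s default signal handler. This is an illustrative sketch, not code from this thread: the module name and the 30-second delay are assumptions, and it relies on the standard :erl_signal_server event manager from Erlang’s kernel application.

```elixir
# Sketch: delay shutdown after SIGTERM so in-flight LiveView connections
# can drain before the VM exits. Installs a custom :gen_event handler in
# place of the default :erl_signal_handler.
defmodule MyApp.DelayedShutdown do
  @behaviour :gen_event

  @impl true
  def init(_args), do: {:ok, %{}}

  @impl true
  def handle_event(:sigterm, state) do
    # Wait before stopping; keep this under the platform's kill timeout.
    Process.sleep(:timer.seconds(30))
    :init.stop()
    {:ok, state}
  end

  def handle_event(_signal, state), do: {:ok, state}

  @impl true
  def handle_call(_request, state), do: {:ok, :ok, state}
end

# Somewhere in Application.start/2, swap out the default handler:
# :gen_event.swap_sup_handler(:erl_signal_server,
#   {:erl_signal_handler, []}, {MyApp.DelayedShutdown, []})
```

With the default handler swapped out, SIGTERM no longer triggers an immediate :init.stop/0, so the app keeps serving until the delay elapses or the platform hard-kills the VM.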