This is a really odd issue.
Context:
We used to connect out to an external redis instance hosted on ComposeDB.
We migrated away from that setup and the env vars providing the ComposeDB connection string were deleted. Having completed the migration, a couple of days ago I deleted our old redis instance on ComposeDB.
Then on 27 Feb, something started spewing `getaddrinfo ENOTFOUND our-old-instance-id.composedb.com` errors into Sentry.
Naturally my first instinct was that we still had some code trying to call out to that redis instance on Compose that we had not cleaned up, but the connection string didn't exist anywhere in our code or the production env vars. So… there must be something else somewhere with our Sentry DSN and old Redis URL. Maybe there's an old server out there that hasn't been shut down from before we migrated to Fly? Had all our secrets been stolen somehow?
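For context on how a single stale box could generate this particular error, here's a rough sketch of the pattern I assume the old deployment still has baked in. I'm guessing at an ioredis-style client and at the env var name; none of this is lifted from our actual code:

```ts
// Sketch only: a Redis client pointed at the now-deleted ComposeDB hostname.
// OLD_REDIS_URL is a stand-in for whatever env var the old image still carries.
import Redis from "ioredis";
import * as Sentry from "@sentry/node";

Sentry.init({ dsn: process.env.SENTRY_DSN });

const redis = new Redis(process.env.OLD_REDIS_URL!, {
  // ioredis retries forever by default, so once the hostname stops resolving,
  // every attempt fails with getaddrinfo ENOTFOUND
  retryStrategy: (times) => Math.min(times * 50, 2000),
});

redis.on("error", (err) => {
  // an error handler like this would turn each failed retry into a Sentry event
  Sentry.captureException(err);
});
```

A client stuck in a retry loop like that would happily produce tens of thousands of events over a couple of days, which matches the volumes described below.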
Our current docker image uses node v16.19.1 and all our instances have been deployed pretty recently.
Further investigation revealed that all the `getaddrinfo ENOTFOUND our-old-instance-id.composedb.com` errors I could see in Sentry were coming from a server running node v16.18.0, which started up at 2022-10-28T02:03:41.040Z (4 months before this event). It reports its server name as 53b890db.
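(In case it's useful to anyone reading: my understanding is that the Sentry Node SDK fills in server_name from os.hostname() when you don't set it yourself, and inside our Fly VMs the hostname appears to be the short allocation id, which is why a Sentry server name can be matched against a Fly VM id at all. Roughly this, which is SDK default behaviour rather than anything in our own code:)

```ts
// Rough sketch of where (I believe) the server_name on our events comes from.
import * as os from "os";
import * as Sentry from "@sentry/node";

Sentry.init({
  dsn: process.env.SENTRY_DSN,
  // if serverName isn't provided, @sentry/node falls back to os.hostname(),
  // which inside our Fly VMs seems to be the short allocation id (e.g. 53b890db)
  serverName: os.hostname(),
});
```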
So I ran `flyctl status`. 53b890db is not one of our currently running instances. Then I ran `flyctl vm status 53b890db`:
Instance
ID = 53b890db
Process = app
Version = 669
Region = ewr
Desired = stop
Status = failed
Health Checks = 1 total, 1 passing
Restarts = 18
Created = 2022-10-25T19:57:18Z
Events
TIMESTAMP TYPE MESSAGE
2022-10-25T19:57:12Z Received Task received by client
2022-10-25T19:57:12Z Task Setup Building Task Directory
2022-10-25T19:57:21Z Started Task started by client
2022-10-26T01:49:45Z Terminated OOM Killed
2022-10-26T01:49:45Z Restarting Task restarting in 1.075033318s
2022-10-26T01:49:51Z Started Task started by client
2022-10-26T03:18:01Z Terminated OOM Killed
2022-10-26T03:18:01Z Restarting Task restarting in 1.06766741s
2022-10-26T03:18:07Z Started Task started by client
2022-10-26T08:48:43Z Terminated OOM Killed
2022-10-26T08:48:43Z Restarting Task restarting in 1.196959979s
2022-10-26T08:48:49Z Started Task started by client
2022-10-26T11:21:07Z Terminated OOM Killed
2022-10-26T11:21:07Z Restarting Task restarting in 1.093784231s
2022-10-26T11:21:13Z Started Task started by client
2022-10-26T12:17:51Z Terminated OOM Killed
2022-10-26T12:17:51Z Restarting Task restarting in 1.002185957s
2022-10-26T12:17:58Z Started Task started by client
2022-10-26T19:04:46Z Terminated OOM Killed
2022-10-26T19:04:46Z Restarting Task restarting in 1.249096692s
2022-10-26T19:04:52Z Started Task started by client
2022-10-26T20:23:28Z Terminated OOM Killed
2022-10-26T20:23:28Z Restarting Task restarting in 1.248760006s
2022-10-26T20:23:34Z Started Task started by client
2022-10-26T22:04:00Z Terminated OOM Killed
2022-10-26T22:04:00Z Restarting Task restarting in 1.050022376s
2022-10-26T22:04:07Z Started Task started by client
2022-10-27T01:28:15Z Terminated OOM Killed
2022-10-27T01:28:15Z Restarting Task restarting in 1.168919129s
2022-10-27T01:28:22Z Started Task started by client
2022-10-27T02:37:34Z Terminated OOM Killed
2022-10-27T02:37:34Z Restarting Task restarting in 1.24585659s
2022-10-27T02:37:40Z Started Task started by client
2022-10-27T03:07:24Z Terminated OOM Killed
2022-10-27T03:07:24Z Restarting Task restarting in 1.176336229s
2022-10-27T03:07:30Z Started Task started by client
2022-10-27T05:39:30Z Terminated OOM Killed
2022-10-27T05:39:30Z Restarting Task restarting in 1.159094445s
2022-10-27T05:39:36Z Started Task started by client
2022-10-27T06:33:34Z Terminated OOM Killed
2022-10-27T06:33:34Z Restarting Task restarting in 1.013122023s
2022-10-27T06:33:40Z Started Task started by client
2022-10-27T10:46:42Z Terminated OOM Killed
2022-10-27T10:46:42Z Restarting Task restarting in 1.197279832s
2022-10-27T10:46:49Z Started Task started by client
2022-10-27T13:16:33Z Terminated OOM Killed
2022-10-27T13:16:33Z Restarting Task restarting in 1.141440758s
2022-10-27T13:16:39Z Started Task started by client
2022-10-27T19:17:17Z Terminated OOM Killed
2022-10-27T19:17:17Z Restarting Task restarting in 1.005445972s
2022-10-27T19:17:25Z Started Task started by client
2022-10-27T20:03:57Z Terminated OOM Killed
2022-10-27T20:03:57Z Restarting Task restarting in 1.229871888s
2022-10-27T20:04:04Z Started Task started by client
2022-10-28T02:03:34Z Terminated OOM Killed
2022-10-28T02:03:34Z Restarting Task restarting in 1.086515382s
2022-10-28T02:03:40Z Started Task started by client
2022-10-28T05:03:28Z Killing Vault: failed to derive vault token: DeriveVaultToken RPC failed: no servers
2022-10-28T05:03:31Z Template Missing: vault.read(apps/data/113240/112625), vault.read(apps/data/113240/112626), vault.read(apps/data/113240/112627), and 14 more
2022-10-28T05:08:36Z Killing Template failed: vault.read(apps/data/113240/112626): vault.read(apps/data/113240/112626): Error making API request.
URL: GET https://active.vault.service.consul:8200/v1/apps/data/113240/112626
Code: 400. Errors:
* missing client token
2022-10-28T05:10:30Z Terminated plugin is shut down
Checks
ID SERVICE STATE OUTPUT
bd5bc5afcbd9bf8bed1c433dd758c330 tcp-80 passing TCP connect 172.19.1.114:80: Success
So that does appear to correspond to an old Fly VM that was ours at some point (which is a relief: it seems nobody stole all our secrets). The last time it started was 2022-10-28T02:03:40Z (which roughly corresponds to what is reported in Sentry), but why is it spewing errors into Sentry today?
Having dug into it a bit more, we've got a total of ~120k of these error events in Sentry over 2 days. The flood seems to have stopped ~2 hours ago. About 40k events each are from 3 instances: 53b890db, 6eb27ad2 and 88945859. I haven't included the details for 6eb27ad2 and 88945859, but they follow a similar pattern: the last attempted start shows 2022-10-28, `flyctl status` doesn't report them as being in our cluster, and they've each chucked ~40k events into our Sentry over the last couple of days (or at least something claiming to be them has). Also, none of those IDs show up in the list under "VM Service Concurrency" in our metrics dashboard.
So it appears to me that some old VMs (which are not visible via `flyctl status`) may somehow still be hanging around. There are a few concerning things about this:
- Given that each of these VMs has managed to throw >40,000 errors into Sentry in the space of a day, I am worried they may actually have been accepting production traffic. It is hard to tell.
- I am unclear how we can identify whether there are "zombie" instances hanging around. The only reason I noticed these existed was because we shut down an external service and they spammed Sentry. If we hadn't done that, we'd never have known. (I've put a rough sketch of the startup tagging I'm considering after this list.)
- I am worried we could be getting billed for compute time for these instances.
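For what it's worth, the mitigation I'm leaning towards on our side is tagging every Sentry event with the identifiers Fly injects into the VM environment at startup, so a stray instance at least shows up clearly in event tags next time. A minimal sketch, assuming FLY_ALLOC_ID and FLY_REGION are available in the VM, and with APP_VERSION as a hypothetical build-time env var:

```ts
// Sketch only, not what we run today: stamp Fly's identifiers onto every event.
import * as Sentry from "@sentry/node";

Sentry.init({
  dsn: process.env.SENTRY_DSN,
  release: process.env.APP_VERSION, // hypothetical env var baked in at build time
});

// FLY_ALLOC_ID / FLY_REGION are set by Fly inside the VM (as far as I can tell)
Sentry.setTag("fly_alloc_id", process.env.FLY_ALLOC_ID ?? "unknown");
Sentry.setTag("fly_region", process.env.FLY_REGION ?? "unknown");
Sentry.setTag("node_version", process.version);
```

That wouldn't answer the question of why these VMs are still alive, but it would at least make them easier to attribute.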
Is anyone able to shed any light on what might have happened here? I'm aware this is a very strange issue.