This is a really odd issue.
We used to connect out to an external redis instance hosted on ComposeDB.
We migrated away from that setup. The env vars providing the ComposeDB connection string were deleted. Having completed the migration, a couple of days I deleted our old redis instance on ComposeDB.
Then on 27 Feb, something started spewing
getaddrinfo ENOTFOUND our-old-instance-id.composedb.com errors into Sentry.
Naturally my first instinct was that we still had some code trying to call out to that redis instance on compose that we had not cleaned up, but the connection string didn’t exist anywhere in our code or the production env vars. So… there must be something else somewhere with our Sentry DSN and old Redis URL. Maybe there’s an old server out there that hasn’t been shut down from before we migrated to fly? Had all our secrets been stolen somehow?
Our current docker image uses
node v16.19.1 and all our instances have been deployed pretty recently.
Further investigation revealed that all the
getaddrinfo ENOTFOUND our-old-instance-id.composedb.com errors I could see in Sentry were coming from a server running
node v16.18.0 which started up at
2022-10-28T02:03:41.040Z(4 months before this event). It reports its server name as
So I ran
53b890db is not one of our currently running instances.
Then I ran
flyctl vm status 53b890db:
Instance ID = 53b890db Process = app Version = 669 Region = ewr Desired = stop Status = failed Health Checks = 1 total, 1 passing Restarts = 18 Created = 2022-10-25T19:57:18Z Events TIMESTAMP TYPE MESSAGE 2022-10-25T19:57:12Z Received Task received by client 2022-10-25T19:57:12Z Task Setup Building Task Directory 2022-10-25T19:57:21Z Started Task started by client 2022-10-26T01:49:45Z Terminated OOM Killed 2022-10-26T01:49:45Z Restarting Task restarting in 1.075033318s 2022-10-26T01:49:51Z Started Task started by client 2022-10-26T03:18:01Z Terminated OOM Killed 2022-10-26T03:18:01Z Restarting Task restarting in 1.06766741s 2022-10-26T03:18:07Z Started Task started by client 2022-10-26T08:48:43Z Terminated OOM Killed 2022-10-26T08:48:43Z Restarting Task restarting in 1.196959979s 2022-10-26T08:48:49Z Started Task started by client 2022-10-26T11:21:07Z Terminated OOM Killed 2022-10-26T11:21:07Z Restarting Task restarting in 1.093784231s 2022-10-26T11:21:13Z Started Task started by client 2022-10-26T12:17:51Z Terminated OOM Killed 2022-10-26T12:17:51Z Restarting Task restarting in 1.002185957s 2022-10-26T12:17:58Z Started Task started by client 2022-10-26T19:04:46Z Terminated OOM Killed 2022-10-26T19:04:46Z Restarting Task restarting in 1.249096692s 2022-10-26T19:04:52Z Started Task started by client 2022-10-26T20:23:28Z Terminated OOM Killed 2022-10-26T20:23:28Z Restarting Task restarting in 1.248760006s 2022-10-26T20:23:34Z Started Task started by client 2022-10-26T22:04:00Z Terminated OOM Killed 2022-10-26T22:04:00Z Restarting Task restarting in 1.050022376s 2022-10-26T22:04:07Z Started Task started by client 2022-10-27T01:28:15Z Terminated OOM Killed 2022-10-27T01:28:15Z Restarting Task restarting in 1.168919129s 2022-10-27T01:28:22Z Started Task started by client 2022-10-27T02:37:34Z Terminated OOM Killed 2022-10-27T02:37:34Z Restarting Task restarting in 1.24585659s 2022-10-27T02:37:40Z Started Task started by client 2022-10-27T03:07:24Z Terminated OOM Killed 2022-10-27T03:07:24Z Restarting Task restarting in 1.176336229s 2022-10-27T03:07:30Z Started Task started by client 2022-10-27T05:39:30Z Terminated OOM Killed 2022-10-27T05:39:30Z Restarting Task restarting in 1.159094445s 2022-10-27T05:39:36Z Started Task started by client 2022-10-27T06:33:34Z Terminated OOM Killed 2022-10-27T06:33:34Z Restarting Task restarting in 1.013122023s 2022-10-27T06:33:40Z Started Task started by client 2022-10-27T10:46:42Z Terminated OOM Killed 2022-10-27T10:46:42Z Restarting Task restarting in 1.197279832s 2022-10-27T10:46:49Z Started Task started by client 2022-10-27T13:16:33Z Terminated OOM Killed 2022-10-27T13:16:33Z Restarting Task restarting in 1.141440758s 2022-10-27T13:16:39Z Started Task started by client 2022-10-27T19:17:17Z Terminated OOM Killed 2022-10-27T19:17:17Z Restarting Task restarting in 1.005445972s 2022-10-27T19:17:25Z Started Task started by client 2022-10-27T20:03:57Z Terminated OOM Killed 2022-10-27T20:03:57Z Restarting Task restarting in 1.229871888s 2022-10-27T20:04:04Z Started Task started by client 2022-10-28T02:03:34Z Terminated OOM Killed 2022-10-28T02:03:34Z Restarting Task restarting in 1.086515382s 2022-10-28T02:03:40Z Started Task started by client 2022-10-28T05:03:28Z Killing Vault: failed to derive vault token: DeriveVaultToken RPC failed: no servers 2022-10-28T05:03:31Z Template Missing: vault.read(apps/data/113240/112625), vault.read(apps/data/113240/112626), vault.read(apps/data/113240/112627), and 14 more 2022-10-28T05:08:36Z Killing Template failed: vault.read(apps/data/113240/112626): vault.read(apps/data/113240/112626): Error making API request. URL: GET https://active.vault.service.consul:8200/v1/apps/data/113240/112626 Code: 400. Errors: * missing client token 2022-10-28T05:10:30Z Terminated plugin is shut down Checks ID SERVICE STATE OUTPUT bd5bc5afcbd9bf8bed1c433dd758c330 tcp-80 passing TCP connect 172.19.1.114:80: Success
so that does appear to correspond to an old fly VM that was ours at some point (which is a relief - it seems nobody stole all our secrets). The last time it started was
2022-10-28T02:03:40Z (which roughly corresponds to what is reported in sentry), but why is it spewing errors into sentry today?
Having dug into it a bit more, we’ve got a total of ~120k of these error events in sentry over 2 days. The flood seems to have stopped ~2 hours ago. About 40k events each are from 3 instances:
88945859. I haven’t included the details for
88945859 but they follow a similar pattern: the last attempted start shows 2022-10-28,
flyctl status doesn’t report them being in our cluster and they’ve all chucked ~40k events into our sentry over the last couple of days (or at least something reporting to be them has). Also none of those IDs show up in the list under “VM Service Concurrency” in our metrics dashboard.
So it appears to me that some old VMs (which are not visible via
flyctl status) may somehow still be hanging around. There’s a few concerning things about this:
- Given each of these VMs have managed to throw >40,000 errors into sentry in the space of a day I am worried they may actually have been accepting production traffic. It is hard to tell
- I am unclear how we can identify if there are ‘zombie’ instances hanging around. The only reason I noticed these existed was because we shut down an external service and they spammed Sentry. If we hadn’t done then we’d never have known
- I am worried we could be getting billed for compute time for these instances
Is anyone able to shed any light on what might have happened here? I’m aware this is a very strange issue.