IPv6 outbound from gru to Supabase 100% packet loss

Hi team,

Since around 2026-04-28 21:27 UTC, all deploys of my Phoenix app on Fly have been failing because the BEAM can’t open new connections to my Supabase Postgres database. After investigating, the root cause appears to be that outbound IPv6 from gru is dropping 100% of packets to Supabase’s IPv6 endpoint, while already-established connections continue to work.

This was the canonical Fly + Supabase setup - generated by mix phx.gen.release with ECTO_IPV6=true and ERL_AFLAGS="-proto_dist inet6_tcp" - and it had been working reliably across many daily deploys.

Reproduction (from inside a running app machine)

# ping6 -c 10 -W 2 db.<myproject>.supabase.co
PING db.<myproject>.supabase.co (2600:1f1e:75b:4b14:1aef:2c9e:fcd6:8d12): 56 data bytes
--- db.<myproject>.supabase.co ping statistics ---
10 packets transmitted, 0 packets received, 100% packet loss

DNS resolution works correctly (returns 2600:1f1e:75b:4b14:1aef:2c9e:fcd6:8d12).

For comparison, IPv4 to Supabase’s pooler endpoint works fine from the same machine:

# nc -zv -w 5 aws-1-sa-east-1.pooler.supabase.com 5432
aws-1-sa-east-1.pooler.supabase.com (54.232.77.43:5432) open
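
Since ping6 only exercises ICMP, a TCP check against port 5432 on both paths gives a closer like-for-like comparison. The following is a minimal sketch, assuming the machine's nc accepts -z and -w as in the command above; the probe function name is hypothetical, and the addresses in the example calls are the ones quoted in this report.

```shell
# Report whether a TCP port accepts connections within 5 seconds.
# Passing a literal address (rather than a hostname) pins the address family.
probe() {
  target="$1"
  port="$2"
  if nc -z -w 5 "$target" "$port" >/dev/null 2>&1; then
    echo "$target port $port: open"
  else
    echo "$target port $port: unreachable"
  fi
}

# Example calls with the addresses from this report:
# probe 2600:1f1e:75b:4b14:1aef:2c9e:fcd6:8d12 5432   # direct host over IPv6
# probe 54.232.77.43 5432                             # pooler over IPv4
```

If the IPv6 literal reports unreachable while the IPv4 one reports open, the failure is at the network layer rather than in DNS or the application.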

Application-level symptom

In the BEAM, every Postgrex/Ecto connection attempt to the direct host hangs and gets dropped from the pool queue:


** (DBConnection.ConnectionError) [Visor.Repo] connection not available
   and request was dropped from queue after 10980ms.

Alternatively, if I add socket_options: [:inet6], Erlang resolves the AAAA record correctly but the TCP connection times out (consistent with the ping6 packet loss).

Environment

  • App: visor

  • Org: personal

  • Region: gru

  • Two app machines, both started

  • Image: visor:deployment-01KQB13CRCV2Y4GM3T8VJ3AA02

  • Direct host AAAA: 2600:1f1e:75b:4b14:1aef:2c9e:fcd6:8d12

What I’ve ruled out

  • DNS - resolves correctly via Erlang :inet.gethostbyname/2 and via getent ahosts.

  • Application code - no relevant changes; deploys were working until ~21:27 UTC.

  • Supabase side - their direct host AAAA is unchanged; an unrelated PostgREST incident is active but the Postgres component shows operational; the same IPv6 endpoint is reachable from my home network.

  • Connection pool exhaustion - Supabase reports 31/120 connections, no locks, no stuck migrations.

Workaround in place

Switched DATABASE_URL to the Supavisor session pooler (aws-1-sa-east-1.pooler.supabase.com:5432). App is back up.

Questions

  1. Is there a known issue with gru → AWS sa-east-1 IPv6 egress in the last 24-48h?

  2. Has anything changed in Fly’s outbound IPv6 routing recently?

  3. Is there a traceroute6 / mtr equivalent I can run from a machine to help debug, or can someone on your side check the path?

Happy to provide any logs, machine IDs, or run further tests. Thanks!

👋 Thanks for raising this! We found a subset of hosts in gru where outbound IPv6 routing was broken. As of 14:10 UTC today, a network translation fix is running on these hosts and the immediate issue is resolved. We also have work underway to prevent this from happening on other hosts.

You should be able to connect to Supabase using IPv6 again. Since this issue only affected IPv6, switching to IPv4 was the best workaround in your case (since the option was available).

Another option would be replacing the affected Machines with fly machines clone <id> then fly machines destroy <id>, which would be effective as long as the clone(s) landed on a different host without the IPv6 issue.
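
The clone-then-destroy sequence could be wrapped roughly as follows. This is only a sketch: the replace_machine function and the DRY_RUN variable are hypothetical names, the two fly commands are the ones quoted above, and a started Machine may need to be stopped before it can be destroyed.

```shell
# Replace one Machine: clone it, then destroy the original.
# With DRY_RUN=1 the commands are printed instead of executed.
replace_machine() {
  id="$1"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "fly machines clone $id"
    echo "fly machines destroy $id"
    return 0
  fi
  # Abort if the clone fails, so the original Machine is kept.
  fly machines clone "$id" || return 1
  fly machines destroy "$id"
}

# Preview the commands for a placeholder Machine ID:
# DRY_RUN=1 replace_machine "<id>"
```

Because the clone is not guaranteed to land on a different host, it is worth checking IPv6 connectivity from the new Machine before destroying the old one.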

-Leslie

Is there a bounty program I could have reported this kind of issue to?

Can you clarify what you mean by “bounty program”?

A bug bounty program, like the ones HackerOne aggregates.

We don’t have a bug bounty program.

Cheers!

@flyio-support — I had possibly the same class of issue four days later in syd. Your “work underway to prevent this on other hosts” comment is the reason I’m posting here rather than opening a separate thread.

Symptom: 12-min total loss of TCP connectivity from both our web and worker to our Postgres flycast IPv6, 2026-05-02 02:20:00 → 02:32:35 UTC, syd. Spontaneous recovery, no intervention. Postgres itself stayed healthy throughout (no restart, no FATAL, no connection-limit hits).

  • App: powercoach, DB app: powercoach-db (machine 781112eb249518)

  • Affected client machines: web 178195db510238, worker 08016e6a366008

  • Error: connection to server at "fdaa:3c:8f90:0:1::2", port 5432 failed: timeout expired

Question: was there a host-level outbound-IPv6 event in syd during that window, or were any of those three machines on a host that hadn’t yet received the network-translation fix you mentioned?

Happy to file a ticket with the same data if private channels are easier — wanted to start here in case it’s directly relevant to the work this thread is already tracking.

Thanks!