IPv6 outbound from `fra` to public AAAA hosts: 100% timeout (no v6 default route)

mvodopija · May 6, 2026, 12:08pm

Hi team,

Since around 2026-05-06 ~08:35 UTC, app machines in fra have lost public IPv6 egress. There is no IPv6 default route inside the container, and every TCP/HTTP attempt to a public AAAA host (Google, Cloudflare, AWS Cognito eu-central-1) times out. IPv4 to the same hosts works. 6PN-internal IPv6 between Fly machines and to Postgres on *.internal still works fine — the machine is reachable over its fdaa:… 6PN address for SSH and the app’s database connections are healthy.

This affects every machine in the app: a fly machine restart did not change the symptom, and a fly secrets-driven rolling restart that produced a new machine version also did not. So this isn’t a single bad host.

Reproduction (run inside any started machine)

Save as fly-ipv6-repro.sh and pipe through fly ssh console:

fly ssh console --app <your-app> -C "bash -s" < fly-ipv6-repro.sh

…or just paste the body of the script directly into an interactive fly ssh console session. Script:

#!/usr/bin/env bash
# Self-contained IPv6 egress reproducer. Run from inside a Fly machine.
# Exit codes inside the body don't matter; we just print pass/fail per check.
set -u

echo "=== machine=$(hostname) uptime=$(uptime -p) when=$(date -u +%FT%TZ) ==="

echo
echo "--- IPv6 default route ---"
ip -6 route show default || true

echo
echo "--- DNS for www.google.com (returns both A and real AAAA) ---"
getent ahosts www.google.com | sort -u

echo
echo "--- IPv4 control: wget -4 https://www.google.com ---"
timeout 8 wget -q -O /dev/null --timeout=6 --tries=1 -4 https://www.google.com \
  && echo IPv4_OK || echo IPv4_FAIL_$?

echo
echo "--- IPv6 wget to https://www.google.com (real AAAA) ---"
timeout 10 wget -q -O /dev/null --timeout=8 --tries=1 -6 https://www.google.com \
  && echo IPv6_OK || echo IPv6_FAIL_$?

echo
echo "--- IPv6 wget to https://www.cloudflare.com (different AS, real AAAA) ---"
timeout 10 wget -q -O /dev/null --timeout=8 --tries=1 -6 https://www.cloudflare.com \
  && echo IPv6_OK || echo IPv6_FAIL_$?

echo
echo "--- Raw TCP6 SYN to a public v6 IP:443 (no TLS, no HTTP) ---"
v6=$(getent ahostsv6 www.google.com | awk '!/::ffff:/{print $1; exit}')
echo "target=$v6"
timeout 8 bash -c "</dev/tcp/$v6/443" \
  && echo TCP6_OK || echo TCP6_FAIL_$?

echo
echo "--- 6PN sanity: ping Fly internal DNS over IPv6 ---"
timeout 6 ping -6 -c 2 -W 2 fdaa:0:1::3 || true

Observed output

=== machine=<redacted> uptime=up 35 minutes when=2026-05-06T11:46:48Z ===

--- IPv6 default route ---
(empty — no v6 default route configured)

--- DNS for www.google.com ---
142.251.157.119          STREAM www.google.com
2001:4860:4826:7700::    STREAM
2001:4860:4827:7700::    STREAM
2001:4860:4828:7700::    STREAM
...

--- IPv4 control: wget -4 https://www.google.com ---
IPv4_OK

--- IPv6 wget to https://www.google.com (real AAAA) ---
IPv6_FAIL_124

--- IPv6 wget to https://www.cloudflare.com (different AS, real AAAA) ---
IPv6_FAIL_124

--- Raw TCP6 SYN to a public v6 IP:443 ---
target=2001:4860:4829:7700::
TCP6_FAIL_124

--- 6PN sanity: ping Fly internal DNS over IPv6 ---
2 packets transmitted, 2 received, 0% packet loss

FAIL_124 means the outer timeout killed the call. Without the outer guard, wget -6 itself eventually exits via its --timeout after a much longer wall-clock time; the typical observed time before it gives up and falls back to v4 is ~30s.
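For anyone reading the `*_FAIL_<rc>` suffixes in their own runs, here is a minimal sketch of a decoder. `describe_rc` is a hypothetical helper, not part of the script above. It assumes GNU coreutils `timeout` (124 when the time limit is hit, 127 when the wrapped command cannot be found) and wget's documented exit status 4 for network failures.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not in the original repro script): map the numeric
# suffix of IPv4_FAIL_<rc> / IPv6_FAIL_<rc> / TCP6_FAIL_<rc> to a likely cause.
# Assumes GNU coreutils `timeout` and GNU wget exit-status conventions.
describe_rc() {
  case "$1" in
    0)   echo "ok" ;;
    4)   echo "wget: network failure" ;;
    124) echo "outer timeout expired" ;;
    127) echo "command not found" ;;
    *)   echo "other failure (rc=$1)" ;;
  esac
}

describe_rc 124
describe_rc 127
```

A 127 is worth watching for in slim runtime images: it means the tool was missing, not that the network check failed.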

Application-level symptom

We’re a .NET 10 ASP.NET Core app using Microsoft.AspNetCore.Authentication.JwtBearer against AWS Cognito eu-central-1. Cognito is one of the few public AWS endpoints that publishes real AAAA records (most others, like s3.eu-central-1.amazonaws.com, return only IPv4-mapped ::ffff:… addresses, so connects go via v4 anyway). Microsoft.IdentityModel.Protocols.OpenIdConnect.ConfigurationManager uses a managed HttpClient to fetch the OIDC discovery doc; on Linux it does not implement Happy-Eyeballs the way wget/curl do, so it stalls on the v6 connect for the full backchannel timeout, never gets a usable BaseConfiguration, and on every authenticated request the JwtBearer middleware throws:

Microsoft.IdentityModel.Tokens.SecurityTokenInvalidIssuerException:
IDX10204: Unable to validate issuer. validationParameters.ValidIssuer is null
or whitespace AND validationParameters.ValidIssuers is null or empty.

…which produces a 401 to the client with error="invalid_token", error_description="The issuer '<authority>' is invalid". So at the app level it looks like a Cognito issuer-mismatch bug, but it’s actually a missing OIDC config caused by failed v6 egress.
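The real-AAAA versus IPv4-mapped distinction matters for deciding which endpoints are exposed to this failure. Below is a minimal sketch that applies the same `::ffff:` filter as the repro script to any hostname. The helper names `real_v6_addrs` and `has_real_aaaa` are made up for illustration, and it assumes glibc's `getent ahostsv6` output format (address in the first column).

```shell
#!/usr/bin/env bash
# real_v6_addrs: read `getent ahostsv6` output on stdin and print each unique
# address that is NOT IPv4-mapped (::ffff:a.b.c.d). Blank lines are skipped.
real_v6_addrs() {
  awk 'NF && $1 !~ /^::ffff:/ && !seen[$1]++ { print $1 }'
}

# has_real_aaaa HOST: succeed if HOST resolves to at least one native IPv6
# address, i.e. a client that prefers v6 would actually attempt a v6 connect.
has_real_aaaa() {
  [ -n "$(getent ahostsv6 "$1" 2>/dev/null | real_v6_addrs)" ]
}
```

Something like `has_real_aaaa www.google.com && echo native-v6` then flags the hosts an IPv6-preferring client would stall on.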

Environment

Region: fra
Image: based on mcr.microsoft.com/dotnet/aspnet:10.0
Process group: app, two app machines (min_machines_running = 1, rolling deploy)
6PN intra-org IPv6: working
Public IPv4 egress: working
Public IPv6 egress: 100% timeout (no default route)

What I’ve ruled out

DNS — getent ahostsv6 returns expected real AAAA records.
Per-host route — fails identically to Google, Cloudflare, AWS Cognito (different ASes).
Bad single host — issue persists across fly machine restart and across a fly secrets-triggered rolling restart that bumped to a new machine version. Both machines in the app affected.
Cognito side — fetching the OIDC discovery doc over IPv4 from inside the same machine returns the expected issuer field. Customer side and config are fine.
Process-level disable as a workaround — setting DOTNET_SYSTEM_NET_DISABLEIPV6=1 immediately broke the app: Postgres on *.internal is IPv6-only over 6PN, so disabling v6 process-wide killed DB connectivity at startup (Npgsql.NpgsqlException: Name or service not known). Reverted within seconds.
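The Cognito-side check above can be repeated from inside a machine with something like the following sketch. `extract_issuer` and `fetch_issuer_v4` are hypothetical helper names, the discovery URL is whatever your authority's `.well-known/openid-configuration` endpoint is (not filled in here), and the sed-based extraction is only a rough stand-in for a real JSON parser such as jq.

```shell
#!/usr/bin/env bash
# extract_issuer: print the value of the "issuer" key from a JSON document on
# stdin. Regex-based sketch; assumes a flat discovery doc with one "issuer" key.
extract_issuer() {
  tr -d '\n' | sed -n 's/.*"issuer"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

# fetch_issuer_v4 URL: fetch an OIDC discovery doc over IPv4 only (-4) and
# print its issuer, using the same timeout guards as the repro script.
fetch_issuer_v4() {
  timeout 10 wget -q -O - --timeout=8 --tries=1 -4 "$1" | extract_issuer
}
```

If `fetch_issuer_v4 "<discovery-url>"` prints the expected issuer while the app still logs IDX10204, the identity provider and app config are probably fine and the problem is on the connect path.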

Workaround in place

Configuring JwtBearerOptions.BackchannelHttpHandler with a SocketsHttpHandler.ConnectCallback that filters DNS results to AddressFamily.InterNetwork — IPv4 only, scoped to OIDC/JWKS traffic. 6PN/Postgres continues over IPv6 unaffected. Deploying that now.

Questions

Is public IPv6 egress from fra known to be broken right now, either in general or specifically for our org? A peer reported the same symptom shape from gru to a Supabase AAAA endpoint a few days ago, so it could be the same root cause surfacing in a different region.
Can someone on staff check the v6 default-route advertisement / RA on our machines in fra without us needing to reproduce live? Happy to share machine IDs and the app name in a DM.
Is there a traceroute6 / mtr -6 you’d like us to run for one diagnostic pass? Neither is in our runtime image, but we can apt-get install once if it’d help.

Thanks!

lillian · May 6, 2026, 12:45pm

it’s a bit unusual for private v6 to work but not public. can you share the full output of ip -6 route show?

mvodopija · May 6, 2026, 1:01pm

Good catch, and apologies: my “no v6 default route” framing in the OP was wrong. Our runtime image (based on mcr.microsoft.com/dotnet/aspnet:10.0) doesn’t ship iproute2, so the ip -6 route show default in my repro script failed to exec and the || true guard swallowed the error, leaving that section empty. With iproute2 installed temporarily on the same machine just now:

$ ip -6 route show
2a02:6ea0:c7c0::e00a:a666:0/127 dev eth0 proto kernel metric 256 pref medium
fdaa:39:f767:a7b:78b:6c09:97a6:0/112 dev eth0 proto kernel metric 256 pref medium
fe80::/64 dev eth0 proto kernel metric 256 pref medium
default via 2a02:6ea0:c7c0::e00a:a666:0 dev eth0 proto static metric 1024 pref medium

$ ip -6 addr show eth0
3: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1420 state UP
    inet6 2a02:6ea0:c7c0::e00a:a666:1/127 scope global nodad
    inet6 fdaa:39:f767:a7b:78b:6c09:97a6:2/112 scope global nodad
    inet6 fe80::dcad:8fff:fe77:9e08/64 scope link

$ ip -6 -s neigh show
2a02:6ea0:c7c0::e00a:a666:0 dev eth0 lladdr 62:a6:1f:cf:74:dd router REACHABLE

So: public /127 + 6PN /112 + link-local, default route via the /127 peer, gateway is REACHABLE at L2. Yet TCP6 anywhere past the gateway hangs:

$ wget -4 https://www.google.com        → IPv4_OK
$ wget -6 https://www.google.com        → timeout
$ wget -6 https://www.cloudflare.com    → timeout

$ traceroute6 -m 8 -w 2 2001:4860:482b:7700::
traceroute to 2001:4860:482b:7700::, 8 hops max, 80 byte packets
 1  unn-fra.cdn77.com (2a02:6ea0:c7c0::e00a:a666:0)  4.7 ms  4.7 ms  4.7 ms
 2  * * *
 3  * * *
 ... (all * through hop 8)

So the failure isn’t routing on our side — the kernel sends the packet, it reaches the upstream router, and then it’s a black hole. v4 from the same machine to the same destinations is fine, and 6PN intra-org IPv6 (fdaa:…) is also fine — both clearly take a different path. The hop-1 hostname being unn-fra.cdn77.com makes me wonder if the issue is upstream of the Fly host on the cdn77 side rather than per-machine. Both machines in the app show identical behaviour and neither a fly machine restart nor a rolling restart changed anything.

Happy to run mtr -6 over a few minutes or anything else from this end if it’d help narrow it down.
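For anyone reusing the repro script from the first post on an image without iproute2, here is a rough sketch of a route check that doesn't confuse "tool missing" with "no route". The function names are hypothetical. The fallback reads /proc/net/ipv6_route and treats an all-zero destination with prefix length 00 on a device other than lo as a default route, which is a heuristic rather than a full parse of the route flags.

```shell
#!/usr/bin/env bash
# has_v6_default_route [FILE]: succeed if FILE (default: /proc/net/ipv6_route)
# lists a ::/0 route on a device other than lo. Columns are: dest, dest prefix
# length, src, src prefix length, next hop, metric, refcnt, use, flags, device.
has_v6_default_route() {
  awk '$1 == "00000000000000000000000000000000" && $2 == "00" && $10 != "lo" { found = 1 }
       END { exit !found }' "${1:-/proc/net/ipv6_route}"
}

# check_v6_default_route: prefer `ip` when it is installed, otherwise fall back
# to /proc, and say so explicitly instead of printing nothing.
check_v6_default_route() {
  if command -v ip >/dev/null 2>&1; then
    if ip -6 route show default | grep -q .; then
      echo "default route present"
    else
      echo "no default route"
    fi
  elif [ -r /proc/net/ipv6_route ]; then
    echo "ip not installed; checking /proc/net/ipv6_route"
    if has_v6_default_route; then
      echo "default route present"
    else
      echo "no default route"
    fi
  else
    echo "cannot tell: ip not installed and /proc/net/ipv6_route unreadable"
  fi
}
```

Calling `check_v6_default_route` in place of the bare `ip -6 route show default || true` line would have printed the "ip not installed" notice rather than an empty section.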

PeterCxy · May 6, 2026, 1:58pm

Does it look better for you now on the machine mentioned above? This is related to a temporary issue with one of our providers and is not going to happen on any new machines created. If you have other machines experiencing the issue, recreating them should also help.

mvodopija · May 6, 2026, 2:09pm

Yes, looks better now, IPv6 connectivity is restored. We have a workaround in place if it happens again.