Fly Load Balancer Seems to Drop All Packets After First in TCP Relay?

Hi All,

I’m attempting to test a permission-less Wireguard Relay I wrote but it looks like the Fly load balancer is blocking all subsequent TCP packets after the first is sent back to the requester. I’ve tested this same container on a standard VPS and it all seems to be working there so I’m curious if something might be different with the Fly networking stack? For this testing system I’m running the fly app (named hyprspace-testing) as a relay which accepts tcp connections (I’m not using any http handlers) and then forwards them over a Wireguard connection to a backend webserver (http://138.68.6.135:8080/). On the backend I can see that the connections are correctly getting through to the webserver and it’s responding to the relay. I can even see that the relay is accepting the packets through the UDP port but it seems like they’re getting dropped going from the Fly app to the load balancer?

Here is the fly.toml I’m using.

app = "hyprspace-testing"

kill_signal = "SIGINT"
kill_timeout = 5

[build]
  image = "ghcr.io/hyprspace/relay:main"

[[services]]
  internal_port = 8080
  protocol = "tcp"

  [[services.ports]]
    port = "80"

[[services]]
  internal_port = 53
  protocol = "udp"

  [[services.ports]]
    port = "53"

[env]
  RELAY_PRIVATEKEY = "RELAY-PRIVATE-KEY"
  RELAY_PUBLICKEY = "RELAY-PUBLIC-KEY"
  RELAY_CLIENT_KEYS = "CLIENT-PUBLIC-KEY"
  RELAY_PORT = "53"
  RELAY_CLIENT_IPS = "10.0.0.2/32"
  RELAY_CLIENT_PORTS = "8080"

And here’s a screenshot of the Wireguard connection from the perspective of the backend web server.

image

Thank you all for your time!

Best,
Alec

Let’s get this working. I want to make sure I understand all the pieces here:

  1. You’re running your own WireGuard, in your VM? You’re not using our WireGuard peering connections?

  2. If I’m right about (1) (I think I am, you have IPv4 AllowedIPs): are you using kernel or user-mode (wireguard-go) WireGuard?

  3. I’m not super clear from this description where you’re seeing packets getting to. It sounds like we’ve got:

    user <---> relay (on fly) <----> web server (elsewhere)

    with user → relay HTTP, and relay → web server WireGuard. The web server sees HTTP requests from the user. Does the relay see the HTTP responses from the web server?

  4. If I wanted to replicate your setup, I can just clone the repo you linked to, and point it at a web server somewhere I own, and set up WireGuard?

You’re running your own WireGuard, in your VM? You’re not using our WireGuard peering connections?

Haha yes sorry I forgot you all also have a Wireguard implementation within the flyctl app! Yes I’m running my own Wireguard server inside of the container.

If I’m right about (1) (I think I am, you have IPv4 AllowedIPs): are you using kernel or user-mode (wireguard-go) WireGuard?

I’m actually using a fairly similar setup to how flyctl’s implementation of Wireguard seems to work. (Userspace wireguard-go plus virtualizing the tun device through Go.)

Does the relay see the HTTP responses from the web server?

Yep I’m seeing responses get all the way back up to the relay just not out of the fly load balancer.

If I wanted to replicate your setup, I can just clone the repo you linked to, and point it at a web server somewhere I own, and set up WireGuard?

Totally! You can actually test it locally on your laptop as well. Since the relay container running in fly is the Wireguard “server”, the connecting client doesn’t need to have a public IP address.

Interesting! I think this might relate to the concurrency settings for the service ports. I just added and increased the concurrency for each of the ports to 200/250 respectively and that seems to have done it! It’s now loading the whole web page instead of just the first packet or so!

I guess from my understanding I thought that the concurrency settings really just applied to booting up more instances with auto scaling enabled. But is it the case that it actually rate limits the number of connections to a single app container?

Concurrency can have an impact on connections routed through our proxy. But just to be clear: if you’re running WireGuard on your instances, that’s UDP, and UDP bypasses our proxy and our concurrency limits.

Hmm maybe I just got lucky in that one test. Rebuilding the Fly App, I’m again experiencing the same issue. I’m running the exact same container (ghcr.io/hyprspace/relay) on both a DigitalOcean Droplet and on Fly. Connecting my laptop to the relay running on DO I’m able to transfer large files and stream data across the relay. However, with the exact same config running on Fly I’m only getting the first couple of bytes back before the request hangs indefinitely.

OK, I’m trying to reproduce this now.

Just a quick note that our native WireGuard support has this as a direct use case; the idea is, you’d generate a WireGuard peer for your (say) EC2 host that’s running a web server (or wherever it is), and speak to it from your exposed front-end service using the private addresses we give WireGuard peers and Fly apps.

Random question: are you running shared-cpu-1x instances? We have seen things get hung up for weird reasons because there’s only a single CPU. You can test this by scaling to dedicated-cpu-2x instances for a few minutes.

OK, I’m trying to reproduce this now.

Thanks so much for your help!

Just a quick note that our native WireGuard support has this as a direct use case; the idea is, you’d generate a WireGuard peer for your (say) EC2 host that’s running a web server (or wherever it is), and speak to it from your exposed front-end service using the private addresses we give WireGuard peers and Fly apps.

Oh awesome! I guess the only problem in my case is that the Relay sort of is my whole front end application. (Although it’d probably work to just stick something like Nginx in a container pointing at the internal Fly Wireguard network.) For now I might stick to using this other container but thanks for the heads up this is great to know!

Random question: are you running shared-cpu-1x instances? We have seen things get hung up for weird reasons because there’s only a single CPU. You can test this by scaling to dedicated-cpu-2x instances for a few minutes.

Yep I was running a single shared-cpu-1x instance. Scaling up to a dedicated-cpu-2x I’m still seeing the same results as before unfortunately. First couple of lines from an HTML page make it through but then everything hangs indefinitely waiting for the rest.

P.S. I think the scaling docs might be out of date under “Viewing The Current VM Size”. The command fly scale vm no longer seems to print the current size of the app’s VMs. Instead it looks like you have to run fly scale show to get that same output.

It looks like I was wrong about all of the Wireguard UDP packets making it back to the relay. Doing some testing this morning it looks like they’re getting blocked going from my laptop back up to the Relay in Fly. I’ve tried a couple of different port numbers for the UDP socket but with all the same results. Typically I’d think this would be something wrong with my local connection or ISP except the same container forwarding data from my laptop is working on other cloud providers.

Does the Fly load balancer do any packet filtering based on IP location? In a quick test changing the source of the packets from my laptop to a Cloud VM hosted on Digital Ocean it looks like they’re all getting through.

So it looks like my laptop can forward packets to Digital Ocean and Digital Ocean can forward packets to Fly, but my Laptop gets cut off after 1-2 packets sending directly to Fly.
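
One way to narrow down which direction stalls is to take two snapshots of wg show wg0 transfer on the laptop a few seconds apart and compare the byte counters. The sketch below assumes that output is one line per peer, with the public key, bytes received, and bytes sent separated by tabs. transfer_delta is a hypothetical helper name, not part of wg or the relay, and wg0 is assumed to be the client interface.

```shell
#!/bin/bash
# transfer_delta BEFORE AFTER
# BEFORE and AFTER are files holding two snapshots of:
#   sudo wg show wg0 transfer
# Assumed line format: <peer public key> TAB <bytes received> TAB <bytes sent>
# Prints how many bytes each peer received and sent between the snapshots.
transfer_delta() {
  awk -F'\t' '
    NR == FNR { rx[$1] = $2; tx[$1] = $3; next }   # first file: remember counters
    ($1 in rx) { printf "%s rx+%.0f tx+%.0f\n", $1, $2 - rx[$1], $3 - tx[$1] }
  ' "$1" "$2"
}

# Example (on the laptop, while a request through the relay is hanging):
#   sudo wg show wg0 transfer > before.txt
#   sleep 10
#   sudo wg show wg0 transfer > after.txt
#   transfer_delta before.txt after.txt
```

If tx keeps growing while rx stays flat, the laptop is still sending but nothing is coming back from the relay. The opposite pattern would point at the laptop-to-relay direction instead.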

@thomas did you ever get a chance to try and reproduce this? I’m still experiencing lost UDP packets on Fly.

I’m having a little trouble getting my head around the deployment model for this. I have it built locally. Can you relate to me the simplest possible configuration — like, if I’m just aiming this at netcat or something — that I could use to deploy it in?

I’m sorry, I misread your previous comment as you having gotten past your UDP forwarding problem. If you’re having trouble with UDP routing on our network, I can get you at least that far.

I’m having a little trouble getting my head around the deployment model for this. I have it built locally. Can you relate to me the simplest possible configuration — like, if I’m just aiming this at netcat or something — that I could use to deploy it in?

Absolutely! Here’s an example deployment toml file below. (If you’re making changes to hyprspace itself in your testing you can change out the image key for the local Dockerfile.)

app = "hyprspace-testing"

kill_signal = "SIGINT"
kill_timeout = 5

[build]
  image = "ghcr.io/hyprspace/relay:latest"

[[services]]
  # Make this port whichever you're running netcat/web server on.
  internal_port = 8080
  protocol = "tcp"

  [[services.ports]]
    port = "80"

  [services.concurrency]
    hard_limit = 25
    soft_limit = 20


[[services]]
  internal_port = 5000
  protocol = "udp"

  [[services.ports]]
    port = "5000"

[env]
  RELAY_PRIVATEKEY = "<<INCLUDE_YOUR_RELAY'S_PRIVATE_KEY_HERE>>"
  RELAY_PUBLICKEY = "<<INCLUDE_YOUR_RELAY'S_PUBLIC_KEY_HERE>>"
  RELAY_CLIENT_KEYS = "<<INCLUDE_YOUR_LAPTOP_SERVER'S_PUBLIC_KEY_HERE>>"
  RELAY_PORT = "5000"
  RELAY_IPSUBNET = "10.0.0.1/24"
  RELAY_CLIENT_IPS = "10.0.0.11/32"
  # Again, whichever port you're running the local service (netcat, web server, etc.) on.
  RELAY_CLIENT_PORTS = "8080"

  1. First, generate a Wireguard public key and private key for the relay and the client with something like this. (Although you might already have a set that you’d like to use.)

#!/bin/bash
SERVER_PRIVATEKEY=$(wg genkey)
SERVER_PUBLICKEY=$(echo "$SERVER_PRIVATEKEY" | wg pubkey)
CLIENT_PRIVATEKEY=$(wg genkey)
CLIENT_PUBLICKEY=$(echo "$CLIENT_PRIVATEKEY" | wg pubkey)

echo "Server Private Key: $SERVER_PRIVATEKEY"
echo " Server Public Key: $SERVER_PUBLICKEY"
echo ""
echo "Client Private Key: $CLIENT_PRIVATEKEY"
echo " Client Public Key: $CLIENT_PUBLICKEY"

  2. Replace the public key and private key values in the toml config and launch to fly.
  3. Copy the IPv4 or IPv6 address from fly ips list into a local wireguard config file. I’ve been using something like this. (/etc/wireguard/wg0.conf)

[Interface]
Address = 10.0.0.11
PrivateKey = LAPTOP_PRIVATE_KEY

[Peer]
PublicKey = RELAYS_PUBLIC_KEY
Endpoint = {RELAY_IP_ADDR}:{RELAY_PORT}
AllowedIPs = 10.0.0.0/24

# This is for if you're behind a NAT and
# want the connection to be kept alive.
PersistentKeepalive = 25

  4. Then, after initiating the connection (sudo wg-quick up wg0), any local application (netcat, etc…) running on the port you’re forwarding through the relay (in this config 8080) should be available through the fly app’s IP address.
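
The client config from the steps above can also be generated rather than edited by hand. This is only a sketch: render_wg_conf is a hypothetical helper, the AllowedIPs and keepalive values are copied from the example config, and the bracket handling assumes the usual [address]:port form for IPv6 endpoints.

```shell
#!/bin/bash
# render_wg_conf CLIENT_PRIVATE_KEY RELAY_PUBLIC_KEY RELAY_IP RELAY_PORT CLIENT_ADDR
# Hypothetical helper that prints a client config like the wg0.conf example.
# CLIENT_ADDR should be the address listed in RELAY_CLIENT_IPS, without /32.
render_wg_conf() {
  local client_key="$1" relay_pub="$2" relay_ip="$3" relay_port="$4" client_addr="$5"
  # IPv6 endpoints are written in brackets, e.g. [2001:db8::1]:5000.
  case "$relay_ip" in
    *:*) relay_ip="[$relay_ip]" ;;
  esac
  cat <<EOF
[Interface]
Address = ${client_addr}
PrivateKey = ${client_key}

[Peer]
PublicKey = ${relay_pub}
Endpoint = ${relay_ip}:${relay_port}
AllowedIPs = 10.0.0.0/24
PersistentKeepalive = 25
EOF
}

# Example, with the relay address taken from fly ips list:
#   render_wg_conf "$CLIENT_PRIVATEKEY" "$SERVER_PUBLICKEY" 203.0.113.10 5000 10.0.0.11 \
#     | sudo tee /etc/wireguard/wg0.conf
```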

Let me know if there’s anything else I can do to help! Thanks so much again for working on this!

OK! I’m trying to get this working now.

A quick note that your keys in the [env] block are what we have flyctl secrets for.
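
A minimal sketch of moving the keys out of the [env] block and into secrets, which the app still receives as environment variables. set_relay_secrets is a hypothetical wrapper, and the FLY variable is an assumed override added here so the command can be previewed with FLY=echo before running it for real.

```shell
#!/bin/bash
# set_relay_secrets APP RELAY_PRIVATE_KEY RELAY_PUBLIC_KEY CLIENT_PUBLIC_KEY
# Stores the relay's WireGuard keys as Fly secrets for APP.
# FLY defaults to the fly CLI; set FLY=echo to print the command instead.
set_relay_secrets() {
  local app="$1" relay_priv="$2" relay_pub="$3" client_pub="$4"
  "${FLY:-fly}" secrets set --app "$app" \
    "RELAY_PRIVATEKEY=$relay_priv" \
    "RELAY_PUBLICKEY=$relay_pub" \
    "RELAY_CLIENT_KEYS=$client_pub"
}

# Example:
#   set_relay_secrets hyprspace-testing "$SERVER_PRIVATEKEY" "$SERVER_PUBLICKEY" "$CLIENT_PUBLICKEY"
```

After that, the three key entries can be dropped from [env] in fly.toml.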

So what you’re trying to do here makes sense to me; here’s what I’m seeing, though:

I’ve got an instance of the app deployed (I built my own Docker image from your source code). It deploys to Fly just fine.

I’m able to bring up a WireGuard connection through it, from my Ubuntu NUC on my home network through 5000/udp to my running instance of your relay application.

In a shell on the relay, with tcpdump, I see bidirectional 5000/udp traffic when I connect to the public IP of the application.

On my NUC, on the client side, I can see with wg that WireGuard is seeing bidirectional traffic. If I watch the WireGuard interface with tcpdump, I see incoming connection attempts — from 10.0.0.1 (the serverside?) to 10.0.0.2 (the clientside?). But I don’t see SA packets from my side; the TCP 3WH isn’t completing, on my NUC.

That suggests to me that the UDP routing is probably working?

Interesting! So just double checking that I understand, you’re getting packets passed down from the relay to your NUC but it doesn’t look like they’re completing the handshake and thus you’re not getting a response at all through the relay? Or are you also reading packets back up through the Wireguard interface but they’re not making it out of the Relay Docker container? Are you able to get a simple “hello world” message, etc… back from a simple HTTP/TCP server or nothing at all?

So: I expose 80/tcp on our Anycast, routed to 8080/tcp on the Relay Fly app, with nc -l 8080 on my NUC.

The NUC establishes a WG circuit with the Relay Fly app. I can see traffic, bidirectionally, over the WG circuit in tcpdump -vvnX -i relayclient.

I connect to myrelay.fly.dev:80 with telnet; I type stuff. I see traffic on the Relay Fly app in tcpdump -vvnX -i eth0 tcp port 8080 and in tcpdump -vvnX -i eth0 udp port 5000.

I see SYN packets arriving at my NUC. I don’t see SYN+ACK packets returned.
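
To watch just one stage of the handshake instead of all traffic, the capture can be narrowed with a flag filter. A sketch, assuming IPv4 inside the tunnel (as with the 10.0.0.x addresses here). handshake_filter is a hypothetical helper that only builds the filter string.

```shell
#!/bin/bash
# handshake_filter PORT KIND
# Prints a tcpdump/pcap filter for one stage of the TCP handshake on PORT.
# KIND is "syn" (SYN set, ACK clear) or "synack" (SYN and ACK both set).
handshake_filter() {
  local port="$1" kind="$2"
  case "$kind" in
    syn)    printf 'tcp port %s and tcp[tcpflags] & (tcp-syn|tcp-ack) = tcp-syn\n' "$port" ;;
    synack) printf 'tcp port %s and tcp[tcpflags] & (tcp-syn|tcp-ack) = (tcp-syn|tcp-ack)\n' "$port" ;;
    *)      echo "usage: handshake_filter PORT syn|synack" >&2; return 1 ;;
  esac
}

# Example, using the interface name from the commands above:
#   sudo tcpdump -ni relayclient "$(handshake_filter 8080 synack)"
```

Running the synack variant on the client's WireGuard interface should stay silent if the replies really are never generated.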

In this configuration (your configuration), I think? my NUC is 10.0.0.2? I can’t ping it from the Fly app.

A challenge with this is that you’re using user-mode WireGuard, so it’s tricky to debug the networking on the Fly app side (I’m willing to keep trying!).

Weirdly: the wg.conf you supplied says my NUC interface is 10.0.0.11, but when I connect to the Anycast address, the relay generates:

    10.0.0.1.16465 > 10.0.0.2.8080: Flags [S], cksum 0xd0e2 (correct), seq 706267490, win 27584, options [mss 1380,nop,nop,TS val 4028544818 ecr 0,nop,wscale 5], length 0

Weird! Running the same config and printing out the destination address of the virtual tcp connection within the relay program (fmt.Println(conn2.RemoteAddr().String())), I’m seeing the correct value of 10.0.0.11:8080. If you run your docker container locally on the NUC as well, are you seeing it pass back SYN+ACK packets as normal? I’m passing large files through a local test without issue.

I made a new branch inside of the relay repository (fly-debugging) with some more debugging information from the internal Wireguard interface. Hopefully this helps some!

It would seem like the “correct” IP is 10.0.0.2, right? tcpdump is the ground truth here.
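
The comparison being made here can be scripted: pull the destination IP out of a tcpdump line and check it against the Address in the client's WireGuard config. This is a sketch with hypothetical helper names. It assumes tcpdump's IPv4 a.b.c.d.port endpoint notation, as in the capture above, and a wg-quick style config with a single Address line.

```shell
#!/bin/bash
# Read tcpdump output on stdin and print the destination IP of each
# "src.port > dst.port:" line (the trailing dotted field is the port).
tcpdump_dst_ip() {
  sed -nE 's/.* > (([0-9]+\.){3}[0-9]+)\.[0-9]+: .*/\1/p'
}

# Print the first Address value from a wg-quick config, without any /mask.
wg_conf_address() {
  sed -nE 's/^[[:space:]]*Address[[:space:]]*=[[:space:]]*([^/,[:space:]]+).*/\1/p' "$1" | head -n 1
}

# check_relay_target CONF < tcpdump-output
# Succeeds only if the first captured destination matches the config.
check_relay_target() {
  local conf="$1" dst want
  dst="$(tcpdump_dst_ip | head -n 1)"
  want="$(wg_conf_address "$conf")"
  if [ "$dst" = "$want" ]; then
    echo "match: relay targets $dst"
  else
    echo "mismatch: relay targets $dst, client is $want"
    return 1
  fi
}
```

Fed the SYN line quoted earlier and the example wg0.conf, this reports a mismatch between 10.0.0.2 and 10.0.0.11, which would explain why no SYN+ACK comes back.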