Fly Load Balancer Seems to Drop All Packets After First in TCP Relay?

I’d think so? Just to test whether this is a configuration issue, would you be willing to spin up a small VM on DigitalOcean, Linode, AWS, or another cloud provider and see if you’re able to relay data successfully? Alternatively, I’ve got a test instance already up and running, and I’d be happy to set you up with an account if you’ve got a public SSH key. I’ve been able to pull large files through the relay successfully outside of Fly, which would be weird if the addresses were really backwards.

Yeah, I’ve got a bunch of telemetry hosts I can commandeer for this; I’ll give it a shot tonight or tomorrow (sorry, spent the day driving my daughter to U of I).

Absolutely no worries! I’ve mostly been working on this project in my free time, which often falls on the weekend, but I totally wouldn’t expect any of you all to work on the weekends!

Hey @thomas, have you had a chance over the past week to take a look at running the relay on an external cloud virtual machine?

Hey @thomas and @kurt, would you all possibly be able to take a look at this sometime over the next couple of weeks? Here are the exact configurations I’m using to host the same app on Fly and on an EC2 instance on AWS for testing. Through the EC2 instance, running the exact same Docker container, I’m able to transfer large files from the backend with no problem.

On Fly, however, the largest transfer that makes it through is right around 1020 bytes before the UDP packets are mysteriously blocked and the remainder of the transfer never reaches the app on Fly.
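
For anyone who wants to reproduce this, here’s a rough sketch of how I’ve been probing the cutoff (the hostname is just a stand-in for my app’s public address, and netcat flags vary a bit between variants):

# On the backend, behind the WireGuard client, capture whatever arrives:
nc -l -p 8080 > received.bin        # some nc variants want just: nc -l 8080

# From anywhere on the internet, push a payload at the Fly app's public
# port 80, which the relay forwards to 8080 over the tunnel:
head -c 4096 /dev/urandom > payload.bin
nc -q 1 hyprspace-testing.fly.dev 80 < payload.bin   # -q 1 closes after EOF; omit if unsupported
wc -c payload.bin received.bin      # on Fly, received.bin stops at roughly 1020 bytes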

Any help on this issue would be hugely appreciated so that I can finally get started on Fly.

Here is my fly.toml file.

# fly.toml file generated for hyprspace-testing on 2021-05-08T22:45:01-07:00

app = "hyprspace-testing"

kill_signal = "SIGINT"
kill_timeout = 5

[build]
  image = "ghcr.io/hyprspace/relay:latest"

[[services]]
  # Set this to whichever port you're running netcat or the web server on.
  internal_port = 8080
  protocol = "tcp"

  [[services.ports]]
    port = "80"

  [services.concurrency]
    hard_limit = 25
    soft_limit = 20


[[services]]
  internal_port = 5000
  protocol = "udp"

  [[services.ports]]
    port = "5000"

[env]
  RELAY_PRIVATEKEY = "6PCzxBu2aJpvPFrgPajHAu/irUi5A/5d/FgFnuJGtn8="
  RELAY_PUBLICKEY = "qFcN6hR6uzNrK6ifxaarQrQWdAeuuWXFgjYP5uN7Yzw="
  RELAY_CLIENT_KEYS = "bxD1y69wn+cv1OJlIK88JliMo06rxA0f9N266Sr1cTk="
  RELAY_PORT = "5000"
  RELAY_IPSUBNET = "10.0.0.1/24"
  RELAY_CLIENT_IPS = "10.0.0.11/32"
  # Again, whichever port you're running the local service (netcat, web server, etc.) on.
  RELAY_CLIENT_PORTS = "8080"
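
(As a quick sanity check after deploying, I just run the standard flyctl status/logs commands against the app name above; nothing relay-specific.)

# Confirm the deploy is healthy and the tcp/udp services are registered.
flyctl status -a hyprspace-testing

# Tail the relay's output while testing transfers.
flyctl logs -a hyprspace-testing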

And here is my docker-compose.yml on the EC2 instance.

version: '3'
services:
  portal:
    image: ghcr.io/hyprspace/relay
    restart: always
    environment:
      - RELAY_PRIVATEKEY=6PCzxBu2aJpvPFrgPajHAu/irUi5A/5d/FgFnuJGtn8=
      - RELAY_PUBLICKEY=qFcN6hR6uzNrK6ifxaarQrQWdAeuuWXFgjYP5uN7Yzw=
      - RELAY_PORT=5000
      - RELAY_IPSUBNET=10.0.0.1/24
      - RELAY_CLIENT_KEYS=bxD1y69wn+cv1OJlIK88JliMo06rxA0f9N266Sr1cTk=
      - RELAY_CLIENT_PORTS=8080
      - RELAY_CLIENT_IPS=10.0.0.11/32
    # Insert the ports to relay, plus the server port.
    ports:
      - 5000:5000/udp
      - 80:8080

And here is my client WireGuard config running on the backend, which is the same for both environments; I just switch out the relevant IPv4 address for the EC2 instance or the Fly app.

[Interface]
Address = 10.0.0.11
PrivateKey = eA2fH3u5YYkGGNvbY9CKvyaSGs8xHxFgHHaxxXyt+lg=

[Peer]
PublicKey = qFcN6hR6uzNrK6ifxaarQrQWdAeuuWXFgjYP5uN7Yzw=
Endpoint = [[INSERT-IP-HERE]]:5000
AllowedIPs = 10.0.0.1/32

# This is for if you're behind a NAT and
# want the connection to be kept alive.
PersistentKeepalive = 25
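
(For reference, I bring the tunnel up with wg-quick and then watch the handshake with wg; the file name here is arbitrary.)

# Bring the tunnel up from the config above.
sudo wg-quick up ./relay.conf

# Confirm the handshake completed and watch the transfer counters
# while running a test transfer.
sudo wg show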

Let me know if there’s anything else I can do to help troubleshoot this network bug!

Best,
Alec

P.S. All of these private/public keys are just for testing, so I’m not worried about posting them publicly.

I’m diving back into this today. Sorry. I should have been more communicative. You can hit us up here again, or mail us privately, if you’re looking for more status; don’t ever worry about being a bother.

So, here’s the new annoying problem I’m running into: this configuration seems to work fine? (Thanks for providing it).

Here’s what I did:

  1. I copied your fly.toml over mine and fixed the app name to point to my app.

  2. I flyctl deploy'd.

  3. I booted up an EC2 instance with WireGuard on it, copied your exact configuration up (keys and all) as relay.conf, and did wg-quick up ./relay.conf. It handshook immediately.

  4. I kicked off socat TCP4-LISTEN:8080,fork STDOUT | tee stdout.log to pick up incoming connections.

  5. I made some quick test connections to confirm that my fly-app’s port 80 hit 8080/tcp on the EC2 instance over WireGuard. Yup.

  6. I ran curl --data-binary "@./cracklib-small" http://my-fly-app, which sends a 500k file across the relay, and confirmed the whole thing was read on the other side.

I’m still poking around; maybe this problem only happens on large outbound transfers from the EC2 side of the relay.

Nope, that’s not it; I can python -m SimpleHTTPServer 8080 in /usr/share and slurp arbitrary multi-meg files down; it’s not fast but it doesn’t stall weirdly either.

For what it’s worth, we can boot something up on Vultr or something and share access to a server if that’ll help us reproduce and diagnose this together. I’m game for whatever!

It’s great that you got the app running, and thank you for coming back to take a look at it! I really appreciate the time you’ve put into helping me with this.

Testing this out quickly: if I run the WireGuard client end on an AWS instance, I’m also not getting any stalling! However, if I run a WireGuard client + HTTP server on a local virtual machine or my laptop, I’m still experiencing the stalling. That could be something with my firewall, except I’ve tested the app on an Intel NUC connected to my ISP connection as well as on my laptop over my cell connection, with the same result. Do you all do any filtering of consumer ISPs versus a cloud provider like AWS?

Is your local machine behind NAT? We don’t do any traffic filtering, but we did have to put keepalive settings in WireGuard configs to work around random hangs.

Huh, I am behind a NAT, but I’ve also added a 25s keepalive to the client’s WireGuard configuration. If you all aren’t doing any sort of IP filtering, that would lead me to think it must be some local network issue, except that running the WireGuard server in a Docker container on the AWS instance and connecting to it locally from my laptop works like a charm for relaying data.
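
One thing I can do on my end is watch the traffic directly while a transfer stalls, roughly like this (the interface name is whatever wg-quick created on my machine, so treat wg0 as a placeholder):

# Encrypted WireGuard traffic on the wire: does UDP 5000 keep flowing
# after the transfer stalls, or does it stop entirely?
sudo tcpdump -ni any 'udp port 5000'

# Decrypted traffic inside the tunnel (replace wg0 with the actual
# interface name reported by `wg show`).
sudo tcpdump -ni wg0 'tcp port 8080'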

I know this is kind of a pain, but could either of you try connecting to a relay hosted on Fly over WireGuard from a local machine rather than an AWS instance and see if you experience the dropped connection?

I can use relays somewhere other than AWS (Vultr or something), but testing locally is pretty painful right now, since I’ve already got a bunch of WireGuard connections going on my local dev machine, with overlapping addressing and stuff. I’ll poke around and see if I can come up with a way.

Do you have thoughts on what could be breaking this on home networks and not over AWS? Your VM running on Fly doesn’t do anything to tell the difference.

Hey all!

Apologies for taking so long to get back to this thread! I just rewrote the relay on top of TCP and the new Hyprspace Protocol, and it’s working flawlessly on Fly! So I guess the issue is related to using UDP? Anyway, once again, thank you for all your help and for taking the time to really work with me on this issue. I’m really looking forward to building more on top of Fly and possibly getting to know you all a little more along the way.

Best,
Alec