Surprisingly high outbound bandwidth charges since 8 Jan

Since 8 Jan, one of my apps has had shockingly high data egress. It’s a pretty simple WebSocket server for our dev environment, but for some reason it recorded 210GB of egress last month, and the Fly bill preview is estimating 860GB of usage this month.

This is odd, as the last deployment we ran was on Dec 15 and no other changes have been made that would explain it.

Could someone from the Fly team please review this usage? It doesn’t seem likely that my app was actually sending this much data; there’s nothing out of the ordinary in the logs.

[Screenshot from 2024-01-20 13-00-25]

After rebuilding and redeploying the app, it’s back to the expected near-zero B/s when idle. That means it definitely wasn’t an external cause, or the issue would still be continuing right now.

Hi, I’ve taken a look at your app’s usage and there are two things of note:

  • The high data out is specific to one machine. I can see another machine was created on Jan 20th, but it has nowhere near as much data-out recorded.
  • The data out drops down to a much lower number each time the machine is restarted (or stopped and started), and then ramps back up to a huge number. It looks like it has done this multiple times, starting around Nov 13th (roughly midday).

Does any of this help indicate what might be the cause? It’s tough to reason about without knowing more about what the app is doing.

edit: Looking at your app’s metrics in Grafana, specifically around Jan 7th 21:00 (one point where I can see a huge spike in data-out), there were a ton of egress connections that line up in time with the data-out spike.
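If you want to pull the same per-machine data-out series yourself, Fly exposes a Prometheus-compatible API you can query with standard PromQL. A rough sketch only — the org slug, the app label, and the fly_instance_net_sent_bytes metric name are assumptions here, so double-check them against the metrics docs for your org:

# Rough sketch: org slug, app label, and metric name are assumptions; adjust for your setup.
export FLY_API_TOKEN="$(fly auth token)"

curl -s -G "https://api.fly.io/prometheus/personal/api/v1/query" \
  -H "Authorization: Bearer $FLY_API_TOKEN" \
  --data-urlencode 'query=sum by (instance) (rate(fly_instance_net_sent_bytes{app="rust-socket-dev-syd"}[5m]))'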


Thanks @jfent for the help. I’ve upgraded our Fly.io plan to include support.

I may have been wrong about the metrics being incorrect: Nov 13th and Jan 7th are significant dates, as they’re when we deployed updates to this app and another related app.

However, it’s still very difficult to see why our data egress is climbing incrementally over time, and how it managed to spike to almost 1MB/s (and remain at that level).

Would Fly.io possibly have more detailed info on the egress destinations that you could email to me? Alternatively, we deploy our apps with a Dockerfile, so if you have any scripts/config that we could include to log external connections, that would be appreciated :heart:
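(Something along these lines is what I had in mind — a minimal sketch that just snapshots established outbound TCP connections every minute, assuming ss from iproute2 is available in the image:)

#!/bin/sh
# Hypothetical helper, not an official Fly.io script: append a timestamped
# snapshot of established TCP connections every 60 seconds.
LOG=/var/log/egress-connections.log
while true; do
  echo "=== $(date -u +%Y-%m-%dT%H:%M:%SZ) ===" >> "$LOG"
  ss -tn state established >> "$LOG" 2>&1
  sleep 60
done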

Figured it out with tcpdump and Wireshark… full write-up soon.

TL;DR: The high outbound bandwidth was due to an enlarged metrics response.

The metrics response grew in size due to bot scraping.

Troubleshooting steps

  1. SSH in, extract a packet dump
apt-get update && apt-get install -y tcpdump

tcpdump -i any -w capture.pcap

# wait ~15 mins

exit
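(If the capture gets big, standard tcpdump flags help keep it small — e.g. truncate each packet and skip the SSH session itself. Port 22 here is an assumption; adjust if your SSH session runs on a different port.)

# Optional: keep only the first 96 bytes of each packet and skip SSH traffic
tcpdump -i any -s 96 -w capture.pcap 'not port 22'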
  2. Download packet dump to local machine
flyctl ssh sftp shell -a rust-socket-dev-syd
get /app/capture.pcap
  3. Open capture.pcap in Wireshark and notice the traffic is between internal IPs (and port 8080 is also internal on my app, not exposed to the public network)

This port is only used for health checks and Prometheus metrics.
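(A quick way to confirm which endpoints account for the bytes, without clicking around Wireshark, is tshark’s conversation statistics — a sketch, assuming tshark is installed locally:)

# Summarise TCP conversations in the capture; the biggest byte counts
# point at the chatty peer/port.
tshark -r capture.pcap -q -z conv,tcp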

  4. curl /metrics and see why the response is so large

Raised an issue with a full explanation here, but in short: whenever we received a request with an unknown URL, the metrics response body grew, and it stayed that way until the next time the app was restarted.
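If you want to reproduce the growth yourself, it’s easy to see from inside the machine — a quick sketch (port 8080 and /metrics match my app; the made-up paths just stand in for what the bots were requesting):

# Current size of the metrics body
curl -s http://localhost:8080/metrics | wc -c

# Hit a few unknown URLs, like the bots were doing
for i in 1 2 3; do curl -s -o /dev/null "http://localhost:8080/does-not-exist-$i"; done

# The metrics body is now larger, and stays larger until the app restarts
curl -s http://localhost:8080/metrics | wc -c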


Nice write-up! Happy to hear you were able to figure out the cause.

