Surprisingly high outbound bandwidth charges since 8 Jan

Since 8 Jan, one of my apps has had shockingly high data egress. It’s a pretty simple WebSocket server for our dev environment, but for some reason it recorded 210GB of egress last month, and the Fly bill preview is estimating 860GB of usage this month.

This is odd, as the last deployment we ran was on Dec 15 and no other changes have been made that would explain it.

Could someone from the Fly team please review this usage? It doesn’t seem likely that my app was actually sending this much data; there’s nothing out of the ordinary in the logs.

[Screenshot from 2024-01-20 13-00-25]

After rebuilding and redeploying the app, it’s back to the expected near-zero B/s when idle. That means it definitely wasn’t an external cause, or the issue would still be continuing right now.

Hi, I’ve taken a look at your app’s usage and there are two things of note:

  • The high data out is specific to one machine. I can see another machine was created on Jan 20th, but it has nowhere near as much data-out recorded.
  • The data out drops down to a much lower number each time the machine is restarted (or stopped and started), and then ramps back up to a huge number. It looks like it has done this multiple times, starting around Nov 13th (roughly midday).

Does any of this help indicate what might be the cause? It’s tough to reason about without knowing more about what the app is doing.

edit: Looking at your app’s metrics in Grafana, specifically around Jan 7th 21:00 (one point where I can see a huge spike in data-out), there were a ton of egress connections that line up in time with the data-out spike.
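If you want to pull the same per-machine data-out series yourself, Fly exposes a Prometheus-compatible API you can query with standard PromQL. A rough sketch only — the org slug, the app label, and the fly_instance_net_sent_bytes metric name are assumptions here, so double-check them against the metrics docs for your org:

# Rough sketch: org slug, app label, and metric name are assumptions; adjust for your setup.
export FLY_API_TOKEN="$(fly auth token)"

curl -s -G "https://api.fly.io/prometheus/personal/api/v1/query" \
  -H "Authorization: Bearer $FLY_API_TOKEN" \
  --data-urlencode 'query=sum by (instance) (rate(fly_instance_net_sent_bytes{app="rust-socket-dev-syd"}[5m]))'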


Thanks @jfent for the help. I’ve upgraded our Fly.io plan to include support.

I may have been wrong about the metrics being incorrect: Nov 13th and Jan 7th are significant dates, as they’re when we deployed updates to this app and another related app.

However, it’s still very difficult to see why our data egress is climbing incrementally over time, and how it managed to spike to almost 1MB/s (and remain at that level).

Would Fly.io possibly have more detailed info on the egress destinations that you could email to me? Alternatively, we deploy our apps with a Dockerfile, so if you have any scripts/config that we could include to log external connections, that would be appreciated :heart:
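(Something along these lines is what I had in mind — a minimal sketch that just snapshots established outbound TCP connections every minute, assuming ss from iproute2 is available in the image:)

#!/bin/sh
# Hypothetical helper, not an official Fly.io script: append a timestamped
# snapshot of established TCP connections every 60 seconds.
LOG=/var/log/egress-connections.log
while true; do
  echo "=== $(date -u +%Y-%m-%dT%H:%M:%SZ) ===" >> "$LOG"
  ss -tn state established >> "$LOG" 2>&1
  sleep 60
done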

Figured it out with tcpdump and Wireshark… full write-up soon.

TL;DR: The high outbound bandwidth was due to an enlarged metrics response.

The metrics response grew in size due to bot scraping.

Troubleshooting steps

  1. SSH in, extract a packet dump
apt-get update && apt-get install -y tcpdump

tcpdump -i any -w capture.pcap

# wait ~15 mins

exit
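(If the capture gets big, standard tcpdump flags help keep it small — e.g. truncate each packet and skip the SSH session itself. Port 22 here is an assumption; adjust if your SSH session runs on a different port.)

# Optional: keep only the first 96 bytes of each packet and skip SSH traffic
tcpdump -i any -s 96 -w capture.pcap 'not port 22'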
  2. Download packet dump to local machine
flyctl ssh sftp shell -a rust-socket-dev-syd
get /app/capture.pcap
  3. Open capture.pcap in Wireshark and notice the traffic is between internal IPs (and port 8080 is also internal on my app, not exposed to the public network)

This port is only used for health checks and Prometheus metrics.
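(A quick way to confirm which endpoints account for the bytes, without clicking around Wireshark, is tshark’s conversation statistics — a sketch, assuming tshark is installed locally:)

# Summarise TCP conversations in the capture; the biggest byte counts
# point at the chatty peer/port.
tshark -r capture.pcap -q -z conv,tcp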

  4. curl /metrics and see why the response is so large

Raised an issue with a full explanation here, but in short: whenever we received a request with an unknown URL, the metrics response body grew, and it stayed that way until the next time the app was restarted.
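If you want to reproduce the growth yourself, it’s easy to see from inside the machine — a quick sketch (port 8080 and /metrics match my app; the made-up paths just stand in for what the bots were requesting):

# Current size of the metrics body
curl -s http://localhost:8080/metrics | wc -c

# Hit a few unknown URLs, like the bots were doing
for i in 1 2 3; do curl -s -o /dev/null "http://localhost:8080/does-not-exist-$i"; done

# The metrics body is now larger, and stays larger until the app restarts
curl -s http://localhost:8080/metrics | wc -c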


Nice write-up! Happy to hear you were able to figure out the cause.

