How to debug why Squid Proxy delivers 500 error codes intermittently?

Tbh I'm not really sure how to ask this.

I have 1 node running Squid (it has a static egress IP), and my other 4 nodes connect through that node as a proxy to call a third-party API.

Is this related to Egress IP sometimes not working 100%?

Suddenly, after 3 AM Jakarta time, the proxy stopped working. I tried changing the Squid config, but still no result.

At this point I'm not really sure why or how, because I wasn't changing anything at that time, and I suddenly received an email alert saying the third-party API could not be called.

So I wrote simple code that checks my endpoint 1000 times to see whether it fails or not (a rough sketch of it is below the summary).

Summary:
200: 815
403: 0
500: 65
other: 0
error: 120

500 = My Proxy Fails

At this point I'm not sure whether the error is because Squid can't handle the connections, or something else.
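The checker is roughly along these lines (simplified; the endpoint URL and proxy address here are placeholders, not the real ones):

```python
import requests

# Placeholders: substitute the real third-party endpoint and the Squid node's address.
TARGET_URL = "https://api.example-gateway.com/health"
PROXIES = {
    "http": "http://my-squid-node:3128",
    "https": "http://my-squid-node:3128",
}

def check(num_requests: int = 1000) -> dict:
    """Call the endpoint through the proxy and tally what comes back."""
    summary = {"200": 0, "403": 0, "500": 0, "other": 0, "error": 0}
    for _ in range(num_requests):
        try:
            resp = requests.get(TARGET_URL, proxies=PROXIES, timeout=10)
            key = str(resp.status_code)
            summary[key if key in summary else "other"] += 1
        except requests.RequestException:
            # Timeouts, refused connections, resets, etc. all land here.
            summary["error"] += 1
    return summary

if __name__ == "__main__":
    print(check())
```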

I doubt there is enough information here, or enough clarity, for readers to help you.

Consider adding a diagram of your apps/machines, so readers can see where traffic is coming in. In general if your Squid Proxy is giving 500 errors intermittently, it is possible that your app is failing, and thus Squid is fine. What HTTP status code is “error”?

Can you run your 1000-check script inside your app, using flyctl ssh console? This will skip the proxy, and you can then see if your app is to blame.
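For example (a rough sketch with placeholder URLs, not your real ones), running the same check with and without the proxy side by side makes the comparison obvious:

```python
import requests

TARGET_URL = "https://api.example-gateway.com/health"            # placeholder
PROXIES = {"http": "http://my-squid-node:3128",
           "https": "http://my-squid-node:3128"}                 # placeholder

def status(proxies=None) -> str:
    """One request; returns the status code, or the exception name on failure."""
    try:
        return str(requests.get(TARGET_URL, proxies=proxies, timeout=10).status_code)
    except requests.RequestException as exc:
        return type(exc).__name__

# If the direct call behaves while the proxied one times out, the proxy (or the
# path to it) is the problem rather than the app or the third party.
print("direct :", status())
print("proxied:", status(PROXIES))
```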

Also check your app logs and your proxy logs.

I think you were right; I'm sorry for the lack of information.

For now I will try to build a reproducible scenario, but the hard thing is that this issue only occurs maybe once a month: for a whole day the Squid proxy suddenly acts weird, even though I've already restarted it.

Like right now, everything works fine again even though I've done nothing besides eating my ramen :smiling_face_with_three_hearts:

I created a mini project, https://reproduce-proxy.fly.dev/test-proxy?num_requests=250, to check whether the proxy is failing or not, so maybe the next time I see the error I can give more information.

No worries. You should find that, even if the problem can no longer be reproduced, you can look at your app or proxy logs for the time the error happened, and the relevant logs will still be there. Is it worth having a look now?

The only logs I see are from my Python Django app: requests to the payment gateway service got rejected (timeout) because it could not connect to the proxy.

```python
except requests.exceptions.ConnectTimeout as e:
    raise HTTPException(status_code=502, detail=f"Proxy connection timeout: {str(e)}")
```
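For context, the surrounding call is roughly like this (a simplified sketch: the gateway URL, proxy address, and helper name are placeholders, and the HTTPException import shown is the FastAPI/Starlette one that matches the status_code/detail signature above):

```python
import requests
from fastapi import HTTPException  # assumption: matches the status_code/detail signature used above

PROXIES = {"http": "http://my-squid-node:3128",
           "https": "http://my-squid-node:3128"}        # placeholder proxy address

def call_gateway(payload: dict) -> dict:                # hypothetical helper name
    try:
        resp = requests.post(
            "https://api.example-gateway.com/withdraw",  # placeholder URL
            json=payload,
            proxies=PROXIES,
            timeout=(5, 30),  # (connect, read) timeouts in seconds
        )
        resp.raise_for_status()
        return resp.json()
    except requests.exceptions.ConnectTimeout as e:
        # This is the branch that fires during the incidents: the TCP connection
        # to the Squid node itself times out before the gateway is ever reached.
        raise HTTPException(status_code=502, detail=f"Proxy connection timeout: {str(e)}")
```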

The thing is, it only happens very randomly: sometimes once a month, but sometimes a full month is okay.
I set up custom Grafana monitoring in May because I wanted to keep an eye on this issue.

What usually happens is:

I wake up
check my messages: 50+ users complaining in our WhatsApp group that they cannot withdraw their money from our site
check the logs
they say proxy timeout
try to restart the Squid server
no luck
1 hour later it's suddenly working again

This has already happened like 5-8 times since we migrated to Fly last year.

The issue today was that the withdrawal service was down for about 8 hours, from 3 AM to 12 PM Jakarta time, and then recovered again, but with a low error rate (some users still got the error, but most were fine). Around 15:00 / 3 PM Jakarta time the error occurred again, and today at 6 PM Jakarta time, while I was building the reproduction, everything was fine again.

Currently my mitigation is to increase the proxy's memory (it currently has only 512 MB of RAM), because I saw some spikes in memory usage that I can't explain.

This is my Squid config:

```
# Minimal permissive Squid forward proxy
# WARNING: This is an open proxy configuration. Restrict access before exposing publicly.

# Listen on port 8080
http_port 3128

# Standard ACLs for allowed ports
acl SSL_ports port 443
acl Safe_ports port 80          # http
acl Safe_ports port 21          # ftp
acl Safe_ports port 443         # https
acl Safe_ports port 70          # gopher
acl Safe_ports port 210         # wais
acl Safe_ports port 1025-65535  # unregistered ports
acl Safe_ports port 280         # http-mgmt
acl Safe_ports port 488         # gss-http
acl Safe_ports port 591         # filemaker
acl Safe_ports port 777         # multiling http
acl CONNECT method CONNECT

# Allow everything (OPEN PROXY). Harden before production use.
http_access allow all

# Optional: Disable on-disk caching
cache deny all
cache_mem 0 MB
maximum_object_size 0 KB
cache_dir null /tmp

# DNS and forwarding behavior
dns_v4_first on
forwarded_for on
via off

# Keep Squid quiet-ish
shutdown_lifetime 1 seconds
```
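For reference, hardening it as the warning suggests (replacing `http_access allow all`) could look roughly like this. It's only a sketch: it assumes the four app nodes reach the proxy over Fly's private 6PN network, so the `fdaa::/16` source range is an assumption to adjust.

```
# Replace "http_access allow all" with rules that only admit the app nodes.
acl app_nodes src fdaa::/16          # assumption: Fly 6PN private range; tighten to your real subnets
http_access deny !Safe_ports
http_access deny CONNECT !SSL_ports
http_access allow app_nodes
http_access deny all
```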

And the architecture looks like this.
The reason I need to use a proxy is that the payment gateway only allows me to register a limited number of IPs (4 max, I think).

Since I know most of our users (it's just a small app with a few thousand users), I just tell them "okay, wait, let me fix this" (while doing absolutely nothing other than restarting and praying) and ask them to try the feature again after a couple of hours.

My guess is the RAM, but spending another $10/mo on RAM for a proxy server is a little bit….

Ah, could you be affected by the current network outage? You mention users in India.

We are continuing to work with our external connectivity providers to address the service degradation caused by multiple submarine cable outages. Traffic between India and Europe/US East is impacted. Since this is an external incident, some degradation may still occur. We’ll provide another update as soon as we have more information.

Did you have this problem before 6th September?

(The diagram is very useful. Consider adding that sort of thing in your first post, where you have a question of this type.)

I would put a direct uptime monitor on the payment gateway, and also I would maybe add a mode in one of the Django apps to contact the payment gateway directly, to see if the success rate is better in that one app.
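That mode could be as small as an environment-variable toggle, something like this sketch (the variable name and URLs are made up):

```python
import os
import requests

# Hypothetical toggle: set GATEWAY_DIRECT=1 on one instance to bypass Squid.
USE_PROXY = os.environ.get("GATEWAY_DIRECT", "0") != "1"

PROXIES = (
    {"http": "http://my-squid-node:3128", "https": "http://my-squid-node:3128"}  # placeholder
    if USE_PROXY
    else None
)

def gateway_health_status() -> int:
    """requests treats proxies=None as 'connect directly'."""
    resp = requests.get(
        "https://api.example-gateway.com/health",  # placeholder URL
        proxies=PROXIES,
        timeout=10,
    )
    return resp.status_code
```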

In relation to logs, I would be certain that if you are getting proxy timeouts, there will be logs at the proxy. Keep looking for relevant logs: they do exist, and you’re just not finding them.

My users are from Indonesia and we deploy in region SIN (Singapore).

You were probably right; right now I think my Squid log is not enabled.

Tbh this is a good idea, I'll try that; but since the gateway is IP-whitelisted, the direct calls will probably just get 403s. I'll still be able to compare, though. Thanks.

OK, that’s the first thing you can fix then, so you can understand the problem next time it happens. :trophy:
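For example, turning on Squid's logs is usually just a couple of directives in squid.conf (the paths below assume a typical Debian-style layout; adjust to wherever your image keeps its logs):

```
# One line per client request, including the status code Squid handed back.
access_log /var/log/squid/access.log squid

# Startup, DNS, and error messages from the Squid daemon itself.
cache_log /var/log/squid/cache.log

# Keep a few rotations around so a 3 AM incident is still on disk in the morning.
logfile_rotate 5
```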
