Idle Redis connections eventually begin to hang, high Redis latency

I’ve been playing around with Fly.io and am absolutely in love already, though of course I’m spotting a few rough edges along the way :slight_smile:

It seems when connecting to Redis (from Python aioredis), an idle connection will eventually begin to hang, seemingly after a delay of at least 10-15 mins. Is there something like a connection/state timeout implemented perhaps somewhere in the networking layer that would eventually just start dropping packets rather than cause RSTs to be returned?

The trouble with the hangs is that aioredis in its current state doesn’t cope with them at all. From a quick glance, it seems neither the Connection nor the Pool class is aware of this possibility.

In the process of fixing that, I changed my code to reconnect to Redis on each request, and dropped some timing print()s around the connection attempt. The results are both consistent and surprising (for LHR):

2021-03-21T13:28:09.509Z e8beb63b lhr [info] set sock opt
2021-03-21T13:28:09.511Z e8beb63b lhr [info] prepare req
2021-03-21T13:28:09.511Z e8beb63b lhr [info] make strm resp
2021-03-21T13:28:09.572Z e8beb63b lhr [info] redis conn took 0.05867815017700195
2021-03-21T13:28:09.573Z e8beb63b lhr [info] lookup in redis
2021-03-21T13:28:09.630Z e8beb63b lhr [info] lookup in redis took 0.05709409713745117
2021-03-21T13:28:09.642Z e8beb63b lhr [info] redis close took 8.511543273925781e-05

Edit: also similar latency in CDG:

2021-03-21T13:45:12.872Z da478d62 cdg [info] redis conn took 0.03713679313659668
2021-03-21T13:45:12.873Z da478d62 cdg [info] lookup in redis
2021-03-21T13:45:12.925Z da478d62 cdg [info] lookup in redis took 0.05064725875854492
2021-03-21T13:45:12.938Z da478d62 cdg [info] redis close took 6.628036499023438e-05
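
For anyone wanting to collect comparable numbers, here is a minimal sketch of timing an awaited step. The `timed` helper is a hypothetical name made up for illustration, not the code that produced the logs above, and it uses `time.perf_counter()` rather than whatever clock the original prints used.

```python
import asyncio
import time


async def timed(label, awaitable):
    """Await `awaitable`, print the elapsed time, and return (result, seconds).

    `timed` is a hypothetical helper for illustration only.
    """
    start = time.perf_counter()
    result = await awaitable
    elapsed = time.perf_counter() - start
    print(f"{label} took {elapsed}")
    return result, elapsed


async def demo():
    # Stand-in for a network call; a real run would wrap the Redis connect,
    # lookup, and close steps instead of asyncio.sleep().
    return await timed("sleep", asyncio.sleep(0.01, result="done"))
```

Wrapping the connect, lookup, and close calls separately gives one line per step, which makes it easier to see whether the cost is in connection setup or in the round trip itself.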

I see this problem isn’t new:

But in that case, it was for another colo. Is there perhaps some tweak that can be applied to LHR that was previously used to fix the issue at the other colo? :slight_smile:

I really cannot overstate how much I love this service model, and I’m dearly hoping your bandwidth pricing survives the test of time (including any eventual influx of content-heavy customers like video). It could easily make some businesses possible while bankrupting others!

Thanks

When this happens, it’s almost always a connection timing out because it’s idle. Drivers have various ways of dealing with this, but they basically need to PING Redis occasionally and recreate the connection if it idles out.
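
The ping-and-recreate idea can be sketched roughly like this. It is a library-agnostic illustration, not aioredis's own API: `KeepAliveConnection`, `ensure_alive`, and the `connect` factory are hypothetical names, and the sketch assumes the connection object exposes an awaitable `ping()` and a synchronous `close()`. The important part is the timeout around the PING, since a hung connection never raises on its own.

```python
import asyncio


class KeepAliveConnection:
    """Hypothetical wrapper: PING before use, reconnect if the PING fails.

    `connect` is an assumed async factory returning an object with an
    awaitable `ping()` and a synchronous `close()`.
    """

    def __init__(self, connect, ping_timeout=2.0):
        self._connect = connect
        self._ping_timeout = ping_timeout
        self.conn = None
        self.connects = 0  # how many times a connection was (re)created

    async def ensure_alive(self):
        """Return a connection that answered a PING within the timeout."""
        if self.conn is not None:
            try:
                # A silently dropped connection hangs instead of erroring,
                # so the PING must be bounded by a timeout.
                await asyncio.wait_for(self.conn.ping(), self._ping_timeout)
                return self.conn
            except (asyncio.TimeoutError, ConnectionError, OSError):
                self._discard()
        self.conn = await self._connect()
        self.connects += 1
        return self.conn

    def _discard(self):
        """Drop the current connection, ignoring errors while closing it."""
        try:
            self.conn.close()
        except Exception:
            pass
        self.conn = None


async def keepalive_loop(wrapper, interval=60.0):
    """Ping in the background so the connection never sits idle for long."""
    while True:
        await wrapper.ensure_alive()
        await asyncio.sleep(interval)
```

Calling `ensure_alive()` at the start of each request, or running `keepalive_loop` as a background task with an interval well under the idle timeout, are two ways to apply it. Which one fits depends on how bursty the traffic is.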

The aioredis docs are vague, but it seems like using a connection pool might make reconnects work:

redis = await aioredis.create_redis_pool('redis://localhost')

You are going to be better off running dedicated Redis instances on Fly than using the shared Redis. Since the shared service connects to multitenant instances through our load balancer, drivers have to be especially good at handling “normal” connection disruptions. Dedicated Redis instances let you connect directly over the private network and keep things simpler.