Hello everyone,
I’m hoping to get some insights from the community or the Fly.io team on a very stubborn connection timeout issue. I have a Python application using Tortoise-ORM and `asyncpg` to connect to an external Neon database. The connection consistently times out, but only when running on Fly.io.
I’ve spent a lot of time debugging this and have gathered a lot of evidence, but I’m now officially stuck.
TL;DR - The Core Mystery
- My `release_command`, which runs a script using `Tortoise.init()` and `Tortoise.generate_schemas()`, always fails with a `TimeoutError`.
- A standalone Python script using `asyncpg` directly succeeds from inside the same Fly.io VM, but only after we added Neon’s special `&options=endpoint%3D...` parameter to the database URL.
- However, even when we take those exact, known-working connection parameters (host, port, user, pass, ssl, server_settings) and pass them explicitly to `Tortoise.init()`, the application still times out.
This suggests the issue is not the network itself, but a strange interaction between Tortoise-ORM and the Fly.io environment.
The Full Debugging Journey
Here is a log of everything we’ve tested to isolate the problem:
- It’s Environment-Specific: The app connects instantly from my local machine. The timeout only occurs on Fly.io.
- It’s Not a Firewall: A basic `nc -vz <neon-host> 5432` test from inside the Fly VM succeeds. The port is open and reachable.
- It’s Not a CA Certificate Issue: We ensured `ca-certificates` was installed in the `python:3.12-slim` image. This had no effect.
- It’s an SNI (Server Name Indication) Issue: The breakthrough came when we realized this was related to how Neon routes connections. A direct IP connection failed with an `Endpoint ID is not specified` error, which Neon’s documentation says is due to a lack of SNI. This suggested the original timeout was a symptom of a failed SNI handshake.
- A Standalone Script Works! Based on the SNI theory, we created a minimal test script using `asyncpg` directly. When we added the `&options=endpoint%3D...` parameter to the URL, this script worked perfectly from inside the Fly VM.
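For anyone who wants to repeat the `nc` check one layer higher, here is a rough sketch of a probe that also exercises the TLS handshake with SNI. PostgreSQL does not start TLS immediately: the client first sends an 8-byte `SSLRequest` packet and waits for a one-byte `S` reply. The function names (`build_ssl_request`, `probe_tls`) are made up for illustration, and this only confirms that a handshake with SNI completes; it says nothing about authentication or Neon’s endpoint routing.

```python
import socket
import ssl
import struct

# Request code of PostgreSQL's SSLRequest message (1234 << 16 | 5679).
SSL_REQUEST_CODE = 80877103


def build_ssl_request() -> bytes:
    """Return the 8-byte SSLRequest packet: int32 length, then int32 code."""
    return struct.pack("!II", 8, SSL_REQUEST_CODE)


def probe_tls(host: str, port: int = 5432, timeout: float = 10.0) -> str:
    """Open TCP, ask the server for TLS, then handshake with SNI set to `host`.

    Returns the negotiated TLS version string. Raises on timeout, on a
    non-'S' reply, or on a certificate/handshake failure.
    """
    with socket.create_connection((host, port), timeout=timeout) as raw:
        raw.sendall(build_ssl_request())
        reply = raw.recv(1)
        if reply != b"S":
            raise RuntimeError(f"server did not accept SSLRequest: {reply!r}")
        context = ssl.create_default_context()
        # server_hostname is what puts the SNI extension into the ClientHello.
        with context.wrap_socket(raw, server_hostname=host) as tls:
            return tls.version() or "unknown"
```

Calling `probe_tls("<neon-host>")` from inside the Fly VM and from a local machine should show whether the TLS layer behaves the same in both places.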
This is the script that SUCCEEDS:
# test_db.py
import asyncio
import asyncpg
import os


async def main():
    # Using the full DATABASE_URL with the 'options' param set as a secret
    db_url = os.getenv("DATABASE_URL")
    print("Attempting to connect with raw asyncpg...")
    try:
        conn = await asyncpg.connect(dsn=db_url, timeout=15)
        result = await conn.fetchval('SELECT 1;')
        print(f"✅ SUCCESS: Raw asyncpg connection works. Result: {result}")
        await conn.close()
    except Exception as e:
        print("❌ FAILED: Raw asyncpg connection failed.")
        print(f" Error: {e}")


if __name__ == "__main__":
    asyncio.run(main())
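A possible intermediate experiment is to break the same URL into explicit keyword arguments and pass those to raw `asyncpg`, which would separate "URL parsing" from "Tortoise" as the variable. The sketch below uses only the standard library. `asyncpg_kwargs_from_url` is a hypothetical helper, and it rests on an assumption I have not verified against the asyncpg source: that a DSN query parameter such as `options`, which is not one of asyncpg’s own connection arguments, ends up in `server_settings` under the same key.

```python
from urllib.parse import parse_qs, unquote, urlparse


def asyncpg_kwargs_from_url(url: str) -> dict:
    """Split a postgres URL into keyword arguments for asyncpg.connect()."""
    parsed = urlparse(url)
    query = parse_qs(parsed.query)  # parse_qs percent-decodes the values

    kwargs = {
        "host": parsed.hostname,
        "port": parsed.port or 5432,  # the URL may omit the port entirely
        # urlparse leaves percent-encoding in user/password, so decode it.
        "user": unquote(parsed.username) if parsed.username else None,
        "password": unquote(parsed.password) if parsed.password else None,
        "database": parsed.path.lstrip("/"),
    }

    sslmode = query.get("sslmode", [None])[0]
    if sslmode:
        # asyncpg accepts sslmode-style strings such as "require" for `ssl`.
        kwargs["ssl"] = sslmode

    options = query.get("options", [None])[0]
    if options:
        # ASSUMPTION: keep the key "options" and the whole decoded value
        # ("endpoint=<id>"), mirroring what the DSN form is believed to send.
        kwargs["server_settings"] = {"options": options}

    return kwargs
```

Swapping `asyncpg.connect(dsn=db_url, timeout=15)` for `asyncpg.connect(**asyncpg_kwargs_from_url(db_url), timeout=15)` in the script above would then show whether the explicit form still connects.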
The New Mystery: Why Does Tortoise-ORM Still Fail?
We thought we had the solution. The next logical step was to ensure Tortoise-ORM used these exact same parameters. To rule out any URL parsing bugs in Tortoise, we modified our release script to manually parse the URL and pass the parameters explicitly to `Tortoise.init()`.
This is the `release_command` script that STILL FAILS with a `TimeoutError`:
# postgres_setup.py (simplified)
import os
from urllib.parse import urlparse, parse_qs
from tortoise import Tortoise


async def setup_db():
    full_db_url = os.getenv("DATABASE_URL")  # This is the known-good URL
    parsed_url = urlparse(full_db_url)
    query_params = parse_qs(parsed_url.query)

    # 'options' decodes to "endpoint=<id>"; split it into a key/value pair
    options_str = query_params.get('options', [None])[0]
    server_settings = {}
    if options_str:
        key, value = options_str.split('=', 1)
        server_settings[key] = value

    print("Attempting to connect with Tortoise.init()...")
    try:
        await Tortoise.init(
            credentials={
                "host": parsed_url.hostname,
                "port": parsed_url.port,
                "user": parsed_url.username,
                "password": parsed_url.password,
                "database": parsed_url.path.lstrip('/'),
            },
            server_settings=server_settings,
            ssl=query_params.get('sslmode', [None])[0],
            modules={"models": ["app.models"]},
            db_type="postgres"
        )
        await Tortoise.generate_schemas()
        print("✅ SUCCESS: Tortoise connection works.")
        await Tortoise.close_connections()
    except Exception as e:
        print("❌ FAILED: Tortoise connection failed.")
        print(f" Error: {e}")

# ... run_async(setup_db()) ...
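One difference worth ruling out: the script above splits `endpoint=<id>` and stores it under the key `endpoint`, while the working URL carries the whole string under the key `options`. Below is a sketch of the same parameters expressed in the `config` dict form that `Tortoise.init(config=...)` accepts, keeping `options` as the key. `tortoise_config_from_url` is a hypothetical helper, and whether Tortoise forwards `ssl` and `server_settings` from `credentials` unchanged to asyncpg is an assumption to check against the installed Tortoise version, not something I have confirmed.

```python
from urllib.parse import parse_qs, unquote, urlparse


def tortoise_config_from_url(url: str, models: list) -> dict:
    """Build a Tortoise-ORM style config dict from a postgres URL."""
    parsed = urlparse(url)
    query = parse_qs(parsed.query)  # values come back percent-decoded

    credentials = {
        "host": parsed.hostname,
        "port": parsed.port or 5432,  # fall back if the URL has no port
        "user": unquote(parsed.username) if parsed.username else None,
        "password": unquote(parsed.password) if parsed.password else None,
        "database": parsed.path.lstrip("/"),
    }

    sslmode = query.get("sslmode", [None])[0]
    if sslmode:
        credentials["ssl"] = sslmode  # ASSUMPTION: passed through to asyncpg

    options = query.get("options", [None])[0]
    if options:
        # ASSUMPTION: passed through to asyncpg. The key stays "options" and
        # the value stays "endpoint=<id>", unlike the split in the script above.
        credentials["server_settings"] = {"options": options}

    return {
        "connections": {
            "default": {
                "engine": "tortoise.backends.asyncpg",
                "credentials": credentials,
            }
        },
        "apps": {
            "models": {"models": models, "default_connection": "default"},
        },
    }
```

It would be used as `await Tortoise.init(config=tortoise_config_from_url(os.environ["DATABASE_URL"], ["app.models"]))`. If this variant connects, the timeout would point at how the parameters were being handed to Tortoise instead of at the Fly.io environment.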
My Question for the Community
- Has anyone else experienced an issue where a high-level library like Tortoise-ORM fails to connect, but the underlying driver (`asyncpg`) works fine in a standalone script within the same environment?
- For the Fly.io team: This is where I’m truly stuck. A connection is clearly possible from the VM. But it seems like when the connection is initiated through the Tortoise library, it fails. Could there be an environmental factor specific to the `release_command` machine (resource limits, process initialization, network policies) that could interfere with Tortoise but not a simple script?
I would be grateful for any ideas or insights the community or staff might have. Thank you!