In-Depth Debugging: Tortoise-ORM Timeout to Neon DB (but raw asyncpg works)

Hello everyone,

I’m hoping to get some insights from the community or the Fly.io team on a very stubborn connection timeout issue. I have a Python application using Tortoise-ORM and asyncpg to connect to an external Neon database. The connection consistently times out, but only when running on Fly.io.

I’ve spent a lot of time debugging this and have gathered a lot of evidence, but I’m now officially stuck.

TL;DR - The Core Mystery

  • My release_command, which runs a script using Tortoise.init() and Tortoise.generate_schemas(), always fails with a TimeoutError.
  • A standalone Python script using asyncpg directly succeeds from inside the same Fly.io VM, but only after we added Neon’s special &options=endpoint%3D... parameter to the database URL.
  • However, even when we take those exact, known-working connection parameters (host, port, user, pass, ssl, server_settings) and pass them explicitly to Tortoise.init(), the application still times out.

This suggests the issue is not the network itself, but a strange interaction between Tortoise-ORM and the Fly.io environment.


The Full Debugging Journey

Here is a log of everything we’ve tested to isolate the problem:

  1. It’s Environment-Specific: The app connects instantly from my local machine. The timeout only occurs on Fly.io.
  2. It’s Not a Firewall: A basic nc -vz <neon-host> 5432 test from inside the Fly VM succeeds. The port is open and reachable.
  3. It’s Not a CA Certificate Issue: We ensured ca-certificates was installed in the python:3.12-slim image. This had no effect.
  4. It’s an SNI (Server Name Indication) Issue: The breakthrough came when we realized this was related to how Neon routes connections. A direct IP connection failed with an "Endpoint ID is not specified" error, which Neon’s documentation attributes to missing SNI. This suggested the original timeout was a symptom of a failed SNI handshake.
  5. A Standalone Script Works! Based on the SNI theory, we created a minimal test script using asyncpg directly. When we added the &options=endpoint%3D... parameter to the URL, this script worked perfectly from inside the Fly VM.

This is the script that SUCCEEDS:

# test_db.py
import asyncio
import asyncpg
import os

async def main():
    # Using the full DATABASE_URL with the 'options' param set as a secret
    db_url = os.getenv("DATABASE_URL")
    print("Attempting to connect with raw asyncpg...")
    try:
        conn = await asyncpg.connect(dsn=db_url, timeout=15)
        result = await conn.fetchval('SELECT 1;')
        print(f"✅ SUCCESS: Raw asyncpg connection works. Result: {result}")
        await conn.close()
    except Exception as e:
        print("❌ FAILED: Raw asyncpg connection failed.")
        print(f"   Error: {e}")

if __name__ == "__main__":
    asyncio.run(main())
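
For anyone who wants to separate "the port is unreachable" from "the Postgres handshake stalls", the nc check from step 2 can also be repeated from Python with timing. This is only a rough sketch: probe_tcp is a hypothetical helper name, not something from asyncpg or Tortoise, and it tests nothing beyond the bare TCP connect (no TLS, no Postgres startup message).

```python
import asyncio
import time


async def probe_tcp(host: str, port: int, timeout: float = 5.0) -> dict:
    """Time a bare TCP connect to host:port (hypothetical diagnostic helper).

    Returns a dict with 'ok', 'seconds' and 'error' so the result can be
    printed next to the asyncpg/Tortoise output in the release logs.
    """
    start = time.monotonic()
    try:
        reader, writer = await asyncio.wait_for(
            asyncio.open_connection(host, port), timeout=timeout
        )
    except (OSError, asyncio.TimeoutError) as exc:
        # Refused, unreachable, DNS failure, or no answer within `timeout`.
        return {"ok": False, "seconds": time.monotonic() - start, "error": repr(exc)}

    elapsed = time.monotonic() - start
    writer.close()
    try:
        await writer.wait_closed()
    except OSError:
        # The peer may already have dropped the socket; irrelevant for a probe.
        pass
    return {"ok": True, "seconds": elapsed, "error": None}
```

Calling print(await probe_tcp(host, 5432)) at the top of both the standalone script and the release script would show whether the two Machines even see the same TCP behaviour before any driver code runs.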

The New Mystery: Why Does Tortoise-ORM Still Fail?

We thought we had the solution. The next logical step was to ensure Tortoise-ORM used these exact same parameters. To rule out any URL parsing bugs in Tortoise, we modified our release script to manually parse the URL and pass the parameters explicitly to Tortoise.init().

This is the release_command script that STILL FAILS with a TimeoutError:

# postgres_setup.py (simplified)
import os
from urllib.parse import urlparse, parse_qs
from tortoise import Tortoise

async def setup_db():
    full_db_url = os.getenv("DATABASE_URL") # This is the known-good URL
    parsed_url = urlparse(full_db_url)
    query_params = parse_qs(parsed_url.query)
    options_str = query_params.get('options', [None])[0]
    
    server_settings = {}
    if options_str:
        key, value = options_str.split('=', 1)
        server_settings[key] = value

    print("Attempting to connect with Tortoise.init()...")
    try:
        await Tortoise.init(
            credentials={
                "host": parsed_url.hostname,
                "port": parsed_url.port,
                "user": parsed_url.username,
                "password": parsed_url.password,
                "database": parsed_url.path.lstrip('/'),
            },
            server_settings=server_settings,
            ssl=query_params.get('sslmode', [None])[0],
            modules={"models": ["app.models"]},
            db_type="postgres"
        )
        await Tortoise.generate_schemas()
        print("✅ SUCCESS: Tortoise connection works.")
        await Tortoise.close_connections()
    except Exception as e:
        print("❌ FAILED: Tortoise connection failed.")
        print(f"   Error: {e}")

# ... run_async(setup_db()) ...
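
One thing that may be worth ruling out is how the parameters reach asyncpg. As far as I can tell from the Tortoise-ORM docs, Tortoise.init() takes either db_url plus modules or a config dict, and connection details live under connections -> credentials. Below is a minimal sketch of building such a dict from the same URL. build_tortoise_config is a hypothetical helper, and two things in it are assumptions I have not verified against this setup: that extra keys in credentials (ssl, server_settings) are forwarded to asyncpg, and that Neon's endpoint option belongs under the "options" startup parameter (i.e. server_settings={"options": "endpoint=..."}), which is how I understand raw asyncpg treats unrecognised DSN query parameters.

```python
from urllib.parse import parse_qs, unquote, urlparse


def build_tortoise_config(db_url: str, models: list[str]) -> dict:
    """Translate a Postgres URL into a Tortoise-ORM config dict (hypothetical helper)."""
    parsed = urlparse(db_url)
    query = parse_qs(parsed.query)  # also decodes %3D back to '='

    credentials = {
        "host": parsed.hostname,
        "port": parsed.port or 5432,
        # urlparse leaves percent-escapes in place, so decode them here.
        "user": unquote(parsed.username) if parsed.username else None,
        "password": unquote(parsed.password) if parsed.password else None,
        "database": parsed.path.lstrip("/"),
    }

    sslmode = query.get("sslmode", [None])[0]
    if sslmode:
        # ASSUMPTION: extra credential keys are passed through to asyncpg,
        # whose `ssl` argument accepts sslmode-style strings such as "require".
        credentials["ssl"] = sslmode

    options = query.get("options", [None])[0]
    if options:
        # ASSUMPTION: the whole "endpoint=<id>" string is sent as the value of
        # the "options" startup parameter, rather than as a key named "endpoint".
        credentials["server_settings"] = {"options": options}

    return {
        "connections": {
            "default": {
                "engine": "tortoise.backends.asyncpg",
                "credentials": credentials,
            }
        },
        "apps": {
            "models": {"models": models, "default_connection": "default"},
        },
    }
```

It would then be used as await Tortoise.init(config=build_tortoise_config(os.getenv("DATABASE_URL"), ["app.models"])). If this still times out on the same Machine where the raw asyncpg script succeeds, that would at least rule out parameter mapping as the cause.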

My Questions for the Community

  1. Has anyone else experienced an issue where a high-level library like Tortoise-ORM fails to connect, but the underlying driver (asyncpg) works fine in a standalone script within the same environment?
  2. For the Fly.io team: This is where I’m truly stuck. A connection is clearly possible from the VM. But it seems like when the connection is initiated through the Tortoise library, it fails. Could there be an environmental factor specific to the release_command machine (resource limits, process initialization, network policies) that could interfere with Tortoise but not a simple script?

I would be grateful for any ideas or insights the community or staff might have. Thank you!

Hi… There were several other reports of sporadic connectivity glitches to neon.tech recently, so I wonder whether it’s the particular Machines rather than the scripts themselves that made the difference.

(The release_command runs on a new, ephemeral Machine each time.)

I think it’s related to the iad region. I deployed to another region and it worked!
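
If the failures really are sporadic and tied to particular Machines or regions, a retry with backoff inside the release script might make deploys less fragile. This is a generic sketch, not something from Tortoise or Fly.io: retry_async is a hypothetical helper, and the attempt count and delays are arbitrary placeholders.

```python
import asyncio
from typing import Awaitable, Callable, Tuple, Type, TypeVar

T = TypeVar("T")


async def retry_async(
    operation: Callable[[], Awaitable[T]],
    attempts: int = 4,
    base_delay: float = 1.0,
    retry_on: Tuple[Type[BaseException], ...] = (
        asyncio.TimeoutError,
        TimeoutError,
        OSError,
    ),
) -> T:
    """Await operation(), retrying with exponential backoff on connection errors."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")

    for attempt in range(1, attempts + 1):
        try:
            # A fresh awaitable is created on every attempt.
            return await operation()
        except retry_on as exc:
            if attempt == attempts:
                raise  # out of attempts: surface the last error unchanged
            delay = base_delay * 2 ** (attempt - 1)
            print(f"Attempt {attempt} failed ({exc!r}); retrying in {delay:.1f}s")
            await asyncio.sleep(delay)

    raise RuntimeError("unreachable")  # keeps type checkers satisfied
```

In the release script this could wrap the connection step, for example await retry_async(lambda: Tortoise.init(db_url=..., modules=...)). It would not explain a region-wide problem like the one in iad, but it should absorb a single slow or dropped handshake.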
