Edge proxy leaves connection half-open

Hi everyone,

this is an issue I have already spent quite some time debugging.

TL;DR: The Fly edge proxy, which terminates the TLS connection and forwards the request to the app using plain HTTP, leaves a client with a PUT request hanging after the app has closed the connection.

The case:

  • there is an HTTP client which uploads data to an app using a simple PUT request
  • the app responds with a streaming response (using chunked transfer encoding)
  • some event happens which causes the app to terminate the upload prematurely by closing the connection
  • the edge proxy keeps the client connection open and continues to consume data (probably until the buffer is full)
  • the client hangs and doesn’t terminate

I’ve tested this with different variants, one being that the app reports a TCP-level broken pipe error. The effect seems to be the same. Of course, everything works fine locally without an intermediate proxy. It also works fine with a local reverse proxy, which detects the closed server connection and propagates it to the client. It doesn’t work, however, with fly.io’s proxy.

Could there be a problem with how the fly.io edge proxy handles the connection? Did anybody experience similar problems?

Hi @fungs :waving_hand:, could you provide a bit more detail in terms of what the client / server side do? For example,

the app reports a TCP level broken pipe error

Is this one of the errors for the app’s outgoing connections, which in turn results in it closing the incoming connection, or does this come from the incoming side?

  • the edge proxy keeps the client connection open and continues to consume data (probably until the buffer is full)
  • the client hangs and doesn’t terminate

The proxy doesn’t buffer data (or rather, it does before the connection gets sent to the machine, and only up to a certain small size), so I’m not sure how it would keep consuming data without completely stalling the connection instead.

Also, if the client is connecting using HTTP/2, the underlying TCP connection won’t be closed when an H2 stream is closed. I’m not sure that’s what’s happening here, but if the client expects the TCP connection to be closed and not just an HTTP stream, this could be problematic for HTTP/2.

It could also help if we have a series of timestamps on when these happened, along with the app name / domain.

Hi @PeterCxy, I can provide more details. Both the app and the fly proxy understand HTTP/2, but for better debugging, I have pinned everything to HTTP/1.1 for this issue, and used curl for all client-side request simulation. The effect seems to be the same. The error is fully reproducible.

Clarifications regarding your questions:

  1. General scenario summary: the client (curl) uploads data using the HTTP method PUT while receiving a streaming response at the same time (a minimal client-side sketch follows after this list). When the server/app terminates the response irregularly by closing the underlying TCP connection (either a simple close or a broken pipe), this termination event doesn’t propagate to the client. In contrast, if I replace the edge proxy with something like HAProxy locally, that proxy properly closes the client connection once it receives a TCP-level closing event.
  2. In the broken pipe scenario, the outgoing TCP connection from the server/app to the edge proxy is terminated with a broken pipe error. The client (curl) doesn’t receive it and keeps hanging in the upload. I think some 10 MiB or so are still consumed before the data upload stalls. I can’t say where the data is going, but I can tell that the server/app has already closed the TCP connection and therefore isn’t consuming the data.
  3. In the simple TCP close scenario, in which the server/app just closes the TCP connection, the effect is the same as for the broken pipe error.
  4. Concerning HTTP/2, the connection behavior with curl is correct for this protocol version in my tests, but I will verify again once it works for HTTP/1.1. As said, I have pinned everything to HTTP/1.1 at the moment to be sure this isn’t the issue. The client is not per se expecting the connection to be closed at the TCP level; it’s just one of the things that can happen. In most cases, the server/app delivers a full response, which propagates fine through the edge proxy.

Maybe it’s best to jump into a short debugging session together? Otherwise, I can create a run, record the exact timestamps, and send them to you via PM along with the other details.

I did some testing of my own, but was unable to reproduce this. I don’t have the actual app at hand, so all I did was put together something like this (as a Python aiohttp handler):

import asyncio
from aiohttp import web

async def http_handler_bad(request):
    # Read a bit of the request body before responding
    await request.content.read(128)
    response = web.StreamResponse(
        status=200,
        reason='OK',
        headers={'Content-Type': 'text/plain'},
    )

    await response.prepare(request)
    # Read a bit more of the body, then force-close the underlying transport
    await request.content.read(128)

    await asyncio.sleep(5)
    request.transport.close()

When this receives a POST/PUT/.. request with a body, it reads a bit of the request body, sends the response headers, reads a bit more of the body, blocks for 5 seconds, and then closes the underlying transport (connection).
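
For completeness, a minimal harness to serve this handler locally might look like this (route and port are arbitrary):

from aiohttp import web  # same import as above

app = web.Application()
app.router.add_route('*', '/upload', http_handler_bad)  # accept any method, e.g. PUT
web.run_app(app, host='127.0.0.1', port=8000)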

curl-ing this behind fly-proxy with a somewhat large body (a few gigabytes) seems to correctly result in a

< HTTP/1.1 200 OK
< content-type: text/plain
< transfer-encoding: chunked
< date: Wed, 23 Apr 2025 18:34:02 GMT
< server: Fly/e4164f338 (2025-04-23)
< via: 1.1 fly.io, 1.1 fly.io
< fly-request-id: 01JSHYVB91BGJMZXNSKJYM0ESA-yyz
<
* Recv failure: Connection reset by peer
* closing connection #0

Do you think your app does something significantly different from this pattern? (And yes, if you can send over exact time stamps and your app name, it would help a lot)


Hey, your application example looks similar to what my application is doing, in theory. What I can already say, at first glance, is that the client-side curl error is different when run locally.

I just re-tested the broken pipe case via fly-proxy, and it seems to work now. For some reason, it didn’t when I was comparing it in my last debugging session. I apologize, let’s concentrate on the other case!

Here is the output when the server closes the connection locally. The server-side application in that case is HAProxy, which should be fairly standard and ubiquitous.

pv < /dev/zero | curl -v --request PUT -T . --no-buffer --http1.1 'http://localhost:9000/path'

< HTTP/1.1 200 OK
< content-type: text/plain
< access-control-allow-origin: *
< transfer-encoding: chunked
< date: Wed, 23 Apr 2025 20:51:04 GMT
< connection: close
< 
* Done waiting for 100-continue
RESPONSE_CONNECTED
RESPONSE_TRANSFERRING
RESPONSE_PREMATURE_ABORT
* we are done reading and this is set to close, stop send
* Closing connection 0

Here is a most recent curl output for a request via fly-proxy, and including the fly-request-id:

pv < /dev/zero | curl -v --request PUT -T . --no-buffer --http1.1 'https://app-name.fly.dev/path'

< HTTP/1.1 100 Continue
< HTTP/1.1 200 OK
< content-type: text/plain
< access-control-allow-origin: *
< transfer-encoding: chunked
< date: Wed, 23 Apr 2025 21:19:39 GMT
< connection: close
< server: Fly/e660f5c79 (2025-04-22)
< via: 1.1 fly.io, 1.1 fly.io
< fly-request-id: 01JSJ8AJWGNB2DDWC3613F0XZD-fra
< 
RESPONSE_CONNECTED
RESPONSE_TRANSFERRING
RESPONSE_PREMATURE_ABORT

[--> client kept uploading more than 100 MiB, until I terminated it]

I was wondering where the data was going. I first stopped the VM, which closed the connection and terminated the upload. Second, I froze the haproxy process in the VM, which also terminated the upload after about 10 MiB of buffering. So the data still seems to go to the VM somehow, although HAProxy claims to have terminated the connection and even reports a termination state in its log file. Even if this were a bug in all versions of HAProxy, I don’t quite understand why it works locally without an intermediate fly-proxy.

I’m happy to equip you with access to my app to create your own test cases, if you like. In that case, please provide me with a private channel to send over the details.

I also think I might be able to build out your example with HAProxy mid next week. Unfortunately, I will be mostly unavailable from tomorrow to Monday.

That sounds good! If you can put the simple Python test case behind HAProxy, that’ll help us rule out whether it’s something about HAProxy. There might still be some issue with our proxy in this case, but it seems like it’s triggered by something specific in your setup.

This will certainly be helpful if possible. Feel free to send an email to peter at fly dot io for information that can’t be shared publicly on the forum :slight_smile:

just a ping to prevent this topic from closing

Hey,

I took your simple app as a template to do more testing. I think I now have a good understanding of where the issue may be.

HTTP response termination in streaming mode

There are two levels of termination: one is the HTTP protocol level, and the other is the TCP connection level. In my previous example, the server sets the Connection: close response header, and the client reacts to that header once it detects that the response is finished at the HTTP protocol level. The client then stops the ongoing upload and terminates correctly. The end of the streaming response is signaled by a zero-size chunk in the chunked transfer encoding, which can be seen when starting curl with --raw. This works well locally.
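
For reference, this framing looks roughly like the following on the wire (illustrative bytes only, using one of the example messages from the tests below):

# Illustrative only: chunked transfer encoding frames each chunk as
# "<hex size>\r\n<data>\r\n" and ends the body with a zero-size chunk.
last_chunk = b"1D\r\n" + b"Server msg: Close connection\n" + b"\r\n"  # 0x1D = 29 data bytes
terminator = b"0\r\n\r\n"  # the final zero chunk the client waits for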

The issue behind the fly.io proxy is that the final zero chunk does not arrive right after the app has sent it. This makes the client believe that the upload is still expected to continue. Termination then has to be handled at the TCP level, which should not be necessary.

Interestingly, in this setup, if curl receives the final zero chunk in the streaming response but doesn’t receive the close header, it keeps hanging. This may be a curl bug.

TCP level connection termination

TCP-level handling is a little more complicated, because there are several connections involved. If the TCP connection is closed by the app and the client then tries to send more data, the fly.io proxy notices this and terminates the client connection. This only works if the client sends new data; if it keeps the connection up without sending anything, the client just hangs.
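
This matches plain TCP semantics. As a small, self-contained illustration (ordinary Python sockets, unrelated to the fly setup), a sender typically only notices the peer’s close on a later write, not while it is idle:

import socket
import time

# "Server": accept one connection and close it right away.
srv = socket.socket()
srv.bind(("127.0.0.1", 0))
srv.listen(1)

cli = socket.create_connection(srv.getsockname())
conn, _ = srv.accept()
conn.close()                    # peer closes: the client sees a FIN, but no error yet
time.sleep(0.2)

print(cli.send(b"x" * 1024))    # the first send after the close usually still succeeds
time.sleep(0.2)                 # the peer answers that data with an RST
try:
    cli.send(b"y" * 1024)       # only a later send surfaces the closed connection
except OSError as e:            # typically ConnectionResetError or BrokenPipeError
    print("send failed:", e)

This is consistent with the observation that the idle client only notices anything once it writes again.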

Previous TCP-level behavior

The issue I reported was tested with fly proxy e660f5c79 (2025-04-22). HAProxy has an option called httpclose which always adds the close header to the response and is also supposed to terminate the TCP connection. However, the upload still continued endlessly in the described setup. Now, with fly proxy f22fdaf3f (2025-04-30), I cannot reproduce this behavior, but with the reproducible example below, the fly proxy may still consume > 60 MiB of data.

Reproducible example

App

I’ve taken your Python example, expanded it a little, and deployed it as a test app on fly.io.

server.py

  import asyncio
  import os
  from aiohttp import web
  
  async def http_handler(request):
      try:
          response = web.StreamResponse(
              status=200,
              reason='OK',
              headers={
                  'Content-Type': 'text/plain',
                  'Connection': 'close',
              },
          )
  
          block_size = 1024
          await response.prepare(request)
          await response.write(f'Server msg: Read {block_size} bytes of data\n'.encode())
          await request.content.read(block_size)
          await response.write(b'Server msg: Wait before closing\n')
          await asyncio.sleep(5)
          await response.write(b'Server msg: Close connection\n')
          await response.write_eof()  # create zero block in chunked transfer encoding
          return response
      except Exception as e:
          return web.Response(text=f"Server exception: {e}", status=500)
  
  async def create_app():
      app = web.Application()
      app.router.add_put('/upload', http_handler)
      return app
  
  if __name__ == '__main__':
      HOST = os.getenv('HOST', 'localhost')
      PORT = int(os.getenv('PORT', '8000'))
      app = create_app()
      web.run_app(app, host=HOST, port=PORT)

requirements.txt

aiohappyeyeballs==2.4.4
aiohttp==3.10.11
aiosignal==1.3.1
async-timeout==5.0.1
attrs==25.3.0
frozenlist==1.5.0
idna==3.10
multidict==6.1.0
propcache==0.2.0
typing-extensions==4.13.2
yarl==1.15.2

fly.toml (partial, suboptimal)

[build]
  builder = 'paketobuildpacks/builder:base'

[env]
  PORT = '8080'
  HOST = '0.0.0.0'

[processes]
  app = "python server.py"

[http_service]
  internal_port = 8080
  force_https = true
  auto_stop_machines = true
  auto_start_machines = true
  min_machines_running = 0
  processes = ['app']

[[vm]]
  size = 'shared-cpu-1x'

Testing

I deployed the app to ‘5ak6optkvz-debug-flyproxy.fly.dev’ for testing.

Little data

Just type something like hello when the server is reading data.

cat | curl --raw -v --request PUT --http1.1 --upload-file . 'https://5ak6optkvz-debug-flyproxy.fly.dev/upload'
*   Trying 2a09:8280:1::73:7020:0:443...
* TCP_NODELAY set
* Connected to 5ak6optkvz-debug-flyproxy.fly.dev (2a09:8280:1::73:7020:0) port 443 (#0)
* ALPN, offering http/1.1
* successfully set certificate verify locations:
*   CAfile: /etc/ssl/certs/ca-certificates.crt
  CApath: /etc/ssl/certs
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* TLSv1.3 (IN), TLS handshake, Server hello (2):
* TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
* TLSv1.3 (IN), TLS handshake, Certificate (11):
* TLSv1.3 (IN), TLS handshake, CERT verify (15):
* TLSv1.3 (IN), TLS handshake, Finished (20):
* TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):
* TLSv1.3 (OUT), TLS handshake, Finished (20):
* SSL connection using TLSv1.3 / TLS_AES_256_GCM_SHA384
* ALPN, server accepted to use http/1.1
* Server certificate:
*  subject: CN=*.fly.dev
*  start date: Apr 25 23:26:07 2025 GMT
*  expire date: Jul 24 23:26:06 2025 GMT
*  subjectAltName: host "5ak6optkvz-debug-flyproxy.fly.dev" matched cert's "*.fly.dev"
*  issuer: C=US; O=Let's Encrypt; CN=E6
*  SSL certificate verify ok.
> PUT /upload HTTP/1.1
> Host: 5ak6optkvz-debug-flyproxy.fly.dev
> User-Agent: curl/7.68.0
> Accept: */*
> Transfer-Encoding: chunked
> Expect: 100-continue
> 
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
* old SSL session ID is stale, removing
* Mark bundle as not supporting multiuse
< HTTP/1.1 100 Continue
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< content-type: text/plain
< connection: close
< transfer-encoding: chunked
< date: Sat, 03 May 2025 11:02:43 GMT
< server: Fly/f22fdaf3f (2025-04-30)
< via: 1.1 fly.io
< fly-request-id: 01JTAX04BKSVY1KRPG1Z6NX2GK-fra
< 
24
Server msg: Read 1024 bytes of data

hello
20
Server msg: Wait before closing

1D
Server msg: Close connection

0

* we are done reading and this is set to close, stop send
* Closing connection 0
* TLSv1.3 (OUT), TLS alert, close notify (256):

In this case, it just takes about 7 seconds for the client to receive the final zero chunk.

Much data

Stream massive amounts of zeros and measure the uploaded data size using pv.

pv < /dev/zero | curl --raw -v --request PUT --http1.1 --upload-file . 'https://5ak6optkvz-debug-flyproxy.fly.dev/upload'
*   Trying 2a09:8280:1::73:7020:0:443...
* TCP_NODELAY set
* connect to 2a09:8280:1::73:7020:0 port 443 failed: No route to host
*   Trying 66.241.124.24:443...
* TCP_NODELAY set
* Connected to 5ak6optkvz-debug-flyproxy.fly.dev (66.241.124.24) port 443 (#0)
* ALPN, offering http/1.1
* successfully set certificate verify locations:
*   CAfile: /etc/ssl/certs/ca-certificates.crt
  CApath: /etc/ssl/certs
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* TLSv1.3 (IN), TLS handshake, Server hello (2):
* TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
* TLSv1.3 (IN), TLS handshake, Certificate (11):
* TLSv1.3 (IN), TLS handshake, CERT verify (15):
* TLSv1.3 (IN), TLS handshake, Finished (20):
* TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):
* TLSv1.3 (OUT), TLS handshake, Finished (20):
* SSL connection using TLSv1.3 / TLS_AES_256_GCM_SHA384
* ALPN, server accepted to use http/1.1
* Server certificate:
*  subject: CN=*.fly.dev
*  start date: Apr 25 23:26:07 2025 GMT
*  expire date: Jul 24 23:26:06 2025 GMT
*  subjectAltName: host "5ak6optkvz-debug-flyproxy.fly.dev" matched cert's "*.fly.dev"
*  issuer: C=US; O=Let's Encrypt; CN=E6
*  SSL certificate verify ok.
> PUT /upload HTTP/1.1
> Host: 5ak6optkvz-debug-flyproxy.fly.dev
> User-Agent: curl/7.68.0
> Accept: */*
> Transfer-Encoding: chunked
> Expect: 100-continue
> 
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
* old SSL session ID is stale, removing
* Mark bundle as not supporting multiuse
< HTTP/1.1 100 Continue
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< content-type: text/plain
< connection: close
< transfer-encoding: chunked
< date: Sat, 03 May 2025 11:05:16 GMT
< server: Fly/f22fdaf3f (2025-04-30)
< via: 1.1 fly.io
< fly-request-id: 01JTAX4SQE62N8AZR4TTMWWMY6-fra
< 
24
Server msg: Read 1024 bytes of data

20
Server msg: Wait before closing

1D81MiB 0:00:01 [4.78MiB/s] [  <=>                                                                                                                           ]
Server msg: Close connection

09.7MiB 0:00:14 [5.89MiB/s] [                              <=>                                                                                               ]

* we are done reading and this is set to close, stop send
* Closing connection 0
* TLSv1.3 (OUT), TLS alert, close notify (256):
65.2MiB 0:00:15 [4.31MiB/s] [                                <=>

In this case, it seems to consume > 60 MiB after the Close connection message. Because the duration is quite similar to the first case, it seems that there is a 5 to 7 second limit on the consumption.

Summary

I think the issue may be that the fly.io proxy somehow caches or delays the final zero chunk, so that it doesn’t arrive instantly. The problem is that, in the meantime, it keeps accepting data that may go nowhere. There seems to be a hardcoded time limit. I believe this shorter time limit was introduced recently, because I basically observed endless consumption before. It could even be that the fly proxy doesn’t send the final zero chunk before it detects the TCP connection close. With HAProxy I still see that the zero chunk is only sent, and the connection only terminated, if the client tries to upload more data.

The fly.io deployment with the default builder was pretty slow, so I didn’t test removing the close header in the streaming response. I could imagine that the fly proxy may continue to consume even more data if the uploading client does not end the connection itself (we are done reading and this is set to close, stop send). I will check whether this is a possible bug in curl or protocol-compliant behavior. In my understanding, observing the end of the response should also terminate the upload.

You are free to test using my debug deployment of the reproducible example. I will keep it up for a week or so.


@PeterCxy, can you confirm the issue?

I have adjusted the example and deployed it. There are different endpoints to test the closing behavior.

Adjusted example

import asyncio
import os
from aiohttp import web

async def http_handler(request, send_close_header, explicit_tcp_close):
    headers={'Content-Type': 'text/plain'}
    if send_close_header:
        headers['Connection'] = 'close'
    
    try:
        response = web.StreamResponse(
            status=200,
            reason='OK',
            headers=headers,
        )

        block_size = 1024
        seconds_wait = 5
        await response.prepare(request)
        await response.write(f'Server msg: Read {block_size} bytes of data\n'.encode())
        await request.content.read(block_size)
        await response.write(f'Server msg: Wait {seconds_wait} seconds before closing\n'.encode())
        await asyncio.sleep(seconds_wait)
        await response.write(b'Server msg: Close connection\n')
        await response.write_eof()  # create zero block in chunked transfer encoding
        
        # Block shortly for the client to close the connection
        if send_close_header:
            await asyncio.sleep(1)
        
        # Force close the underlying transport
        if explicit_tcp_close:
            request.transport.close()
        
        return response
        
    except Exception as e:
        return web.Response(text=f"Server exception: {e}", status=500)

async def create_app():
    app = web.Application()
    app.router.add_put('/no-close', lambda r: http_handler(r, send_close_header=False, explicit_tcp_close=False))
    app.router.add_put('/client-http-close', lambda r: http_handler(r, send_close_header=True, explicit_tcp_close=False))
    app.router.add_put('/server-tcp-close', lambda r: http_handler(r, send_close_header=False, explicit_tcp_close=True))
    app.router.add_put('/all-close', lambda r: http_handler(r, send_close_header=True, explicit_tcp_close=True))
    return app

if __name__ == '__main__':
    HOST = os.getenv('HOST', 'localhost')
    PORT = int(os.getenv('PORT', '8000'))
    app = create_app()
    web.run_app(app, host=HOST, port=PORT)

Local behavior

Locally, only /no-close causes the client to hang for a little while, because if curl doesn’t receive a close header, it tries to continue with the upload even though it has received a finished response. Interestingly, the app also continues to consume data once the handler returns and the TCP connection goes back into the connection pool. So the example actually consumes over 6 GiB of data before the connection is terminated.

I think this kind of effect explains why the data is still going somewhere when deployed to fly.io.

pv /dev/zero | curl --raw --request PUT --http1.1 --upload-file . 'http://localhost:8000/no-close'
24
Server msg: Read 1024 bytes of data

2a
Server msg: Wait 5 seconds before closing

1d
Server msg: Close connection

0

curl: (55) Send failure: Connection reset by peer

--> consumed 6.20 GiB and terminated

With the closing endpoints, typically between 5 and 10 MiB end up being written to local buffers.
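
As a side note on the keep-alive pooling mentioned above: instead of setting the header by hand, a handler can also opt out of keep-alive itself. A minimal sketch, assuming aiohttp’s StreamResponse.force_close(), which should mark the connection as non-reusable so that it is closed rather than pooled:

# Sketch only: force_close() disables keep-alive for this response, so the
# connection should be closed after the handler finishes instead of being reused.
response = web.StreamResponse(status=200, headers={'Content-Type': 'text/plain'})
response.force_close()
await response.prepare(request)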

Fly.io deployment (5ak6optkvz-debug-flyproxy.fly.dev)

The endpoints behave differently behind the fly.io proxy. The tested proxy version is bbaf6ebad (2025-05-06). It’s hard to pin down issues when the fly proxy version changes over time.

/no-close → limited buffering, no close

After the response is finished, this endpoint keeps consuming about 60 MiB of data and then hangs without consuming more. Since the server keeps the connection open and the client is not instructed to close the connection via a close header, I believe this behavior is OK.

pv /dev/zero | curl --raw --request PUT --http1.1 --upload-file . 'https://5ak6optkvz-debug-flyproxy.fly.dev/no-close'

Server msg: Read 1024 bytes of data

2A
Server msg: Wait 5 seconds before closing

1D
Server msg: Close connection

0

--> client hangs for very long or forever

/client-http-close → delayed zero chunk in response and some buffering

This endpoint works exactly like /no-close, but the client receives the final zero chunk after a delay of about 5 to 7 seconds and after uploading about 60 MiB. I’m not sure what triggers this delay. When the client receives the zero chunk, it then terminates the connection itself. It is strange that the other content in the streaming response arrives instantly while the terminating zero chunk is delayed. This delay leaves room for the upload to continue unwantedly until some other limit kicks in.

pv /dev/zero | curl --raw --request PUT --http1.1 --upload-file . 'https://5ak6optkvz-debug-flyproxy.fly.dev/client-http-close'
2406MiB 0:00:01 [4.96MiB/s] [  <=>                                                                                                                           ]
Server msg: Read 1024 bytes of data

2A
Server msg: Wait 5 seconds before closing

1D
Server msg: Close connection

0

--> consumed 65 MiB and terminated

/server-tcp-close → instant zero chunk in response, but no termination

When the app finishes the response and terminates the TCP connection, the client actually receives the zero chunk, but tries to continue the upload because it is not instructed to close the connection instantly using the close header. I believe that the TCP connection break, which works locally, should be propagated properly by the fly proxy, leading to a ‘Connection reset by peer’ error.

pv /dev/zero | curl --raw --request PUT --http1.1 --upload-file . 'https://5ak6optkvz-debug-flyproxy.fly.dev/server-tcp-close'
24
Server msg: Read 1024 bytes of data

2A
Server msg: Wait 5 seconds before closing

1D
Server msg: Close connection

0

--> client hangs for very long or forever

/all-close → works best

The combination of both the HTTP close header and the TCP connection termination works pretty well with the fly proxy. It seems that the TCP connection termination removes the delay in the zero chunk transmission. Still, if the client ignored the close header for some reason, the same issue as with /server-tcp-close would occur.

pv /dev/zero | curl --raw --request PUT --http1.1 --upload-file . 'https://5ak6optkvz-debug-flyproxy.fly.dev/all-close'
24
Server msg: Read 1024 bytes of data

2A
Server msg: Wait 5 seconds before closing

1D
Server msg: Close connection

0

--> consumed 6 MiB and terminated instantly

Summary

While the buffering has undesired effects, the actual malfunction seems to lie in the handling of the TCP connection break, which is not propagated to the client. However, I would also expect the zero chunk to be delivered without delay, because server applications have good reasons not to close TCP connections after a single request; they are often held in a connection pool for reuse. In that case, the client should not be kept waiting for a response that the server has already finished.

Thank you for the very detailed test cases! I am able to reproduce the issues with delayed zero chunks, but not the /server-tcp-close case:

For me, the client here terminates ~instantly when the server closes the TCP connection:

24
Server msg: Read 1024 bytes of data

2A
Server msg: Wait 5 seconds before closing

1D46MiB 0:00:04 [ 117KiB/s] [   <=>                                         ]
Server msg: Close connection

0

* Send failure: Broken pipe [    <=>                                        ]
* closing connection #0
curl: (553.64MiB 0:00:06 [60.8KiB/s] [     <=>                                Send failure: Broken pipe
3.70MiB 0:00:06 [ 597KiB/s] [      <=>                                      ]

I’ll look into why the terminating zero chunk is delayed in the other test case.

I just ran this several times from two different computers on my network, and via a mobile connection: it always hangs:

curl -v --raw --request PUT --http1.1 --upload-file /dev/zero 'https://5ak6optkvz-debug-flyproxy.fly.dev/server-tcp-close' 
*   Trying 2a09:8280:1::73:7020:0:443...
* TCP_NODELAY set
* Connected to 5ak6optkvz-debug-flyproxy.fly.dev (2a09:8280:1::73:7020:0) port 443 (#0)
* ALPN, offering http/1.1
* successfully set certificate verify locations:
*   CAfile: /etc/ssl/certs/ca-certificates.crt
  CApath: /etc/ssl/certs
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* TLSv1.3 (IN), TLS handshake, Server hello (2):
* TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
* TLSv1.3 (IN), TLS handshake, Certificate (11):
* TLSv1.3 (IN), TLS handshake, CERT verify (15):
* TLSv1.3 (IN), TLS handshake, Finished (20):
* TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):
* TLSv1.3 (OUT), TLS handshake, Finished (20):
* SSL connection using TLSv1.3 / TLS_AES_256_GCM_SHA384
* ALPN, server accepted to use http/1.1
* Server certificate:
*  subject: CN=*.fly.dev
*  start date: Apr 25 23:26:07 2025 GMT
*  expire date: Jul 24 23:26:06 2025 GMT
*  subjectAltName: host "5ak6optkvz-debug-flyproxy.fly.dev" matched cert's "*.fly.dev"
*  issuer: C=US; O=Let's Encrypt; CN=E6
*  SSL certificate verify ok.
> PUT /server-tcp-close HTTP/1.1
> Host: 5ak6optkvz-debug-flyproxy.fly.dev
> User-Agent: curl/7.68.0
> Accept: */*
> Transfer-Encoding: chunked
> Expect: 100-continue
> 
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
* old SSL session ID is stale, removing
* Mark bundle as not supporting multiuse
< HTTP/1.1 100 Continue
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< content-type: text/plain
< transfer-encoding: chunked
< date: Wed, 07 May 2025 14:17:33 GMT
< server: Fly/bbaf6ebad (2025-05-06)
< via: 1.1 fly.io
< fly-request-id: 01JTNHQRDVAWVJST4QNEESZN7V-fra
< 
24
Server msg: Read 1024 bytes of data

2A
Server msg: Wait 5 seconds before closing

1D
Server msg: Close connection

0

curl --version
curl 7.68.0 (x86_64-pc-linux-gnu) libcurl/7.68.0 OpenSSL/1.1.1f zlib/1.2.11 brotli/1.0.7 libidn2/2.2.0 libpsl/0.21.0 (+libidn2/2.2.0) libssh/0.9.3/openssl/zlib nghttp2/1.40.0 librtmp/2.3
Release-Date: 2020-01-08
Protocols: dict file ftp ftps gopher http https imap imaps ldap ldaps pop3 pop3s rtmp rtsp scp sftp smb smbs smtp smtps telnet tftp 
Features: AsynchDNS brotli GSS-API HTTP2 HTTPS-proxy IDN IPv6 Kerberos Largefile libz NTLM NTLM_WB PSL SPNEGO SSL TLS-SRP UnixSockets

Can you run your requests with -H 'flyio-debug: doit'? That’ll give us information about exactly which hosts your request went through, as I can’t seem to reproduce this from my side.

It does seem like Connection: close + chunked EOF is not currently handled correctly. I’m going to look into getting that fixed.

I can do the debug request in approx. 1 hour.


With special debug header:

curl -v --raw --request PUT --http1.1 --upload-file /dev/zero --header 'flyio-debug: doit' 'https://5ak6optkvz-debug-flyproxy.fly.dev/server-tcp-close'
*   Trying 66.241.124.24:443...
* TCP_NODELAY set
* Connected to 5ak6optkvz-debug-flyproxy.fly.dev (66.241.124.24) port 443 (#0)
* ALPN, offering http/1.1
* successfully set certificate verify locations:
*   CAfile: /etc/ssl/certs/ca-certificates.crt
  CApath: /etc/ssl/certs
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* TLSv1.3 (IN), TLS handshake, Server hello (2):
* TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
* TLSv1.3 (IN), TLS handshake, Certificate (11):
* TLSv1.3 (IN), TLS handshake, CERT verify (15):
* TLSv1.3 (IN), TLS handshake, Finished (20):
* TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):
* TLSv1.3 (OUT), TLS handshake, Finished (20):
* SSL connection using TLSv1.3 / TLS_AES_256_GCM_SHA384
* ALPN, server accepted to use http/1.1
* Server certificate:
*  subject: CN=*.fly.dev
*  start date: Apr 25 23:26:07 2025 GMT
*  expire date: Jul 24 23:26:06 2025 GMT
*  subjectAltName: host "5ak6optkvz-debug-flyproxy.fly.dev" matched cert's "*.fly.dev"
*  issuer: C=US; O=Let's Encrypt; CN=E6
*  SSL certificate verify ok.
> PUT /server-tcp-close HTTP/1.1
> Host: 5ak6optkvz-debug-flyproxy.fly.dev
> User-Agent: curl/7.68.0
> Accept: */*
> Transfer-Encoding: chunked
> flyio-debug: doit
> Expect: 100-continue
> 
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
* old SSL session ID is stale, removing
* Mark bundle as not supporting multiuse
< HTTP/1.1 100 Continue
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< content-type: text/plain
< transfer-encoding: chunked
< date: Wed, 07 May 2025 17:12:43 GMT
< server: Fly/bbaf6ebad (2025-05-06)
< via: 1.1 fly.io
< fly-request-id: 01JTNVRCSZDX059RH91VBXJ5DQ-fra
< flyio-debug: {"n":"edge-cf-fra2-ad2e","nr":"fra","ra":"61.8.139.168","rf":"Verbatim","sr":"fra","sdc":"fra1","sid":"4d899947b40148","st":0,"nrtt":0,"bn":"worker-dp-fra1-acf8","mhn":null,"mrtt":null}
< 
24
Server msg: Read 1024 bytes of data

2A
Server msg: Wait 5 seconds before closing

1D
Server msg: Close connection

0


I was able to try this out on the exact server that you hit, but unfortunately I was still unable to reproduce this behavior. Is it possible that your network is somehow dropping TCP RST packets? To me it looks like your client side just never learns that the server has already closed the connection, even though fly-proxy has in fact done so.

That would be really unlikely: I tested on my workstation on every network (wired, Wi-Fi, and a mobile network) and on a different server running Debian. Let me try to check with more machines and look at the TCP packets.

Two more machines in different networks (local and a vserver) also hang, one resolving via IPv4 and one via IPv6. All machines run Linux and curl 7.88.1.

Are you able to tcpdump locally between your machine and the fly.dev IP and see if you receive an RST when the connection is supposed to be closed by the server side?

Hi @fungs, to give you an update here:

  1. The delayed EOF is an artifact of the connection pooling implementation we’re using internally (from hyper). We consider this behavior to be okay since the delay does not seem to be very long, and it is essential to keeping the pool working nicely (for reasons). To ensure the connection is closed in time, the application should close the underlying transport itself, as per RFC:

[… setting connection: close …] in either the request or the response header fields indicates that the sender is going to close the connection after the current request/response is complete (Section 6.6).

  2. In the second case, where your TCP connection stalls after the server side does close the connection: that seems to be a bug on our side caused by our firewall rules. More specifically, it happens due to a race between the FIN triggered by the HTTP-level chunked EOF and the final close() call on the TCP socket. We’ve applied a workaround and you should no longer be seeing this.

Hey @PeterCxy, thanks for posting this update! I didn’t have the time to create the tcpdump, though I was pretty sure that the issue was on fly.io’s infrastructure side.

Delayed EOF

At least for my application, this delay is not a problem: first, because it isn’t critical, and second, because my app can close the connection after sending the response. I also think that the RFC doesn’t say anything about the timing, so there is room for interpretation and custom implementations.

Connection termination

Previous tests working ok

The explanation seems to be in line with my findings, and I can confirm that the demo app with the different test cases behaves better with the fix applied:

  • /no-close and /client-http-close both wait for the end of the streaming response while continuing to upload about 60 MiB. The former gets a broken pipe, whereas the latter closes gracefully client-side.
  • /server-tcp-close and /all-close both close instantly, again the former with a broken pipe error and the latter gracefully client-side.

This applies to the test cases shown above, in which the uploading client constantly keeps pushing data.

Streaming edge case might still be broken

I had already mentioned that a client might not constantly push data in a streaming request. Consider a client that connects, sends some initial data, and then waits for 30 minutes before uploading more streaming data. To simulate a similar case with the demo app, try

time (echo hello; cat) | curl -v --raw --request PUT --http1.1 --upload-file . 'https://5ak6optkvz-debug-flyproxy.fly.dev/all-close'

This edge case seems to work OK if the server sends the EOF and closes the connection, and the client receives the EOF and then terminates the connection itself because it sees the server’s close header.

Now try this example with the endpoint that also sends the EOF and closes the TCP connection, but doesn’t tell the client to terminate after this request.

time (echo hello; cat) | curl -v --raw --request PUT --http1.1 --upload-file . 'https://5ak6optkvz-debug-flyproxy.fly.dev/server-tcp-close'

Here, the client receives both the response and the EOF, but it doesn’t get notified that the TCP connection has already been terminated. If it tries to send more data after 30 minutes (or longer), it gets a broken pipe error.

Next

Thanks again for fixing this and being so responsive! I will now look into the last special case, trying to find out whether this is expected behavior or whether it should be handled differently. I will also test the current release with HAProxy to see whether the problems are also solved for the real-world case.