Early look: PostgreSQL on Fly. We want your opinions.

A few updates to report:

  • New Postgres provisions will now default to version 14.1.
  • Single-node Postgres provisions are now available! This configuration runs without Stolon and should be used for development purposes only.

As always, if you have any feedback or suggestions, please let us know!

Would love to see more feature parity with render.com, specifically downloadable snapshots (zip/tar.gz/etc) in case the dbs need to be moved.

I’ve noticed in the metrics there is always activity even when the DB is idle.

Is this caused by Fly’s monitoring?

I was thinking about

a) how great fly ssh console is, and
b) how some managed database providers have a command that drops you into a database console (like pscale shell)

Have you considered making a fly postgres console command that drops you into a postgres shell, that uses the secrets system to retrieve the database password or something?

@julia Hey there! I went ahead and added a “connect” option to our latest Postgres images.

If you run fly image show --app <postgres-app> to check your current image and then fly image update --app <postgres-app> to move to the latest version, you’ll be able to run something like fly ssh console -C "connect" --app <postgres-app> to jump straight to the PG console.

We do plan to streamline this with a flyctl pg connect command, or something similar. No promises on when, but hopefully soon. :slight_smile:

@pier

That or Stolon, if you’re running the flyio/postgres image. You can dig a little deeper by querying the pg_stat_activity view.
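If it helps, here is a minimal sketch of that kind of query, wrapped in Python. The helper name `summarize_activity` is made up for illustration, and it assumes you already have a DB-API cursor connected to the database (for example from psycopg2). The columns `usename`, `application_name`, and `state` are standard columns of pg_stat_activity.

```python
# Hypothetical helper: groups the sessions in pg_stat_activity so that
# background connections stand out from your app's own connections.
ACTIVITY_QUERY = """
SELECT usename, application_name, state, count(*) AS sessions
FROM pg_stat_activity
GROUP BY usename, application_name, state
ORDER BY sessions DESC
"""


def summarize_activity(cursor):
    """Run the query on an open DB-API cursor and return a list of dicts."""
    cursor.execute(ACTIVITY_QUERY)
    columns = ("usename", "application_name", "state", "sessions")
    return [dict(zip(columns, row)) for row in cursor.fetchall()]
```

Sessions that show up while your app is idle, and that are not owned by your app's user, are a reasonable first place to look.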

Amazing, thanks for the incredibly quick implementation!

I created a development postgres app and I don’t see any metrics on its metrics page except “firecracker memory usage” and “data transfer” – are Postgres metrics not supported for development apps? I couldn’t find anything about it in the docs.

It should include metrics, but it looks like the metrics exporter isn’t getting installed properly. Good catch; we’ll get that fixed soon.

@julia If you bump your image version to v0.0.5, it should now be working as expected. Sorry for the trouble!

If you upgrade flyctl to version 0.0.268, you’ll find that a fly pg connect command is now available:

 fly pg connect --app <postgres-app>

cc:// @julia

Any idea why I’m getting this error when I try fly image update?

$ fly image update --app production-db
? Update `production-db` from flyio/postgres:13.4 v0.0.7 to flyio/postgres:13.5 v0.0.9? Yes
Error etcd has been disabled

@enaia Sorry about that, mind giving it another try?

That didn’t go well

$ fly image update --app production-db
? Update `production-db` from flyio/postgres:13.4 v0.0.7 to flyio/postgres:13.5 v0.0.9? Yes
Release v10 created

You can detach the terminal anytime without stopping the update
Monitoring Deployment

2 desired, 1 placed, 0 healthy, 1 unhealthy [health checks: 3 total]
v10 failed - Failed due to unhealthy allocations
Failed Instances

==> Failure #1

Instance
  ID            = f88b0035  
  Process       =           
  Version       = 10        
  Region        = iad       
  Desired       = run       
  Status        = pending   
  Health Checks = 3 total   
  Restarts      = 0         
  Created       = 21s ago   

Recent Events
TIMESTAMP            TYPE       MESSAGE                 
2021-12-14T22:58:36Z Received   Task received by client 
2021-12-14T22:58:54Z Task Setup Building Task Directory 

Recent Logs
2021-12-14T22:58:54.000 [info] Starting instance
2021-12-14T22:58:54.000 [info] Configuring virtual machine
2021-12-14T22:58:54.000 [info] Pulling container image
2021-12-14T22:58:56.000 [info] Unpacking image
2021-12-14T22:59:02.000 [info] Setting up volume 'pg_data'
2021-12-14T22:59:03.000 [info] Starting virtual machine
2021-12-14T22:59:03.000 [info] Starting init (commit: 7943db6)...
2021-12-14T22:59:03.000 [info] Mounting /dev/vdc at /data
2021-12-14T22:59:03.000 [info] 2021/12/14 22:59:03 listening on [fdaa:0:309a:a7b:ab9:0:30e5:2]:22 (DNS: [fdaa::3]:53)
2021-12-14T22:59:03.000 [info] panic: FLY_ETCD_URL or ETCD_URL are required
2021-12-14T22:59:04.000 [info] Main child exited normally with code: 2
2021-12-14T22:59:04.000 [info] Starting clean up.
2021-12-14T22:59:04.000 [info] Umounting /dev/vdc from /data
***v10 failed - Failed due to unhealthy allocations and deploying as v11 

Troubleshooting guide at https://fly.io/docs/getting-started/troubleshooting/

$ fly status -a production-db --all
App
  Name     = production-db          
  Owner    = enaia                  
  Version  = 10                     
  Status   = running                
  Hostname = production-db.fly.dev  

Deployment Status
  ID          = 01e50fd7-e3d4-8457-2c7a-87fbd10e38a8         
  Version     = v10                                          
  Status      = failed                                       
  Description = Failed due to unhealthy allocations          
  Instances   = 2 desired, 1 placed, 0 healthy, 1 unhealthy  

Instances
ID       PROCESS VERSION REGION DESIRED STATUS            HEALTH CHECKS      RESTARTS CREATED              
f88b0035 app     10 ⇡    iad    run     failed            3 total            0        4m2s ago             
21176624 app     9       iad    stop    complete          3 total, 3 passing 0        8h18m ago            
4faf45da app     9       iad    run     running (replica) 3 total, 3 passing 0        8h18m ago            
e8c6edd9 app     8       iad    stop    failed                               0        8h18m ago            
2e2fa531 app     8       iad    stop    failed                               0        8h18m ago            
9aa4ab49 app     8       iad    stop    failed                               0        8h19m ago            
57d90a7f app     8       iad    stop    failed                               0        8h19m ago            
e41064bc app     8       iad    stop    failed            3 total            0        8h19m ago            
64e284d2 app     8       iad    stop    failed            3 total            0        8h19m ago            
06171c90 app     7       iad    stop    complete          3 total, 3 passing 0        2021-12-02T22:27:08Z 
5d0b0ecf app     7       iad    stop    complete          3 total, 3 passing 0        2021-12-02T22:27:08Z 

There was a lingering environment variable that held this up. It should be good now.

Yep, looks good now. Thanks.

I believe we had a failure earlier today, and I’m wondering if there’s anything to be concerned about, especially given our recent experience (Database reset, 2 days of data lost - #8 by kurt):

$ fly status -a production-db --all
App
  Name     = production-db          
  Owner    = enaia                  
  Version  = 11                     
  Status   = running                
  Hostname = production-db.fly.dev  

Instances
ID       PROCESS VERSION REGION DESIRED STATUS            HEALTH CHECKS                  RESTARTS CREATED              
f74e8482 app     11 ⇡    iad    run     running (replica) 3 total, 3 passing             0        4h24m ago            
e12ac95b app     11 ⇡    iad    stop    failed            3 total, 2 passing, 1 critical 0        2021-12-14T23:06:17Z 
3398f650 app     11 ⇡    iad    run     running (leader)  3 total, 3 passing             0        2021-12-14T23:05:13Z 
f88b0035 app     10      iad    stop    failed            3 total                        0        2021-12-14T22:58:45Z 

$ fly vm status f88b0035 -a production-db
Instance
  ID            = f88b0035              
  Process       =                       
  Version       = 10                    
  Region        = iad                   
  Desired       = stop                  
  Status        = failed                
  Health Checks = 3 total               
  Restarts      = 0                     
  Created       = 2021-12-14T22:58:45Z  

Recent Events
TIMESTAMP            TYPE            MESSAGE                                           
2021-12-14T22:58:36Z Received        Task received by client                           
2021-12-14T22:58:54Z Task Setup      Building Task Directory                           
2021-12-14T22:59:03Z Started         Task started by client                            
2021-12-14T22:59:05Z Terminated      Exit Code: 2                                      
2021-12-14T22:59:05Z Not Restarting  Policy allows no restarts                         
2021-12-14T22:59:05Z Alloc Unhealthy Unhealthy because of failed task                  
2021-12-14T22:59:06Z Killing         Sent interrupt. Waiting 5m0s before force killing 

Checks
ID   SERVICE STATE   OUTPUT 
pg   app     warning        
role app     warning        
vm   app     warning        

Recent Logs

$ fly vm status e12ac95b -a production-db
Instance
  ID            = e12ac95b                        
  Process       =                                 
  Version       = 11                              
  Region        = iad                             
  Desired       = stop                            
  Status        = failed                          
  Health Checks = 3 total, 2 passing, 1 critical  
  Restarts      = 0                               
  Created       = 2021-12-14T23:06:17Z            

Recent Events
TIMESTAMP            TYPE             MESSAGE                                           
2021-12-14T23:06:12Z Received         Task received by client                           
2021-12-14T23:06:30Z Task Setup       Building Task Directory                           
2021-12-14T23:06:40Z Started          Task started by client                            
2021-12-16T17:33:52Z Restart Signaled healthcheck: check "vm" unhealthy                 
2021-12-16T17:33:56Z Terminated       Exit Code: 0                                      
2021-12-16T17:33:56Z Not Restarting   Policy allows no restarts                         
2021-12-16T17:33:56Z Killing          Sent interrupt. Waiting 5m0s before force killing 

Checks
ID   SERVICE STATE    OUTPUT
pg   app     passing  HTTP GET http://172.19.0.66:5500/flycheck/pg: 200 OK Output: "[✓] transactions: read/write (3.73ms)\n[✓] replicationLag: fdaa:0:309a:a7b:ab9:0:30e5:2 is lagging 0s (100ns)\n[✓] connections: 29 used, 3 reserved, 300 max (8.86ms)"
vm   app     critical HTTP GET http://172.19.0.66:5500/flycheck/vm: 500 Internal Server Error Output: "[✓] checkDisk: 9.09 GB (92.9%!)(MISSING) free space on /data/ (977.26µs)\n[✓] checkLoad: load averages: 0.14 0.22 0.25 (412.55µs)\n[✗] memory: system spent 1.03s of the last 10 seconds waiting on memory (54.88µs)\n[✗] cpu: system spent 1.09s of the last 10 seconds waiting on cpu (16.35µs)\n[✓] io: system spent 3.95s of the last 60s waiting on io (14.81µs)"
role app     passing  leader

Recent Logs

That instance seems to be failing the memory-related health checks occasionally. This can happen when there’s not enough RAM for the database, but it might also be fine.
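To see at a glance which entries tripped, the vm check output above can be split into its individual checks. This is only a rough sketch: the function name `failing_checks` is made up, and it assumes every entry keeps the `[✓] name: detail` / `[✗] name: detail` shape seen in that output.

```python
import re

# Matches one entry such as "[✗] memory: system spent 1.03s ... (54.88µs)".
CHECK_LINE = re.compile(r"^\[(?P<mark>[✓✗])\]\s+(?P<name>[^:]+):\s*(?P<detail>.*)$")


def failing_checks(output):
    """Return (name, detail) for every entry marked with ✗."""
    failures = []
    for line in output.splitlines():
        match = CHECK_LINE.match(line.strip())
        if match and match.group("mark") == "✗":
            failures.append((match.group("name"), match.group("detail")))
    return failures


# The vm check output quoted above, with each "\n" as a real newline.
vm_output = (
    "[✓] checkDisk: 9.09 GB (92.9%!)(MISSING) free space on /data/ (977.26µs)\n"
    "[✓] checkLoad: load averages: 0.14 0.22 0.25 (412.55µs)\n"
    "[✗] memory: system spent 1.03s of the last 10 seconds waiting on memory (54.88µs)\n"
    "[✗] cpu: system spent 1.09s of the last 10 seconds waiting on cpu (16.35µs)\n"
    "[✓] io: system spent 3.95s of the last 60s waiting on io (14.81µs)"
)

print([name for name, _ in failing_checks(vm_output)])  # ['memory', 'cpu']
```

For this instance both the memory and cpu entries were marked as failing, while disk, load, and io passed.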

We just tweaked your DB to not restart when those health checks fail.

Just wondering, is there a technical reason for needing the whole fly-replay setup? Why don’t these Postgres replicas in different regions have a thin layer that forwards the SQL to the main instance when it’s a write and sends the result back? That would remove the need for every developer to handle this on their own.

We wrote a bit about this in our Global Postgres announcement. The short answer is that it’s gross and unpredictable to do individual writes to a far away Postgres:

A bit more on “gross” here: you can get your database layer to do this kind of stuff for you directly, using something like pgpool so that the database layer itself knows where to route transactions. But there’s a problem with this: your app doesn’t expect this to happen, and isn’t built to handle it. What you see when you try routing writes at the database connection layer is something like this:

  1. A read query for data, from read replica, perhaps for validation: 0ms. :metal:
  2. A write to the primary, in a different region: 20-400ms. :frowning_with_open_mouth:
  3. A read query to the primary, for consistency, in a different region: 20-400ms. :scream_cat:
  4. More read queries against primary for consistency, in a different region: 20-400ms. :scream:
  5. Maybe another write to the primary: 20-400ms. :dizzy_face:
  6. Repeat :skull_and_crossbones:.

It is much, much faster to ship the whole HTTP request where it needs to be than it is to move the database away from an app instance and route database queries directly. Remember: replay is happening within Fly’s network. HTTP isn’t bouncing back and forth between the user and our edge (that would be slow); it’s happening inside our CDN.
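To put rough numbers on the list above, here is a back-of-the-envelope sketch. It assumes local queries cost about 0ms (as in step 1), that every query sent to the primary pays one cross-region round trip, and that a replayed request pays that round trip once. The function names are made up, and the small cost of the first attempt in the local region is ignored.

```python
def routed_at_db_layer_ms(primary_queries, rtt_ms):
    """Every query sent to the far-away primary pays one cross-region round trip."""
    return primary_queries * rtt_ms


def replayed_request_ms(rtt_ms):
    """The HTTP request crosses regions once; its queries then run next to the primary."""
    return rtt_ms


# Steps 2-5 above: a write, a consistency read, at least one more read, another write.
primary_queries = 4
for rtt_ms in (20, 400):
    print(rtt_ms, routed_at_db_layer_ms(primary_queries, rtt_ms), replayed_request_ms(rtt_ms))
# 20 80 20
# 400 1600 400
```

With four primary queries the gap is already 4x, and it grows with every extra query the request makes after its first write.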

That said, the fly-replay mechanism is designed to be easy to implement as a library. It’s almost magical in Rails and Phoenix; we just haven’t gotten around to building the libraries for other frameworks yet.
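For frameworks without a library, the core idea can be sketched as a small WSGI middleware in Python. Treat this as an illustration under assumptions, not an official implementation: the class name is made up, it assumes the current region is available in `FLY_REGION` and that you set `PRIMARY_REGION` yourself, it decides "is this a write?" purely from the HTTP method, and the 409 status is a choice made for this sketch. The part taken from this thread is the mechanism itself: respond with a `fly-replay` header so the request is replayed in the region that holds the primary.

```python
import os

# Assumption: these methods are treated as writes; everything else as a read.
WRITE_METHODS = {"POST", "PUT", "PATCH", "DELETE"}


class ReplayWritesInPrimary:
    """WSGI middleware sketch: ask for writes to be replayed in the primary region."""

    def __init__(self, app, current_region=None, primary_region=None):
        self.app = app
        # Assumption: FLY_REGION is set by the platform, PRIMARY_REGION by you.
        self.current_region = current_region or os.environ.get("FLY_REGION")
        self.primary_region = primary_region or os.environ.get("PRIMARY_REGION")

    def __call__(self, environ, start_response):
        is_write = environ.get("REQUEST_METHOD", "GET").upper() in WRITE_METHODS
        in_primary = self.current_region == self.primary_region
        if is_write and self.primary_region and not in_primary:
            # Skip the app and the database; hand the request back for replay.
            headers = [
                ("fly-replay", f"region={self.primary_region}"),
                ("Content-Length", "0"),
            ]
            start_response("409 Conflict", headers)
            return [b""]
        return self.app(environ, start_response)
```

You would wrap your app once at startup, for example `application = ReplayWritesInPrimary(application)`. Reads keep hitting the local replica, and a real library would likely also need to deal with writes that happen inside GET requests.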

Which are you using?