Database reset, 2 days of data lost

We have a database cluster set up, and after deploying our app about an hour ago our database was reset to the state it was in over 48 hours ago. Any idea how this could have happened?

Edit: just to be clear, I understand that your Postgres support is in beta.

Can you give us more details on what you’re seeing? The only way I can imagine data reverting on a Postgres setup like this is if the primary node broke somehow, replication to the secondary wasn’t caught up, and a failover then promoted the stale secondary to primary.
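
If you’re able to connect to each node directly, the built-in replication views are a quick way to sanity-check that. This is only a rough sketch using plain Postgres (nothing Fly-specific; column names assume Postgres 10 or newer):

-- on the node you believe is (or was) the primary: which replicas are
-- connected, and how far they have replayed
select client_addr, state, sent_lsn, replay_lsn from pg_stat_replication;

-- on a replica: confirm it is in recovery and when it last replayed a commit
select pg_is_in_recovery(), pg_last_xact_replay_timestamp();

A replay timestamp far in the past while the primary is still taking writes is exactly the stale-replica situation described above.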

You should check the status of your Postgres VMs:

fly status -a <db-name> --all

Do you see any that failed? You can run fly vm status <id> -a <db-name> to possibly see why.

You can also restore snapshots to compare both versions:

fly volumes list -a <db-name>
fly volumes snapshots list <volume-id>
fly pg create --snapshot-id <snapshot-id>

It would be worth comparing snapshots from both your volumes to see if they have the same data.
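
For example, once each snapshot is restored into its own database, a quick check against a table your app writes to often (the table and column names below are just placeholders for whatever your schema uses) would be:

-- run against each restored snapshot and compare the results
select max(inserted_at) from <your-table>;

If the two copies disagree by a day or more, that tells you which volume was lagging.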

$ fly status -a production-db --all
App
  Name     = production-db          
  Owner    = enaia                  
  Version  = 5                      
  Status   = running                
  Hostname = production-db.fly.dev  

Instances
ID       PROCESS VERSION REGION DESIRED STATUS            HEALTH CHECKS                  RESTARTS CREATED              
3eda45a3 app     5       iad    run     running (replica) 3 total, 3 passing             0        1h14m ago            
c11ee22a app     5       iad    run     running (leader)  3 total, 3 passing             0        1h21m ago            
5e662043 app     5       iad    stop    failed            3 total, 1 passing, 2 critical 4        1h38m ago            
df175262 app     5       iad    stop    failed            3 total, 1 passing, 2 critical 4        2h3m ago             
7059760b app     5       iad    stop    failed            3 total, 1 passing, 2 critical 4        2h27m ago            
7d437e7d app     5       iad    stop    failed            3 total, 1 passing, 2 critical 4        2h51m ago            
09ea9763 app     5       iad    stop    failed            3 total, 1 passing, 2 critical 4        3h16m ago            
a8e27476 app     5       iad    stop    failed            3 total, 1 passing, 2 critical 4        3h40m ago            
fc3caa1e app     5       iad    stop    failed            3 total, 1 passing, 2 critical 4        4h5m ago             
53a6f1eb app     5       iad    stop    failed            3 total, 1 passing, 2 critical 4        4h29m ago            
fc86da43 app     5       iad    stop    failed            3 total, 1 passing, 2 critical 4        4h54m ago
[...]
$ fly vm status 5e662043 -a production-db
Instance
  ID            = 5e662043                        
  Process       =                                 
  Version       = 5                               
  Region        = iad                             
  Desired       = stop                            
  Status        = failed                          
  Health Checks = 3 total, 1 passing, 2 critical  
  Restarts      = 4                               
  Created       = 1h44m ago                       

Recent Events
TIMESTAMP            TYPE             MESSAGE                                                         
2021-12-02T19:32:12Z Received         Task received by client                                         
2021-12-02T19:32:12Z Task Setup       Building Task Directory                                         
2021-12-02T19:32:13Z Started          Task started by client                                          
2021-12-02T19:34:00Z Restart Signaled healthcheck: check "pg" unhealthy                               
2021-12-02T19:34:01Z Terminated       Exit Code: 0                                                    
2021-12-02T19:34:02Z Restarting       Task restarting in 1.24606516s                                  
2021-12-02T19:34:08Z Started          Task started by client                                          
2021-12-02T19:35:53Z Restart Signaled healthcheck: check "role" unhealthy                             
2021-12-02T19:35:56Z Terminated       Exit Code: 0                                                    
2021-12-02T19:35:56Z Restarting       Task restarting in 1.006636038s                                 
2021-12-02T19:36:02Z Started          Task started by client                                          
2021-12-02T19:37:50Z Restart Signaled healthcheck: check "role" unhealthy                             
2021-12-02T19:37:52Z Terminated       Exit Code: 0                                                    
2021-12-02T19:37:52Z Restarting       Task restarting in 1.066671546s                                 
2021-12-02T19:37:57Z Started          Task started by client                                          
2021-12-02T19:39:44Z Restart Signaled healthcheck: check "pg" unhealthy                               
2021-12-02T19:39:45Z Terminated       Exit Code: 0                                                    
2021-12-02T19:39:46Z Restarting       Task restarting in 1.1424448s                                   
2021-12-02T19:39:51Z Started          Task started by client                                          
2021-12-02T19:41:37Z Restart Signaled healthcheck: check "pg" unhealthy                               
2021-12-02T19:41:39Z Terminated       Exit Code: 0                                                    
2021-12-02T19:41:40Z Not Restarting   Exceeded allowed attempts 2 in interval 5m0s and mode is "fail" 
2021-12-02T19:41:40Z Killing          Sent interrupt. Waiting 5m0s before force killing               

Checks
ID   SERVICE STATE    OUTPUT                                                                                                                                                                                                                                                                                                                                                                                                 
vm   app     passing  HTTP GET http://172.19.2.146:5500/flycheck/vm: 200 OK Output: "[✓] checkDisk: 9.15 GB (93.6%) free space on /data/ (67.78µs)\n[✓] checkLoad: load averages: 0.00 0.00 0.00 (71.46µs)\n[✓] memory: system spent 0s of the last 60s waiting on memory (42.04µs)\n[✓] cpu: system spent 72ms of the last 60s waiting on cpu (27.69µs)\n[✓] io: system spent 168ms of the last 60s waiting on io (23.3µs)" 
                                                                                                                                                                                                                                                                                                                                                                                                                             
role app     critical failed to connect to local node: context deadline exceeded                                                                                                                                                                                                                                                                                                                                             
pg   app     critical HTTP GET http://172.19.2.146:5500/flycheck/pg: 500 Internal Server Error Output: "failed to connect with local node: context deadline exceeded"                                                                                                                                                                                                                                                        
                                                                                                                                                                                                                                                                                                                                                                                                                             

Recent Logs

That instance is crash-looping, apparently because the health checks can’t connect to Postgres. Judging by the list of failed VMs above, it’s been happening for much longer than a few hours.

Will you please give us more information about what you’re seeing that indicates the data is 48 hours old? And try recovering those snapshots to see if they have more recent data.

Will you please give us more information about what you’re seeing that indicates the data is 48 hours old?

It’s clear from our application data. Our application is in use, and I can tell by looking at timestamps that we lost activity from the last 2 days.

2021-12-02 20:04:35
[...missing data...]
2021-11-30 15:30:52

I’m also not in a position to recover from the snapshots. I’m waiting for a demo to finish, and I have a more recent manual backup that I’m planning to restore when I’m able to.

Which table is that? We’ll have a look at snapshots. Can you look at your manual backup and give us a table + ID to look for that includes data you’re missing?

The most recent snapshots on your DB are 23 hours old, so if one of the nodes was lagging it should show up in those.
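
Even a single recent row we can search for would help, e.g. something along these lines against your manual backup (the table name is a placeholder):

-- grab the newest row so we can check whether it exists in each snapshot
select id, inserted_at from <your-table> order by inserted_at desc limit 1;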

production=# select last_visited from users where last_visited is not null order by last_visited desc limit 20;
    last_visited     
---------------------
 2021-12-02 21:39:22
 2021-12-02 21:29:08
 2021-12-02 21:07:22
 2021-12-02 21:05:53
 2021-12-02 21:04:29
 2021-12-02 20:53:16
 2021-12-02 20:04:35
 2021-11-30 15:30:52
 2021-11-30 02:17:23
 2021-11-29 23:36:44
 2021-11-29 22:38:14
 2021-11-29 22:24:55
 2021-11-29 19:17:44
 2021-11-29 13:40:36

This might be better:

production=# select inserted_at from activities order by inserted_at desc limit 30;
     inserted_at     
---------------------
 2021-12-02 21:17:35
 2021-12-02 21:08:30
 2021-12-02 21:08:21
 2021-12-02 21:08:17
 2021-12-02 21:08:16
 2021-12-02 20:55:13
 2021-12-02 20:55:10
 2021-12-02 20:55:08
 2021-12-02 20:35:07
 2021-12-02 20:11:13
 2021-12-02 20:11:07
 2021-12-02 20:03:36
 2021-11-30 15:55:09
 2021-11-30 15:54:42
 2021-11-30 15:30:19
 2021-11-30 15:24:16
 2021-11-30 15:23:19
 2021-11-30 15:17:34
 2021-11-30 15:17:04
 2021-11-30 14:57:10
 2021-11-30 12:56:11
 2021-11-30 12:56:07
 2021-11-30 12:55:22
 2021-11-30 12:52:06
 2021-11-30 12:51:39
 2021-11-30 02:17:27
 2021-11-30 02:15:38
 2021-11-29 22:26:10
 2021-11-29 22:26:08
 2021-11-29 22:26:00

Ok, it does look like one of the machines hadn’t been replicating properly for quite a while. The most recent updated_at in the good snapshot is 2021-12-01 21:23:34; the most recent updated_at in the other is 2021-11-30 17:32:57. So it seems the node that had been primary failed sometime today, and the out-of-date replica took over.

We’ve changed your database to not restart when health checks fail. This might mean it stops responding, but it should make it obvious what’s wrecking the VMs.