I have a staging replica of my production setup, and after the V2 upgrade neither the db nor the app is running. The proxy and search server in that setup are running. I just tried a deploy that failed because it couldn’t connect to the db server, which appears to be suspended. Is there an obvious fix? My production setup is not yet migrated to V2, so I’m not sure what happened, but it makes me concerned about upgrading production.
Hey, could you paste some of the output of fly config show --app <yourapp> here?
{
  "app": "laspilitas-staging-db",
  "primary_region": "sjc",
  "env": {
    "PRIMARY_REGION": "sjc"
  },
  "mounts": [
    {
      "source": "pg_data",
      "destination": "/data"
    }
  ],
  "checks": {
    "pg": {
      "port": 5500,
      "type": "http",
      "interval": "15s",
      "timeout": "10s",
      "path": "/flycheck/pg"
    },
    "role": {
      "port": 5500,
      "type": "http",
      "interval": "15s",
      "timeout": "10s",
      "path": "/flycheck/role"
    },
    "vm": {
      "port": 5500,
      "type": "http",
      "interval": "10s",
      "timeout": "1m0s",
      "path": "/flycheck/vm"
    }
  },
  "metrics": {
    "port": 9187,
    "path": "/metrics"
  }
}
I haven’t been super active with flyctl lately. I was able to restart it with flyctl machine restart <machine_id>. It might be working now. I tried to start the app, but it is dutifully killing its processes for being out of memory. I guess maybe the default memory sizing changed?
Update: I was able to set the memory to 1024MB and it stopped killing the processes. Seems to be working.
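For the record, this is roughly what I ran; the exact invocation is from memory, so treat it as a sketch rather than exactly what happened:

$ fly machine list -a laspilitas-staging-db           # find the suspended machine's id
$ fly machine restart <machine_id> -a laspilitas-staging-db
$ fly scale memory 1024 -a laspilitas-staging-app     # bump the app VM to 1024MB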
This setup still has machines running in V1. How can I preemptively stop this from happening in my production setup?
I’m not aware of the defaults changing, no. But regardless of the size defaults for database instances, the migration should carry over the same specs from the v1 instances you were running, so that is strange.
Do you know when the migration happened by chance? I believe you’d have gotten an email.
Sorry, I have multiple apps in the same staging “organization” to mirror my production setup. Among these are laspilitas-staging-app (python app) and laspilitas-staging-db (db app). I’m not sure if their issues are related or not.
The layout is like this:
- laspilitas-staging-app: gunicorn+python with a cron job runner
- laspilitas-staging-db: postgresql (fly.io’s integrated postgresql app)
- laspilitas-staging-nginx: proxies to laspilitas-staging-app, serving cached media
- laspilitas-staging-typesense: typesense search engine, used by the python app
So as I understand it, the timeline is like this:
- I received the successful migration email on Jul 18, 2023, 1:54 PM PST for the python app.
- I tried to deploy the python app on July 19 around 9:30 PM PST; this didn’t work because it couldn’t connect to the db app.
- Created this topic to try to understand why the db app had stopped or become suspended.
- Got the db app to restart.
- Got the python app to start, but it had memory set to 256MB or something low.
- Reset the memory to 1024MB and the python app started up.
I’m not sure if the app was already not running before I tried to deploy again. I might have deployed with the old toml file. Is it possible to look up prior status changes?
Seems kind of weird, because I am pretty sure I already had to bump the memory to even get the python app to run, since it is kind of a hog given how I have it set up. Is there a way of checking if that was already set?
Hello Ian, I couldn’t find a record of when the migration of the python app happened. They are marked as migrated by the “Fly Bot” user in the flyctl releases command output.
It seems the machine and the Nomad alloc are competing for a hold on the volume. There are several things we could do to restore the app, but I would like to retry the migration by:
- Terminate the running machine
- Switch the app platform back to Nomad
- Let you run the fly migrate-to-v2 --remote-fork command
As I said, the other option is to kill the Nomad alloc and restart the machine so it picks up the volume and gets going.
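Once the app is back on the Nomad platform, the part you would run yourself looks roughly like this; a sketch that assumes flyctl's usual --app flag works for you here:

$ fly status -a laspilitas-staging-app                 # Platform should read "nomad" again
$ fly migrate-to-v2 --remote-fork -a laspilitas-staging-app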
Sorry for the late response; I didn’t have access to the CLI. Will this be the same flow I would use for my production python app? That is my main concern: how I can migrate it without this happening, or at least in a predictable way.
Also, what is the purpose of the --remote-fork argument?
Thanks.
Yes, the idea is to roll back the migration so you can rerun it on staging and test that it works there the same way it would in production.
You can ignore it; it was a beta feature that was promoted to the default in flyctl v0.1.64.
The flag, now the default, enables the migration to take snapshots and fork volumes across different source and destination hosts. Before, volume forks were local only, within the same host, and could fail due to lack of host capacity.
That sounds great. Thanks, I didn’t understand your strategy at first. I’m assuming you would perform steps 1 and 2, and then, when I’m able, I would perform step 3 and we would see what happens?
I have a fly-staging.toml configuration file that maybe I should update? I had used it for that deploy. It looks like this:
# fly.toml file generated for laspilitas-fly on 2022-06-01T11:12:44-07:00
app = "laspilitas-staging-app"
kill_signal = "SIGINT"
kill_timeout = 5
processes = []
[build]
dockerfile = "Dockerfile"
[env]
RUN_TYPE="web"
PORT = "8080"
IMAGE_DIR="/app/data/files/images"
MOVIE_DIR="/app/data/files/movies"
SITEMAP_DIR="/app/data/files/sitemaps"
BUNDLES_PATH="/app/frontend/bundles"
QUALIFIED_HOST="https://www.***************"
MIGRATION_MODE="1"
TYPESENSE_PORT="8080"
TYPESENSE_HOST="top2.nearest.of.laspilitas-staging-typesense.internal"
TYPESENSE_PROTOCOL="http"
[mounts]
source="laspilitas_staging_app_data"
destination="/app/data"
The production fly.toml file is pretty much the same, just with the env vars changed. So I’d have to update it in exactly the same way, or it would muddle things.
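For context, the deploy I ran against staging was along these lines (flags from memory, so a sketch rather than the exact command):

$ fly deploy --config fly-staging.toml --app laspilitas-staging-app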
Yes, you are right there. I will do (1) and (2) now for laspilitas-staging-app. Once I roll back the app to the Nomad platform, you try to deploy it on Nomad and we double-check that it works before starting the migration to machines.
Looks like it is up, can you check?
NOTE: you can ignore the “Status = suspended” line.
$ fly status
App
Name = laspilitas-staging-app
Owner = laspilitas-staging
Version = 73
Status = suspended
Hostname = laspilitas-staging-app.fly.dev
Platform = nomad
Instances
ID        PROCESS  VERSION  REGION  DESIRED  STATUS   HEALTH CHECKS  RESTARTS  CREATED
d0fcc6a6  app      73       sjc     run      running                 0         1m28s ago
$ fly vol list
ID                    STATE    NAME                         SIZE  REGION  ZONE  ENCRYPTED  ATTACHED VM  CREATED AT
vol_3xme149qx2ovowpl  created  laspilitas_staging_app_data  20GB  sjc     c0a5  true       d0fcc6a6     1 year ago
Yes, the staging setup seems to be working now.
Edit: based on checking the external domain in a browser and also using flyctl ssh console -a laspilitas-staging-app and printing environment variables.
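Roughly what I did to check, for the record:

$ fly ssh console -a laspilitas-staging-app
# then, inside the VM:
printenv | sort      # the env vars from fly-staging.toml should all be present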
Awesome! Please ensure you are running flyctl v0.1.64 and only then run fly migrate-to-v2.
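To check the CLI version (the install script is one way to upgrade if you are behind; adjust for however you originally installed flyctl):

$ fly version                              # should report v0.1.64 or later
$ curl -L https://fly.io/install.sh | sh   # reinstalls/upgrades flyctl if needed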
Thanks, this also seems to be working. How can I see how much memory is allocated?
Also, how would the db migration interact with this migration? Would I use the same flow?
Great!
fly scale show or fly machine list should give you details about the machines.
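For example (app name taken from your staging setup):

$ fly scale show -a laspilitas-staging-app     # VM size, CPU and memory per process group
$ fly machine list -a laspilitas-staging-app   # each machine's id, region, state and image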
Yes, but there is no easy way to roll staging-db back to the Nomad platform.
I can only recommend you take a backup of your production database and run:
mkdir prod-db
cd prod-db
fly config save -a laspilitas-production-db
fly migrate-to-v2
For multi-node Postgres clusters the migration won’t incur downtime.
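One way to take that backup first is to proxy the Postgres port locally and run pg_dump against it; a sketch, where the user and database name are assumptions you should adjust for your cluster:

$ fly proxy 15432:5432 -a laspilitas-production-db      # forward Postgres to localhost:15432
# in another terminal (it will prompt for the password):
$ pg_dump -h localhost -p 15432 -U postgres -d postgres > prod-db-backup.sql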
Just to follow up on this.
staging setup
Oddly enough, the python machine now has 2048MB. So maybe that is what it was set to originally, because earlier, when it was going OOM, I changed it from 256MB (not sure where that came from) to 1024MB, before rolling it back and trying the manual method you suggested.
production setup
I have performed the migration for my 4 apps (nginx, typesense, gunicorn+python, and postgresql) in my “production” organization. Everything seems to be working. It’s kind of odd that I have 2 db instances though, because I don’t ever remember having more than one, but the volumes were created 10 months ago. I think I need to scale either down or up and also upgrade the postgresql image. I need to read up on all this because so much has changed, but can failover work with just 2 instances/machines?