Staging hobby plan failing after V2 upgrade.

I have a staging replica of my production setup, and after the V2 upgrade both the db and the app are not running. The proxy and search server in that setup are still running. I just tried a deploy that failed because it couldn’t connect to the db server, which appears to be suspended. Is there an obvious fix? My production setup is not yet migrated to V2, so I’m not sure what happened, but it makes me concerned about upgrading the production setup.

Hey, could you paste some of the output of fly config show --app <yourapp> here?

{
  "app": "laspilitas-staging-db",
  "primary_region": "sjc",
  "env": {
    "PRIMARY_REGION": "sjc"
  },
  "mounts": [
    {
      "source": "pg_data",
      "destination": "/data"
    }
  ],
  "checks": {
    "pg": {
      "port": 5500,
      "type": "http",
      "interval": "15s",
      "timeout": "10s",
      "path": "/flycheck/pg"
    },
    "role": {
      "port": 5500,
      "type": "http",
      "interval": "15s",
      "timeout": "10s",
      "path": "/flycheck/role"
    },
    "vm": {
      "port": 5500,
      "type": "http",
      "interval": "10s",
      "timeout": "1m0s",
      "path": "/flycheck/vm"
    }
  },
  "metrics": {
    "port": 9187,
    "path": "/metrics"
  }
}

I haven’t been super active with flyctl lately. I was able to restart it with flyctl machine restart <machine_id>. It might be working now. I tried to start the app, but it is dutifully killing its processes off for being out of memory. I guess maybe the default memory sizing changed?

update: I was able to set the memory to 1024MB and it stopped killing all the processes off. Seems to be working.
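
For reference, this is roughly what I ran to get there (the machine id is a placeholder, and I’m assuming fly scale memory is still the right way to pin the app’s memory):

$ fly machine restart <machine_id> -a laspilitas-staging-db
$ fly scale memory 1024 -a laspilitas-staging-app
$ fly scale show -a laspilitas-staging-app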

This setup still has machines running in V1. How can I preemptively stop this from happening in my production setup?

I’m not aware of the defaults changing, no, but regardless of the size defaults for database instances, the migration should carry over the same specs from the v1 instances you were running, so that is strange.

Do you know when the migration happened by chance? I believe you’d have gotten an email.

Sorry, I have multiple apps in the same staging “organization” to mirror my production setup. Among these are laspilitas-staging-app (python app) and laspilitas-staging-db (db app). I’m not sure if their issues are related or not.

The layout is like this:

  • laspilitas-staging-app: gunicorn+python with a cron job runner
  • laspilitas-staging-db: postgresql (fly.io’s integrated postgresql app)
  • laspilitas-staging-nginx: proxies to laspilitas-staging-app, serving cached media
  • laspilitas-staging-typesense: typesense search engine, used by python app

So as I understand it, the timeline is like this:

  • I received the successful migration email on Jul 18, 2023 at 1:54 PM PST for the python app
  • I tried to deploy the python app on July 19 around 9:30 PM PST; this didn’t work because it couldn’t connect to the db app
  • Created this topic to try to understand why the db app had stopped or become suspended
  • Got the db app to restart.
  • Got the python app to start, but then it had memory set to 256MB or something low.
  • Reset memory to 1024MB and the python app started up.

I’m not sure if the app was already not running before I tried to deploy again. I might have deployed with the old toml file. Is it possible to look up prior status changes?

Seems kind of weird, because I am pretty sure I already had to bump the memory to even get the python app to run, since it is kind of a hog given how I have it set up. Is there a way of checking if that was already set?

Hello Ian, I couldn’t find traces of when the migration of the python app happened. The releases are marked as migrated by the “Fly Bot” user in the flyctl releases command output.

It seems the machine and the nomad alloc are competing to get hold of the volume. We can do many things to restore the app, but I would like to try the migration again by:

  1. Terminating the running machine
  2. Switching the app platform back to Nomad
  3. Letting you run the fly migrate-to-v2 --remote-fork command

As I said, the other option is to kill the nomad alloc and restart the machine so it picks up the volume and gets going.
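
From your side, step 3 would look roughly like this once I’ve flipped the app back to Nomad (just a sketch; it assumes you run it from a directory containing the app’s fly.toml):

$ fly status -a laspilitas-staging-app   # Platform should read "nomad" after the rollback
$ fly migrate-to-v2                      # run from the directory with the app's fly.toml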

Sorry for the late response, I didn’t have access to the CLI. Will this be the same flow I would use for my production python app? That is my main concern: how I can migrate that without this happening, or at least in a predictable way.

Also, what is the purpose of the --remote-fork argument?

Thanks.

Yes, the idea is to roll back the migration so you can rerun it on staging and test that it works there the same way it would in production.

You can ignore it. It was a beta feature that was promoted to the default in flyctl v0.1.64.
The flag, now the default, enables the migration to take snapshots and fork volumes across different source and destination hosts. Before, volume forks were only local, within the same host, and could fail due to lack of host capacity.
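
In practice the only difference is whether you pass the flag explicitly; on flyctl v0.1.64 or newer the two commands below behave the same:

$ fly migrate-to-v2 --remote-fork   # pre-v0.1.64: opt in to cross-host volume forks
$ fly migrate-to-v2                 # v0.1.64+: remote forks are the default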

That sounds great. Thanks, I didn’t understand your strategy at first. I’m assuming you would perform steps 1 and 2, and then when I can I would perform step 3 and we would see what happens?

I have a fly-staging.toml configuration file that maybe I should update? I had used it for that deploy. It looks like this:

# fly.toml file generated for laspilitas-fly on 2022-06-01T11:12:44-07:00

app = "laspilitas-staging-app"

kill_signal = "SIGINT"
kill_timeout = 5
processes = []

[build]
  dockerfile = "Dockerfile"

[env]
  RUN_TYPE="web"
  PORT = "8080"
  IMAGE_DIR="/app/data/files/images"
  MOVIE_DIR="/app/data/files/movies"
  SITEMAP_DIR="/app/data/files/sitemaps"
  BUNDLES_PATH="/app/frontend/bundles"
  QUALIFIED_HOST="https://www.***************"
  MIGRATION_MODE="1"
  TYPESENSE_PORT="8080"
  TYPESENSE_HOST="top2.nearest.of.laspilitas-staging-typesense.internal"
  TYPESENSE_PROTOCOL="http"
[mounts]
  source="laspilitas_staging_app_data"
  destination="/app/data"

The production fly.toml file is pretty much the same file with the env vars changed, so I’d have to update it exactly the same way or it would muddle things.
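
For what it’s worth, this is how I was planning to deploy with that file once the app is back on Nomad (a sketch; I’m assuming the -c/--config flag is still how you point flyctl at a non-default config file):

$ fly deploy -c fly-staging.toml -a laspilitas-staging-app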

Yes, you are right there. I will do (1) and (2) now for laspilitas-staging-app.

Once I roll back the app to the Nomad platform, you try to deploy it on Nomad, and we double-check that it works before starting the migration to machines.

Looks like it is up, can you check?

NOTE: you can ignore the “Status = suspended” line.

$ fly status
App
  Name     = laspilitas-staging-app
  Owner    = laspilitas-staging
  Version  = 73
  Status   = suspended
  Hostname = laspilitas-staging-app.fly.dev
  Platform = nomad

Instances
ID              PROCESS VERSION REGION  DESIRED STATUS  HEALTH CHECKS   RESTARTS        CREATED
d0fcc6a6        app     73      sjc     run     running                 0               1m28s ago

$ fly vol list
ID                      STATE   NAME                            SIZE    REGION  ZONE    ENCRYPTED       ATTACHED VM     CREATED AT
vol_3xme149qx2ovowpl    created laspilitas_staging_app_data     20GB    sjc     c0a5    true            d0fcc6a6        1 year ago

Yes, the staging setup seems to be working now.

Edit: based on checking the external domain in a browser and also using flyctl ssh console -a laspilitas-staging-app and printing environment variables.
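
Roughly like this (the env var names are just the ones from my fly-staging.toml):

$ fly ssh console -a laspilitas-staging-app -C "printenv TYPESENSE_HOST QUALIFIED_HOST"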

Awesome! Please ensure you are running flyctl v0.1.64 and only then run fly migrate-to-v2 :crossed_fingers:
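
Something like this should do it (a sketch; the upgrade step assumes you installed flyctl with the install script):

$ fly version                              # should be v0.1.64 or newer
$ curl -L https://fly.io/install.sh | sh   # upgrade flyctl if it is older
$ fly migrate-to-v2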

Thanks, this also seems to be working. How can I see how much memory is allocated?

Although how would the db migration interact with this migration? Would I use the same flow?

Great!

fly scale show or fly machine list should give you details about the machines.
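
For example, against your staging app:

$ fly scale show -a laspilitas-staging-app
$ fly machine list -a laspilitas-staging-app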

Yes, but there is no easy way of rolling back staging-db to the Nomad platform.

I can only recommend you take a backup of your production database and run:

mkdir prod-db
cd prod-db
fly config save -a laspilitas-production-db
fly migrate-to-v2

For multi-node postgres clusters the migration won’t incur downtime.
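
If you want a logical backup first, one way is to proxy a local port to the postgres app and dump through it (a sketch; it assumes you still have the postgres password from when the cluster was created and that pg_dump is installed locally; the password and database name are placeholders):

$ fly proxy 5433:5432 -a laspilitas-production-db        # leave this running in one terminal
$ pg_dump "postgres://postgres:<password>@localhost:5433/<dbname>" > prod-backup.sql   # run in another terminal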

Just to follow up on this.

staging setup

Oddly enough, the python machine now has 2048MB. So maybe that is what it was set to originally, because earlier, when it was going OOM, I changed it from 256MB (not sure where that came from) to 1024MB before rolling it back and trying the manual method you suggested.

production setup

I have performed the migration for my 4 apps (nginx, typesense, gunicorn+python, and postgresql) in my “production” organization. Everything seems to be working. It’s kind of odd that I have 2 db instances, though, because I don’t ever remember having more than one, but the volumes were created 10 months ago. I think I need to scale either down or up and also upgrade the postgresql image. I need to read up on all this because so much has changed, but can a failover work with just 2 instances/machines?