Guide: Elixir App v1 to v2 migration (It was easy)

Continuing the discussion from deployments failing. cc @jpramassini

Existing App

  • Pretty typical SaaS product built on Phoenix (API + backend) and React (SPA frontend)
  • Amazon RDS database proxied through PgBouncer + WireGuard, using the setup from fly-apps/rds-connector on GitHub (a trivial Terraform example for a WireGuard peer to RDS).
    • I don’t know how to use Terraform, so I just made the resources by hand. Maybe AWS CDK scripting is possible for this?
    • Works very well; latency is good if you’re in the same region
  • Single domain pointed by A and AAAA records on Route53
  • Deployed on Apps v1 (Nomad), which has been having issues

End State

  • Exactly the same, except powered by machines

Process

Automatic migration is not yet available. Even if it were, I’d only use it if it created a completely separate app and then allowed me to test / cut over on my own. I have no reason to believe it would work otherwise, though.

  1. Followed the instructions on the Fly Docs page “Run a Machines App Using flyctl” to create a machine for my app. (Remember, I already had the Dockerfile and knew it was set up correctly.)
  2. Allocated IPs as per above post, and executed the run command.
  3. Generated a new WireGuard config and deployed it to my EC2 instance. This is because I took the opportunity to move my app to a real organization, and WireGuard configs are organization-specific. Likely doesn’t affect most people.
  4. Grabbed all secrets using env on the v1 machine, then used pbpaste | fly secrets import to import them into the new app. This ensured the secrets were neither written to my machine nor shown in my console.
  5. Once app was running, I cloned the machine (follow instructions in scale docs) to get multi-instance. Plus, I set the CPU + memory to what I wanted.
  6. Debugged using the fly-specific dev domain. This meant I could test it without touching my main app.
  7. Found bug with POST requests and disabled [[statics]] as a workaround. Edit: Just read this is fixed.
  8. Created DNS entries for new.myapp.io and allocated HTTPS cert. This was to make sure everything was working from DNS perspective.
  9. Switched my fly.toml over to the new app name and pushed to GitHub, where my Actions deploy the code (this verifies the deploy pipeline works)
  10. At night, switched over my old app.myapp.io to app-old.myapp.io and moved new.myapp.io to app.myapp.io
  11. Leaving old app alone for a few days to make sure DNS changes propagate. Will decommission once I stop seeing web requests.
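
Step 4 copies the full output of env, which also contains variables that the platform and the shell set on their own (FLY_REGION, PATH, and so on). A minimal sketch of a filter that could sit between pbpaste and fly secrets import follows. It is not what was actually run in the steps above: filter_secrets and its list of excluded names are hypothetical, and it assumes every secret fits on a single NAME=VALUE line.

```shell
# Hypothetical helper: keep only NAME=VALUE lines that look like app secrets.
# The excluded names are assumptions; adjust the list for your own app.
filter_secrets() {
  # First grep: keep well-formed NAME=VALUE lines only.
  # Second grep: drop platform-set (FLY_*) and common shell variables.
  grep -E '^[A-Za-z_][A-Za-z0-9_]*=' \
    | grep -Ev '^(FLY_[A-Za-z0-9_]*|PATH|HOME|PWD|SHLVL|TERM|HOSTNAME|_)='
}
```

Usage would look like pbpaste | filter_secrets | fly secrets import -a NEW_APP, where NEW_APP is a placeholder for the new app's name. Multi-line values would be mangled by this filter, so check the result before relying on it.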

Verdict

Very easy to do, besides the bug that threw me for a few hours. I think you should ABSOLUTELY do this if you’re running on v1. Fly has publicly stated that the v1 platform is not the future and that you should move.

Thanks for taking the time to write up your experience, this should be really helpful to us! Will try to add on here with our own once we’ve done it as well.

Looks like this was just added to the docs, for anybody who ends up on this thread in search of similar info: “Migrate an Existing Fly App to Apps V2” in the Fly Docs.

Just curious, since you’re using an Elixir app: I have a single VM in each of 5 regions, and the startup time is about 40 seconds. I’m getting about a minute or so of downtime on a deploy. Are you having any issues with the rolling deploys?

All of my VMs are in one region (at least right now). Mainly for DB latency.

Looks like each VM takes ~9 seconds to deploy. So 5 in 40s would align.

I’m not sure why that would result in downtime though. Does it roll in each region at the same time, or one region after another? I don’t know your requirements, but I’d lean towards 2 machines per region rather than 1 (even if that meant 3 regions instead of 5).

The startup time for each Elixir app is roughly 30+ seconds, so having them all deploy in 40 seconds seems to be causing issues, as it leaves some gaps for the proxy to re-balance. I’ve also tried having multiple machines in the same region, but that didn’t work either.

Are you having any downtime on deploy, and how long does it take for Elixir to come up?

Thank you, @sb8244! Case studies like this are a great resource and are in no way diminished by the publication of a generic guide in docs! I took your suggestion (and stole @jsierles’ script) and added an easier way to move secrets over, leveraging fly secrets import.

@jpramassini Would love to read how your migration goes, if you feel inclined to share.

My understanding is that a rolling deploy won’t start the next machine until the health check passes on the current one. As such, there is no downtime during deployment for me. That said, I don’t know how this works multi-region. I believe it should still roll 1-by-1 because Machines seems to take a “no surprises” approach.

Do you have TCP / Health checks enabled for your app? If no, I could see that causing the issue because the machine will restart within a few seconds and then Fly will think it’s ready to go. With a health check, it should wait until it’s actually ready to go.

BTW: Make sure you have your app set to rolling and not immediate deployment.
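
To make the behavior I'd expect from a rolling deploy concrete, here is a minimal sketch of a health-gated rollout in plain shell. It is not how flyctl is implemented. The names rolling_update, wait_until_healthy, ROLL_ATTEMPTS, and ROLL_DELAY are hypothetical, and the update and check commands are whatever you pass in.

```shell
# Retry a check command until it succeeds or the attempts run out.
# Usage: wait_until_healthy ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
wait_until_healthy() {
  attempts="$1"; delay="$2"; shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then return 0; fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Update machines one at a time; only move on once the current one is healthy.
# Usage: rolling_update UPDATE_CMD CHECK_CMD MACHINE_ID...
rolling_update() {
  update_cmd="$1"; check_cmd="$2"; shift 2
  for machine in "$@"; do
    "$update_cmd" "$machine" || return 1
    wait_until_healthy "${ROLL_ATTEMPTS:-10}" "${ROLL_DELAY:-1}" \
      "$check_cmd" "$machine" || return 1
  done
}
```

If the check never runs, or passes before the app has actually booted, the loop moves straight on to the next machine. That would produce exactly the kind of gap described above.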

Is it possible to scale the machine before I deploy?
I am asking because our app cannot start up fully with the default RAM that is available.

Well, I did “fly deploy” to deploy a single machine into one region, then “fly m update --memory 1024 --select”, and then “fly m clone --select --region new_region”.
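
Gathered into one place, that sequence looks roughly like the sketch below. The wrapper function deploy_then_clone is hypothetical (I ran the commands by hand), the region argument stands in for new_region, and --select still prompts you to pick a machine interactively.

```shell
# Hypothetical wrapper around the three commands quoted above.
# Usage: deploy_then_clone REGION [MEMORY_MB]
deploy_then_clone() {
  region="$1"
  memory="${2:-1024}"
  # Deploy a single machine with the default size.
  fly deploy || return 1
  # Pick that machine at the prompt and raise its memory.
  fly m update --memory "$memory" --select || return 1
  # Pick it again and copy it into another region.
  fly m clone --select --region "$region"
}
```

Note that the function stops at the first failing command, so if the initial deploy itself fails, the memory update is never reached.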

Yeah, but for me deploy fails because the app can’t start due to going OOM before I could scale it.

This is weird, because I would expect the health checks to be continual.

fly checks list -a 

  NAME                     | STATUS  | MACHINE        | LAST UPDATED | OUTPUT   
---------------------------*---------*----------------*--------------*----------
  servicecheck-00-tcp-4000 | passing | 48e293c701489 | 5h19m ago    | Success  
---------------------------*---------*----------------*--------------*----------
  servicecheck-00-tcp-4000 | passing | 17811694b2273 | 5h20m ago    | Success  
---------------------------*---------*----------------*--------------*----------
  servicecheck-00-tcp-4000 | passing | 178195ef940484 | 5h20m ago    | Success  
---------------------------*---------*----------------*--------------*----------
  servicecheck-00-tcp-4000 | passing | 32871e1f303672 | 5h20m ago    | Success  
---------------------------*---------*----------------*--------------*----------
  servicecheck-00-tcp-4000 | passing | 568362dc7e448e | 5h20m ago    | Success  
---------------------------*---------*----------------*--------------*----------

And it is a rolling deploy, but the deploy completes before the health check even runs once.

0.6.24-phoenix-upgrade: digest: sha256:01650dd9794327605       
image size: 927 MB
Deploying stage-v2 app with rolling strategy
 
  Machine xxx [app] update finished: success
  Machine xxx [app] update finished: success
  Finished deploying
Cleaning up.

Adding a grace period didn’t help.

  [[services.tcp_checks]]
    grace_period = "25s"
    interval = "15s"
    restart_limit = 5
    timeout = "2s"

@sb8244 - can you post your config? Maybe I missed a setting somewhere.

Yah, so scale it after it OOMs? Mine must have OOM’d about 10 times before I realized. Eventually, it starts with the right amount of memory.

That is what I am trying, but I keep getting “Error no config changes found”.
If I do status on the machine, I see this:

flyctl machine status 148ededf1e5289 -a APP_NAME
Machine ID: 148ededf1e5289
Instance ID: 01GWV0J59MZ3GJ749VV20AJJE4
State: started

VM
  ID            = 148ededf1e5289                                           
  Instance ID   = 01GWV0J59MZ3GJ749VV20AJJE4                               
  State         = started                                                  
  Image         = APP_NAME:deployment-01GWV0GRXBQJNYV8V4WT7PJ5J9  
  Name          = long-feather-2800                                        
  Private IP    = fdaa:0:5b60:a7b:f0f:c32f:c013:2                          
  Region        = sin                                                      
  Process Group = app                                                      
  CPU Kind      = shared                                                   
  vCPUs         = 1                                                        
  Memory        = 256                                                      
  Created       = 2023-03-31T04:59:37Z                                     
  Updated       = 2023-03-31T05:01:57Z                                     
  Command       =                                                          
  Volume        = vol_0enxv309o0xv8okp                             

And when I try to scale it:

flyctl machine update 148ededf1e5289 -a APP_NAME -s shared-cpu-2x
Searching for image 'registry.fly.io/APP_NAME:deployment-01GWV0GRXBQJNYV8V4WT7PJ5J9@sha256:f48d9bc6b0556eabeb95a698cc2119af50e2c7316cef873b91f89b97a6b30bfe' remotely...
image found: img_e1zd4m9dkklv02yw
Image: registry.fly.io/APP_NAME-v2:deployment-01GWV0GRXBQJNYV8V4WT7PJ5J9
Image size: 151 MB

Error no config changes found

I get the same if I try to change the memory directly.

This is the exact command I ran. Give it a try.

fly m update -a APP_NAME --memory 1024 --select

Thanks, same result:

fly m update -a APP_NAME --memory 1024 --select
? Select a machine: 148ededf1e5289 long-feather-2800 (stopped, region sin, process group 'app')
Searching for image 'registry.fly.io/APP_NAME:deployment-01GWV0GRXBQJNYV8V4WT7PJ5J9@sha256:f48d9bc6b0556eabeb95a698cc2119af50e2c7316cef873b91f89b97a6b30bfe' remotely...
image found: img_e1zd4m9dkklv02yw
Image: registry.fly.io/APP_NAME:deployment-01GWV0GRXBQJNYV8V4WT7PJ5J9
Image size: 151 MB

Error no config changes found

Perhaps stop it and then try it? Wonder if this is a cli bug.

Edit: Or perhaps delete and start from scratch? Seems weird.

Edit2: Or pick another value for memory like 512 and see if that works.

The machine is stopped.
I did start from scratch, I tried the same thing yesterday.
I tried with 512 as you suggested, same result.

Try with another region? Maybe SIN is busted?

$ fly m update --memory 512 --select
Update available 0.0.499 → v0.0.500.
Run "flyctl version update" to upgrade.
? Select a machine: 328735ec33d598 silent-sky-9817 (started, region ord, process group 'app')
Searching for image 'registry.fly.io/consul-test:deployment-01GWQ6SXWXH3KGCH2MWY8N6WWJ@sha256:a0ec7e0e0f65b6e018667aa8cbfcd3adf1cf3ba4f8725d934c94027a646cbe26' remotely...
image found: img_3xdk4xy6gm7pgo0e
Image: registry.fly.io/consul-test:deployment-01GWQ6SXWXH3KGCH2MWY8N6WWJ
Image size: 68 MB

Configuration changes to be applied to machine: 328735ec33d598 (silent-sky-9817)

      ... // 3 identical lines
              },
              "init": {},
-             "image": "registry.fly.io/consul-test:deployment-01GWQ6SXWXH3KGCH2MWY8N6WWJ@sha256:a0ec7e0e0f65b6e018667aa8cbfcd3adf1cf3ba4f8725d934c94027a646cbe26",
+             "image": "registry.fly.io/consul-test:deployment-01GWQ6SXWXH3KGCH2MWY8N6WWJ",
              "metadata": {
                      "fly_platform_version": "v2",
      ... // 2 identical lines
                      "fly_release_version": "37"
              },
-             "restart": {},
+             "restart": {
+                     "policy": "always"
+             },
              "services": [
                      {
      ... // 26 identical lines
                      "cpu_kind": "shared",
                      "cpus": 1,
-                     "memory_mb": 256
-             }
+                     "memory_mb": 512
+             },
+             "dns": {}
      }

? Apply changes? (y/N)

@tj1 Looks like you found a bug where flyctl was only waiting on top-level checks and not service checks during a rolling deploy. Sorry about that! @dangra says “a fix is in the oven”! See superfly/flyctl Pull Request #1979 on GitHub, “Also wait for service checks when updating machines”, by dangra.
