Progress update on scaling a Rails Application

Purpose

The purpose of this post is to start a conversation. Warning: this is a long post.

I have an app that I believe to be a good fit for fly.io. I’ve deployed a single instance of it. I have a good idea of how to scale it.

What I want to do is to use what I have learned to make fly.io even more approachable and the developer experience even better.

To drive this discussion, I’m going to alternate between describing a small demo app and the app that I want to deploy. Both the demo app and my app are Rails apps, but many of the items that are worth discussion are not framework specific.

Let’s start with the demo:

Rails Visitor Counter (a.k.a. welcome)

Most of you will want to get this up and running really quickly, so here are the steps:

git clone https://github.com/rubys/rails-visitor-counter.git
cd rails-visitor-counter
bundle install

bundle add fly.io-rails
bin/rails generate fly:app --passenger --anycable --redis --avahi
bin/rails deploy

The first three commands set up a development environment. In fact, all you need to do to try the code out locally is bin/rails db:migrate followed by bin/dev. If you want to convince yourself that there are no fly.io specific gubbins in this code, follow the instructions in the readme instead.

The next three commands deploy the fly app, and will do so using Nginx, Phusion Passenger, Rails, Sqlite3, AnyCable RPC, AnyCable Go, Redis, and Avahi. That undoubtedly sounds like overkill for this simple application, but realize that I want to motivate a discussion of a larger application, and do so with running code that people can try out, poke holes in, suggest improvements to, etc.

Visit the app using fly open. The counter will start at two as the first visit established that the site was alive. Open another window and both will now show three.

Before moving on, I want to give a hint of future direction. Excerpt from the generated fly.toml:

[processes]
app = "bin/rails fly:server[web=1;redis=1;anycable-rpc=1;anycable-go=1]"

What I would like to do next is something like:

[processes]
app = "bin/rails fly:server[web=1;anycable-rpc=1]"
ws = "bin/rails fly:server[redis=1;anycable-go=1]"

… and have it just work™. Two VMs, same image. Talking to one another. With only ports 80 and 443 exposed to the world. My app will likely be three VMs, but two is enough for demo purposes.
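For reference, the bracketed argument is ordinary rake-task argument syntax. Here is a sketch of how such a task might turn it into a hash — an illustration only, not the fly.io-rails gem's actual parsing code:

```ruby
# Parse a process spec like "web=1;redis=0;anycable-go=1" into a hash of
# booleans indicating which processes run on this machine.
# Illustrative sketch; not the gem's actual implementation.
def parse_processes(spec)
  spec.split(";").to_h do |pair|
    name, count = pair.split("=")
    [name, count.to_i > 0]
  end
end

parse_processes("web=1;redis=1;anycable-rpc=1;anycable-go=1")
```

The point is that one rake task, driven by one argument string, decides which servers to launch on a given machine; splitting processes across machines then becomes a matter of editing fly.toml.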

Now for an overview of my application.

Dance Showcase Application

… a real world case study.

A local dance competition is an event where students from as few as three and as many as a baker’s dozen dance studios get together to compete.

Ignoring the details, prior art was that an event equated to a spreadsheet, one with multiple tabs, roughly one tab per database table. This approach was labor intensive and error prone. I’ve replaced that with a Rails application with HTML forms, Hotwired Turbo, Tailwind, WebSockets, and a Sqlite3 database.

A second event is a second instance of the same Rails application, along with a second database. There are techniques for multi-tenancy within a single database, but frankly the complexity is not worth it.

Phusion Passenger enables multiple Rails apps to run on a single VM, each at a different URL path, each scaling to zero when not in use, and potentially the entire VM scaling to zero when none are in use.
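Passenger’s sub-URI mounting is what makes this work. A sketch of the nginx side, with illustrative paths and event names (not my actual config):

```nginx
# One Rails app instance per event, each mounted at its own URL path.
# Paths and event names below are illustrative.
server {
  listen 3000;
  root /apps/index/public;          # a landing app at /
  passenger_enabled on;
  passenger_min_instances 0;        # allow idle apps to scale to zero

  location ~ ^/2023-spring(/.*|$) {
    alias /apps/2023-spring/public$1;
    passenger_base_uri /2023-spring;
    passenger_app_root /apps/2023-spring;
    passenger_document_root /apps/2023-spring/public;
    passenger_enabled on;
  }
}
```

Each additional event is another location block pointing at another checkout of the same application with its own database.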

If you think in terms of spreadsheets, updates are measured in transactions per minute or even hour, but responsiveness is in milliseconds. This is true for my app too. Supporting dozens of events on a single VM is no problem. But you want each VM to be close to the actual event venue for responsiveness.

WebSockets have unique requirements. For starters, a services.concurrency hard limit of 25 is completely unreasonable. Second, a Rails VM will rarely get to scale to zero when somebody leaves their laptop on with a connection open. For these reasons, running AnyCable on a second VM is of interest.
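For the WebSocket VM, that means raising the limits in fly.toml, something along these lines (the numbers here are illustrative):

```toml
[services.concurrency]
  type = "connections"
  soft_limit = 512
  hard_limit = 1024   # the default of 25 is far too low for long-lived WebSockets
```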

Redis is similar. An Upstash Redis database is described as serverless and is not designed for long running sessions. As such it is quite appropriate for Rails caching, but not a particularly good match for ActionCable/AnyCable.

Rails apps are memory hungry. Having the app scale to zero addresses much of the issue, but there is one activity that can make memory spike - and that is report generation. Producing a webpage, talking users through the various browsers’ print dialogs, and dealing with differences between browsers and operating systems isn’t worth it to me. Instead I generate PDFs using puppeteer and Chrome. And Chrome is also known to be RAM hungry.

Putting Chrome on a separate VM, with a trivial web server front end, is totally doable, and having it, too, scale to zero using Phusion Passenger means it scales up only when needed to meet demand.
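The front end for such a Chrome VM only needs to accept a URL and hand back PDF bytes. A minimal Ruby client sketch — the chrome.internal hostname and JSON payload shape are assumptions for illustration, not an actual service:

```ruby
require "net/http"
require "json"
require "uri"

# Hypothetical internal address of the Chrome/puppeteer rendering VM.
PDF_SERVICE = URI("http://chrome.internal:3000/pdf")

# Build the request asking the rendering service to print a page to PDF.
def build_pdf_request(page_url)
  req = Net::HTTP::Post.new(PDF_SERVICE)
  req["Content-Type"] = "application/json"
  req.body = JSON.generate(url: page_url, format: "Letter")
  req
end

# Send the request and return the raw PDF bytes.
def render_pdf(page_url)
  req = build_pdf_request(page_url)
  res = Net::HTTP.start(PDF_SERVICE.host, PDF_SERVICE.port) { |http| http.request(req) }
  res.body
end
```

Keeping the interface this thin means the Rails app never loads Chrome into its own memory space; the spike happens on a machine sized for it.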

That is scaling up a single region; what’s left is scaling out to multiple regions. And this turns out to be brain dead simple. The same pattern of three machines can be deployed to multiple locations. Each event will have a unique URL and a home region. Nginx at each location can be configured to forward requests that don’t match the current location.
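A sketch of that forwarding rule, with an illustrative event path and internal hostname (the names are assumptions, not my actual config):

```nginx
# Requests for an event homed in another region get proxied there over
# the private network; everything else is served locally.
location /2023-boston/ {
  proxy_pass http://bos.showcase.internal:3000;
  proxy_set_header Host $host;
}
```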

At the current time, I see no need for database or redis replication. In all I’m willing to accept the following compromises:

  • Roughly 2-3 second cold start time for the Rails app, after which time
    accesses are quick. In all, not too different than spreadsheets.
  • Quick access from states surrounding an event, access from remote locations
    possible but perhaps laggy.

Back to the demo

I’ve shown a demo that can be run independently. I’ve described my current thinking about how my ultimate configuration will work. Now let’s take a peek at what it takes to make this happen, by running git status:

Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
	modified:   Gemfile
	modified:   Gemfile.lock
	modified:   app/views/layouts/application.html.erb
	modified:   config/cable.yml
	modified:   config/environments/development.rb
	modified:   config/environments/production.rb

Untracked files:
  (use "git add <file>..." to include in what will be committed)
	.dockerignore
	Dockerfile
	Procfile.fly
	config/anycable.yml
	config/fly.rb
	config/nginx.conf
	fly.toml
	lib/tasks/fly.rake

That’s a lot of files. To give just one example: if you want to use sqlite3 – and you really should give that database strong consideration, IMHO – you need to create a volume. You need to modify fly.toml to get that volume to mount. You need to move db:migrate from release time to just before startup. And you need to set an environment variable (possibly via a secret, but not necessarily) so that the Rails app knows where to find the DB.
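Concretely, the sqlite3 pieces of fly.toml look something like this (volume name, mount point, and URL are illustrative):

```toml
[mounts]
  source = "sqlite3_data"
  destination = "/mnt/volume"

[env]
  DATABASE_URL = "sqlite3:///mnt/volume/production.sqlite3"
```

The db:migrate step then moves out of the release command and into the server startup script, because the volume is only mounted on the running machine, not on the ephemeral release VM.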

That’s a bit more than the flyctl scanner was set up to handle. Until recently it only selected static files to add based on scans of the source code. I extended it to handle Go templates, but for now I’m proceeding with interpreted ERB templates as they are more powerful and I can iterate faster.

Now let’s focus on deployment.

Even with two VMs sharing the same image, there are differences. Their volume requirements will differ. I mentioned services.concurrency as needing to be higher on the WebSocket VM than on the web server VM. Perhaps I will want different CPU profiles. Perhaps different environment variables. Some ports will be exposed to the outside world; others will only be accessible via the WireGuard network.

I strongly suspect that I’ll quickly outgrow what can be done with the [processes] section of fly.toml. At the opposite extreme is terraform. It clearly can handle all of the requirements I’ve described above, but seems cumbersome. I feel like this is a Goldilocks problem, and eventually I will want to create a DSL that abstracts away things Rails developers will rarely need to touch (like the need to explicitly attach an https handler to port 443) from things they care about, like volumes, secrets, and CPU sizes.
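To make the Goldilocks point concrete, here is a purely hypothetical sketch of what such a DSL could look like in Ruby — none of these methods exist today; the sketch only shows the shape, using instance_eval for the block syntax:

```ruby
# Hypothetical deployment DSL sketch. Captures per-machine settings that
# the [processes] section of fly.toml can't express.
class MachineConfig
  attr_reader :name, :processes, :volumes, :ports

  def initialize(name)
    @name = name
    @processes = []
    @volumes = {}
    @ports = []
  end

  def process(*names)
    @processes.concat(names)
  end

  def volume(source, destination)
    @volumes[source] = destination
  end

  def port(*numbers)
    @ports.concat(numbers)
  end
end

def machine(name, &block)
  config = MachineConfig.new(name)
  config.instance_eval(&block)
  config
end

app = machine :app do
  process "web", "anycable-rpc"
  volume "database", "/mnt/db"
  port 80, 443                 # only this machine faces the public internet
end
```

Details like the https handler on 443 would be supplied by defaults; only the volumes, processes, and exposed ports the developer actually cares about appear in the file.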

Moving processes between VMs should be part of this too. That’s outside the scope of terraform, but clearly something that could be handled by a DSL. Processes like redis are found by URL, so moving a process should cause its new location to be discoverable via the same URL. I’m experimenting with avahi to make this happen.

Vision

I’ve got enough working that I could ditch the demo and proceed with scaling my application. But I really would like to see the developer experience of deploying an app to fly.io improve based on my experiences.

I’ve also been focusing only on the initial launch experience. Making changes to data providers, redis providers, websocket providers, etc. should also be handled via things like Thor, or perhaps a web interface.

I’d like to have a conversation as to which parts of this are unique to my application, which parts are unique to rails, and which parts are common to many applications, including applications that are run on different frameworks.

If you have applications with similar or related needs, leave a comment. And
everybody has an open invitation to embarrass me. Point out something I missed
or am doing wrong. The purpose of this post is to learn and share.

Todos

Some things I’m currently looking into:

  • anycable-go and anycable-rpc want different URL schemes for redis URLs.
  • anycable-go apparently can’t resolve IPv6 avahi “local” domains
  • run passenger-config build-native-support during build time
  • set config.active_record.sqlite3_production_warning=false

Demo time!

For the impatient, here is a slew of commands:

git clone https://github.com/rubys/rails-visitor-counter.git welcome
cd welcome
bundle add fly.io-rails
bundle update
bin/rails generate fly:app --anycable --redis --passenger --nats
bin/rails deploy
fly open

If you run this, it will deploy a simple visitor counter application. The count will start at two as the first access was made to verify that the application is alive. Nothing too fancy, but if you open a second browser window to the same URL, both windows will show the new count.

The complete instructions on how the application is built can be found in the readme: rails-visitor-counter/README.md in the rubys/rails-visitor-counter repository on GitHub.

Per the generate fly:app parameters, instead of Puma, Action Cable, and Upstash Redis, this application uses Phusion Passenger, AnyCable, and a locally installed copy of Redis. It also installs nats, which I will get to in a minute. There are plenty of other parameters, such as name, org, and region; serverless (which scales to zero) and litefs are the subjects of other demos.

Now for the best part

Use your favorite editor to edit fly.toml. Change the processes section to read:

[processes]
  app = "bin/rails fly:server[web=1;redis=0;anycable-rpc=1;anycable-go=0]"
  ws  = "bin/rails fly:server[web=0;redis=1;anycable-rpc=0;anycable-go=1]"

What this defines is that you want two machines, with the four applications split across them. A 1 means that the application will run on that machine; a 0 means that it runs on the other machine but may be needed by this one. redis, for example, is used by the web application; and anycable-rpc is used by the anycable-go application. Technically web=0; can be omitted from the second list, but whatever.

To deploy, run:

bin/rails deploy

I was unable to get avahi working for service discovery, and even if I could, I’ve read that it may take five or more minutes for things to publish, so I wrote my own using nats. It’s not robust, and only demo quality, but hopefully it can be replaced in the future by a proper service discovery facility. It may add up to 30 seconds to startup, and if you look at the logs you may see one of the anycable processes start before redis, complain that redis is not available, but retry and connect shortly thereafter.
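That startup behaviour is the usual tolerate-a-late-dependency pattern. An illustrative Ruby sketch of it (not the actual AnyCable code):

```ruby
# Keep attempting an operation until the dependency comes up, with a
# short delay between attempts. Illustrative sketch only.
def wait_for(name, attempts: 30, delay: 1)
  attempts.times do |i|
    return yield
  rescue StandardError => e
    warn "#{name} not ready (#{e.message}); retry #{i + 1}/#{attempts}"
    sleep delay
  end
  raise "#{name} never became available"
end
```

In the demo, redis lives one machine over on the private network, so a few failed attempts at boot are expected and harmless.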

Everything communicates over Fly’s private networking (see Private Networking in the Fly Docs).

Some assumptions I made: ports 80 and 443 go to the first machine, as does the volume necessary to support sqlite3. Realistically, the zeros and ones aren’t the ideal interface, and the developer will want to be able to control volume and machine sizes, may want to expose other ports, may want to have multiple regions with different configurations in different regions, etc. Ultimately, this is probably a Goldilocks problem - the process section in the toml file may be too lightweight, something like terraform may be too heavyweight, and some DSL might be able to strike the right balance.
