This is only possible on fly...

Whenever I check the community forum, it’s usually just a bunch of problems. So I thought I’d post something positive (and really cool).

Here’s the worldwide outbound traffic over the past 7 days for a LiveView application that has been running on Fly for 6+ months with 18 minutes of downtime on v1 + Stolon Postgres (cut over to v2 last week, and hopefully cutting over to SQLite/LiteFS shortly).

It’s an application trying to give a desktop experience to users worldwide and latency matters.

[image: worldwide outbound traffic map]
For those who don’t know, the blue/purple dots are exit nodes and the yellow-red ones are app servers (looks like I need an app server in Australia now, and should perhaps move Miami to Mexico). That’s 30GB of traffic/month, about 50-100k LiveView interactions/day, and 150 concurrent connections.

For $29/month, this is absolutely mind-bottling.

I wrote this in July '21.

Now after all this time, I’m pretty confident you guys aren’t going to mess this up, but seriously…please don’t mess this up. This is only possible on fly.io. Thanks and great work.

43 Likes

Really cool, thanks!

Right, people come here looking for help, so it makes sense that it looks like a list of problems :sweat_smile: (there are also announcements from Fly! I think I heard that we might start emailing those out as well! Imagine that - email!)

4 Likes

That’s really cool!

Are you using PG replicas? Do you have a repo to see how you set it up?

Are you the guy using Svelte to replicate macOS or something similar? :thinking:

Yah, how true…hardly anyone posts when something is working. :smiley:

Haha, no. I don’t have the required brains to be able to use javascript frameworks successfully. Just using Elixir/Phoenix LiveView. There’s a lot of optimistic UI, but there are quite a few data changes/calculations that need to be done server-side. Users were previously on desktop, so there’s a certain expectation of speed.

That was my initial plan last August, but after reviewing the Stolon setup and replication, I had no confidence I could successfully recover from a failure. PG replication has always been a bit of a pain, and Fly Postgres is a layer of abstraction that is not quite managed, so at the time of review it seemed like an unnecessary risk compared to the alternative.

So, instead of that, in preparation for moving to fly, I did the following:

  • moved several GBs of read-only data out of Postgres into SQLite
  • baked these SQLite DBs into the Docker image. This made the image about 900MB, but it makes management much easier.
  • set up all the Elixir nodes in a cluster (libcluster | Hex); see the config sketch below
  • wrote a simple distributed cache to manage data from the master (only took about 4 hours, to be honest)

After all these changes, the load on the primary was < 1 qps and even less on writes.
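
For the clustering bullet above, a minimal sketch of a libcluster topology for Fly’s internal 6PN DNS; the fallback app name, the :fly6pn label, and the polling interval are placeholders, and it assumes the usual import Config at the top of config/runtime.exs:

  # config/runtime.exs -- poll Fly's internal DNS so the Elixir nodes find each other.
  app_name = System.get_env("FLY_APP_NAME") || "myapp"

  config :libcluster,
    topologies: [
      fly6pn: [
        strategy: Cluster.Strategy.DNSPoll,
        config: [
          polling_interval: 5_000,
          query: "#{app_name}.internal",
          node_basename: app_name
        ]
      ]
    ]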

For the rest of the data, I have been waiting on LiteFS. I wrote a library to send writes to the primary, as fly-replay can’t be used with LiveView. I converted my staging environment a few days ago using the Fly multi-tenant Consul and am just waiting for LiteFS 0.4 to drop before I start swapping over production.
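
To sketch the general idea (this isn’t the actual library; the module name, mount path, and node-naming convention below are all placeholders): LiteFS exposes the current primary’s hostname in a .primary file on replicas, so a write can be handed to the primary over Distributed Erlang instead of fly-replay.

  defmodule Writes do
    # LiteFS FUSE mount path is an assumption; adjust to your fuse dir.
    @primary_file "/litefs/.primary"

    def on_primary(fun) when is_function(fun, 0) do
      case File.read(@primary_file) do
        # No .primary file: this node holds the lease, so run the write locally.
        {:error, :enoent} -> fun.()
        # Otherwise run the same function on the primary node.
        {:ok, hostname} -> :erpc.call(:"myapp@#{String.trim(hostname)}", fun)
      end
    end
  end

Usage would be something like Writes.on_primary(fn -> Repo.insert!(changeset) end).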

LiteFS failover is comprehensible and recovery is straightforward. The only tricky bit is Consul and whether to use my own cluster or the multi-tenant one provided by Fly. (Aside: it would be nice to not lock replies after 7 days, as then posts could be updated with relevant information. It would be handy to update Small consul cluster + litefs failover with the setup.)

@benbjohnson has said LiteFS can handle on the order of tens of writes per second due to the FUSE overhead. I thought it would be more, since SQLite can handle between 4k and 40k on a single thread, but even tens of writes per second is fine.

In my testing, LiteFS has performed quite well at 0.3 with the Consul lease. On deploys, writes are disabled for less than a few seconds as failover happens. You may need more features, but I have been rolling my own backups (for PG and now SQLite) and don’t need the write proxy that is available in 0.4. It obviously depends on your application and your data patterns, but I’ve been using it in production for read-only data for 1+ years and it has performed better than Postgres did.

So my recommendation is to take a serious look at sqlite+litefs rather than trying to do multi-region postgres.

5 Likes

@pier - forgot to mention, I also tried out a few other solutions back then before deciding to wait on litefs.

  1. PlanetScale - I converted my dev environment and played with it for a week, but their pricing is based on row reads and at the end of the day I didn’t want to deal with it.
  2. Neon.tech - they were in alpha/beta. I signed up for them, but they never got back to me.
  3. Crunchy - wasn’t sure if they did multi-region.
  4. ElephantSQL - wasn’t sure if they are multi-region.

What kind of data are you dealing with?

3 Likes

Just regular relational data.

I’m mostly looking at improving the performance of the dashboard where users manage their content, upload stuff, etc. The primary PG is in AMS. The latency worldwide is not bad, but I’d be happy to improve read perf with edge replicas.

I have another distributed service using the same DB that is only reading some of the data and eventually will have lots of traffic. Already solved this with an in-memory cache and a pub/sub system to purge/update that edge cache.

1 Like

The benchmarks on that page seem to test pure CPU overhead, as they removed a bunch of safety & durability guarantees:

  • File system is tmpfs so it’s all in-memory
  • No syscalls for locks; mutex-free locking patch was applied
  • SYNCHRONOUS pragma is set to off

In practice, I find that SQLite can do a few thousand write transactions per second when the database is a decent size (~1GB) and there are a handful of DML operations per transaction. LiteFS isn’t heavily optimized on the write side yet, but I would expect it to do between 100 and 200 write tx/sec through FUSE once we do optimize it. Once we add a VFS extension for LiteFS, I would expect that to handle 1k-2k write tx/sec, as it will avoid the FUSE overhead.

On the read side, LiteFS should largely be the same as regular SQLite since it uses the OS page cache and mostly avoids the FUSE layer. (I mentioned some of this in the issue but I’m adding here for anyone else perusing the forum)

Since you’re on apps v2, you could run with a static lease and designate a single machine to be your primary. You’ll briefly lose write availability on deploy when that node is restarted but restarts are quite fast on Fly Machines.

2 Likes

@tj1 thanks for sharing your story. So exciting!

Could you elaborate on how your distributed cache works?

I’ve tried implementing a distributed cache with Elixir a number of ways (single global GenServer, Cachex; if memory serves, ETS is not distributed but Mnesia is). Was there a particular approach you took that gave you production-level success?

Also a follow up on the several gigs of data that you moved into SQLite. Is that data effectively static and basically never updated? Just trying to create an image in my head about why it worked in your case.

Thanks!

2 Likes

What a great case study! Thanks for sharing it.

LiteFS looks super cool, but I’m not keen on rolling my own backups, so I’m holding out until the S3 replication feature gets added :slight_smile:

1 Like

It is a GenServer running on each node, using Cachex for gets and sets with expiry and Phoenix.PubSub to broadcast to the other nodes. I have a small dataset being cached, so I wanted it to be identical on each of the nodes to avoid any extra db trips. It is also completely and totally overkill; I only did it because it was quick.

To give you an idea, this is a small tidbit.

  defp set_session_cache_and_broadcast(guid, prefs) do
    # To avoid local race conditions with the pubsub broadcast, we set locally first
    set_session_cache(guid, prefs)
    Phoenix.PubSub.broadcast(Hora.PubSub, "cache", {:set_session_cache, guid, prefs})
  end

  def handle_info({:set_session_cache, key, prefs}, _state) do
    set_session_cache(key, prefs)
    {:noreply, nil}
  end
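
For context, a minimal sketch of the two pieces that snippet assumes: each node’s cache process subscribes to the shared “cache” topic on startup, and set_session_cache/2 is just a Cachex put with an expiry (the cache name and TTL below are made up):

  def init(_opts) do
    # Every node subscribes to the same topic, so one broadcast reaches all caches.
    Phoenix.PubSub.subscribe(Hora.PubSub, "cache")
    {:ok, nil}
  end

  defp set_session_cache(guid, prefs) do
    # Cachex handles per-entry expiry locally on each node.
    Cachex.put(:session_cache, guid, prefs, ttl: :timer.minutes(30))
  end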

Yes, once the data has been created it will never be altered. This will work for anyone that has immutable or historical records. Previously, at my day job, we did rather ludicrous things with SQLite. For instance, during off-hours, we compacted and squashed the data generated during the day into SQLite and then stored it on S3. If they were going to work on that particular project, it was quick to grab the required files and store them on the app server. Anyway, it wasn’t my idea, but the performance improvements were quite absurd on both our app and db servers. So I think this pretty much applies to all datasets, to be honest; it’s just a question of whether it’s worth doing or not.

Exqlite.Sqlite3.execute(conn, "vacuum into '#{backup_path}'")
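
Wrapped up, a backup job around that call could look roughly like the sketch below; the paths and file naming are placeholders, and shipping the file to S3 or wherever is left out:

  def backup(db_path, backup_dir) do
    # "vacuum into" writes a compacted copy of the live database to a new file.
    backup_path = Path.join(backup_dir, "backup-#{Date.utc_today()}.db")
    {:ok, conn} = Exqlite.Sqlite3.open(db_path)
    :ok = Exqlite.Sqlite3.execute(conn, "vacuum into '#{backup_path}'")
    :ok = Exqlite.Sqlite3.close(conn)
  end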

You know you want to do it. It’s tempting you. Feel the power. Join the dark side/lite side.

1 Like

Yes, the high concurrency branch of sqlite is not representative. Like you mentioned, in my benchmarking of unpatched sqlite for my data, I believe it was a few thousand qps. I didn’t pay too much attention to it because it is several orders of magnitude more than I need.

As I’m solo, the failure mode that I’m trying to guard against is that sometimes I am away from electronics for a while and unable to respond to alerts (6 to 72 hours). Temporary write failures are acceptable as long as they recover cleanly without manual intervention.

I’d prefer not to have to write a runbook and hire a company to respond to pages (though I’m probably too small for them to do that anyway), so I’m trying to understand the implications of the static lease vs consul vs multi-tenant consul managed by fly.

  1. Static lease - from my understanding, if a machine fails (with or without a volume), it is not brought up again anywhere else. If this is true, the static lease will not fail over automatically, which means it could be down for days due to me not being available.
  2. Self-managed Consul - 3 nodes - since any single machine may go down due to hw failure and not be moved, there need to be 3 instances of Consul running in a single location. Since it is a simple lease mechanism, there is no need to retain history or have a volume, so we don’t have to worry about that. At least 2 of 3 should be up most of the time for quorum. If Consul is down, LiteFS will be read-only on boot. However, if a node that holds the lease can’t reach Consul, does it retain the lease?
  3. Self-managed Consul - 1 node - if a machine fails and can be reliably brought up on another node without a volume, then this should suffice. There will be some write downtime while it is being restarted.
  4. Multi-tenant consul (fly-managed).
    On v2, the Consul URL can be acquired via the GraphQL Playground with the following mutation:
    mutation{ enablePostgresConsul(input:{appId: "your_app_id"}) { consulUrl } }

I’m leaning towards 4 right now. Even though it is multi-tenant, there is a team actually watching it. Six hours of write degradation is much preferable to the alternative of me being on a 24-hour flight and not responding for 30+ hours.

Now, if a v2 machine (not going back to v1) can be reliably restarted on a different node without a volume in any region, I would actually be inclined to go with option 3: Consul in dev mode, only accessible via 6PN.

[aside]
Going down the rabbit hole: for this particular use case of ensuring that everything just works, single-node Redis for the leasing could be even more reliable. No need for any secondaries, and the following sets the key “someapp-lease” only if it doesn’t already exist, with a one-minute expiry. No need for the Redlock algorithm. (An Elixir sketch follows this aside.)

redis> SET someapp-lease "will expire in a minute" NX EX 60

[/aside]
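
In Elixir terms, acquiring that lease would be roughly the following; Redix, the key name, and the caller are assumptions, and connection setup is omitted:

  def try_acquire_lease(conn, node_id) do
    case Redix.command(conn, ["SET", "someapp-lease", node_id, "NX", "EX", "60"]) do
      # We now hold the lease for 60 seconds (renew before it expires).
      {:ok, "OK"} -> :acquired
      # NX failed: someone else already holds it.
      {:ok, nil} -> :held_by_other
    end
  end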

Given the current v2 implementation, and that I’m a solo project owner who may be away from computer/phone for long periods, I think the correct choice is the Fly-managed multi-tenant Consul to ensure the most reliability. I’m sure you guys have a giant roadmap somewhere, but is there any other information that could be useful here?

Yes, that’s true.

No, it has to renew the lease every few seconds. If the node can’t reach Consul then it has to assume that it no longer has the lease.

Is this any more reliable than running a single Consul node?

I think you have a good handle on everything. For your use case, I think the multi-tenant Consul works best. Static leasing would be an OK option if you can handle some write downtime when a catastrophic failure happens on the primary (which is rare) and you can respond quickly to move to a new primary, but it doesn’t sound like that’s the case for you.

1 Like

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.