Bottomless S3-backed volumes

Fly Volumes provide persistent storage to your app by exposing a slice of a local NVMe drive. This works well in many cases, but it requires a “provision-first” approach: you need to know roughly what you’re going to use in advance (yes, you can manually extend your disks, even without restarting your machine, but you cannot yet shrink them back). Also, there’s only so much space available to a single volume on a single server: right now, the cap is 500 GiB. If you have terabyte upon terabyte of data to store, especially if it’s not frequently accessed, then you’ll need to look for alternatives.

One option might be an object storage service like S3. They can store a virtually limitless amount of data, they have great durability, and they’re cost-effective. There’s a good chance that you’re already using one. And if your app already supports object storage directly, great!

But what if your app doesn’t — like, say, your Postgres database? Or perhaps it would just be easier to store your data as files on a disk, but with the capacity and durability you can get from object storage.

With this in mind, we wanted to know whether disk volumes backed by object storage were possible. So we read up on it and built an experimental version based on recent research into log-structured virtual disks (LSVD). (That’s why you’ll see a lot of lsvd commands below — we’re not committed to the name though :smile:.) As a proof-of-concept, we even got a 100 GiB (and growing) Postgres database running on it—check it out here.

Now it’s your turn to try it out—we want to know what you think!

:warning: NB: Just a quick reminder before you proceed—this is an experimental feature. It’s not ready for production use, and it’s not officially supported. If you run into problems, please get in touch with us here on the community forum.

How can I try it?

Before starting, make sure that you have the latest flyctl version (you’ll need features introduced in v0.1.103).

Create a bucket

First, you’ll need to create a bucket on S3 or a compatible object storage service. You’ll also need credentials that can read, write, and list objects in that bucket. You can do this however you’d like, but for S3 itself, you can check out the open-source s3-credentials tool. Specifically, you can use the --create-bucket and --bucket-region flags with s3-credentials create to get a new bucket created along with credentials for a new AWS user that has access to it.

:warning: NB: Create your new bucket in a region close to the Fly.io region from which you plan to use it. This is important—it will keep the latency of I/O operations down! For S3, this live RTT table can help you choose.
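
For example, putting those two flags together (with placeholder values; pick a bucket name and a region that’s close to your Fly.io region), the command looks something like this:

s3-credentials create <your-bucket-name> --create-bucket --bucket-region <bucket-region>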

Set up your Fly app

Since this is an experimental feature, we strongly recommend creating a fresh app with fly apps create to test this, rather than using an existing one.

Once you have your bucket and credentials, you can run

fly volumes lsvd setup -a <your-app-name>

It’ll prompt you for the relevant information and set secrets on your app to configure it.

Create an LSVD-enabled Fly Machine

Use the new --lsvd flag with fly machines run to create a new Machine in your app. This will inject and start a background daemon (called lsvd) in your Machine that actually provides the volume. It will be mounted at the path you specified in the previous (setup) step. (The raw volume will be exposed as /dev/nbd0 for those of you who want to get real fancy/crazy with the raw block device.)

Make sure to give your Machine enough memory (--vm-memory), because the lsvd daemon will need some of it. 512 MiB is a good baseline. Larger disks will require more memory: the lsvd process currently needs 2 MiB of memory per GiB of disk, so a 100 GiB disk adds roughly 200 MiB on top of what your app itself needs. (We realize that this is a lot of overhead; we hope to reduce it in the future!)

:warning: NB: Before deploying your machine, you’ll need the CA certificates available in your Docker image, so that the lsvd daemon can connect to object storage over HTTPS. They’re usually easy to add. Here’s an example Dockerfile line for Debian-based images (with apt-get):

RUN apt-get -y update \
    && apt-get -y install ca-certificates \
    && apt-get -y clean \
    && rm -rf /var/lib/apt/lists/*

In summary, this is the full command to get you started with a local Dockerfile:

fly machines run -a <app-name> --lsvd --vm-memory 512 .
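
Once the Machine is up, a quick sanity check that the volume is attached and mounted (this assumes you chose /data as your mount point during setup) is:

fly ssh console -a <app-name> -C "df -h /data"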

:warning: NB: For now, don’t run more than one LSVD-enabled machine per app! They’ll conflict with each other and corrupt your volume.


Nerdy stuff below :arrow_down:, for all of you nerds :nerd_face:

Diving deeper: what are all these objects showing up in S3?

Once you’ve run your shiny new S3-backed volume for a while, you’ll notice objects with funny hexadecimal names appearing in your bucket. Each object contains a log of the writes made to the disk, and the objects are numbered sequentially.

Each log entry records both what part of the disk was written and the actual data that was written. To read back a sector of the disk, you can scan the logs to find the most recent write to that sector and pull the data from that entry.

This basic idea can ultimately be optimized enough to make it practical (for more on this, check out the paper!), and it has a really nice quality: snapshotting is “built-in” to the design. To restore the volume to a given point, all you need to do is ignore all the log entries that come after that point. You can try this yourself:

  1. Write a bunch of data to the disk. Shut the machine down and write down the name of the most recent log object.
  2. Start the machine again, write even more data, and then shut it down again.
  3. Delete all the log objects that come after the one you found in step (1).
  4. Start the machine one last time. Observe that the disk state rolled back to how it was at the end of step (1).
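
To make the log-scan idea above a little more concrete, here’s a deliberately tiny Python sketch (this is not the real lsvd code or its on-disk format; it models one 4 KiB sector per log entry) of reading back the most recent write, and of rolling back by ignoring the tail of the log:

# A tiny conceptual model of the LSVD idea -- not the real lsvd code or format.
# The "log" is an ordered list of (sector_number, data) writes, one sector each.

SECTOR_SIZE = 4096

def read_sector(log, sector):
    """Return the most recent data written to `sector`, scanning newest-first."""
    for written_sector, data in reversed(log):
        if written_sector == sector:
            return data
    return b"\x00" * SECTOR_SIZE  # never written: reads back as zeroes

def rollback(log, keep):
    """"Restore a snapshot" by ignoring every log entry after the first `keep`."""
    return log[:keep]

log = [
    (0, b"old".ljust(SECTOR_SIZE, b"\x00")),  # earlier write to sector 0
    (0, b"new".ljust(SECTOR_SIZE, b"\x00")),  # later overwrite of sector 0
]
assert read_sector(log, 0).startswith(b"new")               # latest write wins
assert read_sector(rollback(log, 1), 0).startswith(b"old")  # rolled back to the earlier state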

But what’s the performance like?

There’s no getting around the fact that accessing data from S3 (or a compatible service) will have a much higher latency (tens to hundreds of milliseconds) than a locally attached SSD (tens to hundreds of microseconds). However, like an SSD (and unlike a hard drive), S3 can handle many requests simultaneously. Accordingly, we’ll use up to 100 connections at once to access the backend. We also tune the Linux kernel in your Machine to read ahead aggressively during sequential reads to provide reasonable throughput even with high latency. From our testing, up to a few thousand IOPS and transfer rates on the order of 10 to 100 MiB/s are achievable.

The tradeoffs won’t make sense for every app! However, if your application issues many I/O operations concurrently, deals with a large amount of cold data, and can work with the latency, we think that this is an option worth exploring.
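
If you’d like to measure this with your own setup, one rough approach is a random-read fio run with a deep queue, since a high iodepth is what lets S3’s concurrency work in your favor. This is just an illustrative invocation (it assumes fio is installed in your image and that your volume is mounted at /data):

fio --name=lsvd-randread --filename=/data/fio-test --size=1G --rw=randread --bs=4k --iodepth=64 --ioengine=libaio --direct=1 --runtime=30 --time_based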

:warning: NB: If you’re looking to improve performance, here are some additional things to consider:

  • If you’re running a database, then it may be possible to tune it. E.g., Postgres users might try setting effective_io_concurrency to 100 to match the number of concurrent I/O operations available.
  • If you’re feeling intrepid, then you can experiment with locally caching hot data via the Linux kernel’s dm-cache feature, which is built into our guest kernel. If interested, let us know below and we can share some additional information about this!
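
For reference, the Postgres setting mentioned above is a one-line change in postgresql.conf (no server restart required; a reload is enough):

effective_io_concurrency = 100   # in postgresql.conf; takes effect on reload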

Acknowledgments

This feature is based on research presented in the LSVD paper and earlier work on block devices in userspace. You can find software published with the papers on GitHub: asch/dis, asch/buse, and asch/bs3.

Furthermore, we built the lsvd program that gets added to your VM using a number of open-source libraries—you can see the full list by running /.fly/lsvd licenses in your LSVD-enabled Machine.

24 Likes

Well this sounds amazing.

Just wanted to make sure I’m understanding right…

If I have an app that mounts a volume on fly with say a 100 gb limit, but then use this service to back it up to s3, what is that 100 gb used for? Would it just be better to have a 1 gb volume that is kind of a pass-through to s3 or does volume size still matter for some sort of temp file thing?

If I have a mounted volume next to the app, it feels like I’m using the local filesystem. That would still be true even though the files are hosted on s3 right?

If I had 2 servers, each with a volume backed by s3, I could have either volume backup to s3 and/or access files from s3? (I ask because I think fly apps can’t share volumes?)

1 Like

Hi @gdub01! Thanks for asking.

To clarify, for now these experimental S3-backed volumes are separate from the existing “Fly Volumes” product based on local NVMe drives. They do not back up or mirror a local NVMe Fly Volume to S3. Rather, when you set up LSVD, you get a virtual disk attached to your Machine whose data is stored and accessed entirely on S3, with no local Fly Volume involved.

Since disks are expected to have a fixed size, you need to configure one for your S3-backed volume. However, only data that are written to the volume are sent to S3. If you create a 1 TiB S3-backed volume but write only 1 GiB of data to it, then your S3 bucket will have ~1 GiB of data in it. If you create a 1 TiB S3-backed volume and overwrite the entire disk twice, then your S3 bucket will have ~2 TiB of data in it. (A future iteration of this would be able to clean overwritten data from S3 and bring that back down toward 1 TiB.)

Exactly. From the perspective of your Machine, the S3-backed volume is just another disk drive. If you configure a mount point when you run fly volumes lsvd setup, then we’ll automatically format the drive with a filesystem and mount it for you, and you can use it like any other filesystem.

Admittedly, I’m not quite sure what you’re asking (sorry about that! :sweat_smile:), but I hope that this might help:

  • Like existing Fly Volumes, these S3-backed volumes are designed to attach to a single Fly Machine. Making these work as shared disks isn’t something we’ve thought much about yet.

  • However, since the data is stored remotely in S3, you can destroy your LSVD-enabled Fly Machine and create a new one, and your volume should continue to work on the new Machine—even if it’s on a different physical host, or even in a different region!

  • The fly volumes lsvd setup tool configures your app to run one single LSVD-enabled Machine (and therefore one S3-backed volume). However, it’s possible to run n LSVD-enabled Machines (so n independent S3-backed volumes) by pointing each Machine at its own S3 bucket. If anyone here wants to try this, feel free to bug me for details here! (Or perhaps one of you will figure it out by reading the flyctl source code. :wink:)

3 Likes

JFYI, Azure Blobs seem to be considerably faster than S3, like 3-5x better in terms of latencies. I know this because we use a similar system based on Rclone.

Another useful feature for this kind of volume is end-to-end encryption. We use that too (provided by Rclone).

2 Likes

Whoa! What a fab sounding feature!

Are there clients we can use to read and write that data from outside of Fly? For example, for disaster recovery.

3 Likes

@lpil we haven’t built an external client, but thanks for asking—it’s making me think about what kind of tooling a “full-scale” version of this might need! :slight_smile:

For now, I can at least share the (current) format of the S3 objects so that you know how your data is actually stored:

Each object starts with a fixed-size (12 KiB) header. The header consists of 1,024 12-byte entries, each of which describes a single write to the volume. The entries have the following fields, in this order:

  1. An unsigned 64-bit big-endian integer giving the address of the write (in bytes), where 0 is the first byte of the volume
  2. An unsigned 32-bit big-endian integer giving the size of the write (in bytes)

The implementation uses a sector size of 4 KiB, so in practice all of the addresses and sizes will be multiples of 4,096. An entry with a size field of 0 indicates that not all 1,024 of the header entries are used, and that it (and all subsequent header entries) should be ignored.

After the 12 KiB header, you’ll find the actual data associated with each write, stored in the same order as the header entries.
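
If you’d like to poke at these objects yourself, here’s a rough Python sketch of a parser for the header format described above. It’s an illustration based on this description (not the reference implementation), and the format may change while this is experimental:

import struct
import sys

HEADER_SIZE = 12 * 1024   # fixed 12 KiB header
ENTRY_COUNT = 1024        # 1,024 entries of 12 bytes each
ENTRY_FORMAT = ">QI"      # u64 big-endian address + u32 big-endian size

def parse_log_object(path):
    with open(path, "rb") as f:
        header = f.read(HEADER_SIZE)
        entries = []
        for i in range(ENTRY_COUNT):
            address, size = struct.unpack_from(ENTRY_FORMAT, header, i * 12)
            if size == 0:  # terminator: this and all later entries are unused
                break
            entries.append((address, size))
        # The data payloads follow the header in the same order as the entries.
        for address, size in entries:
            yield address, f.read(size)

if __name__ == "__main__":
    for address, data in parse_log_object(sys.argv[1]):
        print(f"write of {len(data)} bytes at byte offset {address}")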

:warning: NB: When generating the log objects, this implementation will merge/flatten/defragment overlapping and adjacent writes. For some workloads, this significantly reduces the amount of data stored in S3 and improves the fragmentation situation, but the trade-off is that the entries in the objects are no longer a literal log of the writes made to the disk. As a result, stopping in the middle of an object when replaying the log will not in general reproduce a disk state that ever actually existed. However, stopping in between objects will give you a true/valid snapshot.

3 Likes

Is “An io_uring-based user-space block driver” [LWN.net] available in the Fly Machine?

Not quite yet! I’ve actually played around with this a bunch. It’s available in the 6-series kernel which we haven’t generally switched to yet unfortunately, but when we do upgrade, I plan to make it available.

4 Likes

Will/would Fly charge me for egress when writing to the filesystem?

1 Like

This sounds great!

Any info on pricing?

1 Like

EDIT

I figured out what was needed: the mount is an ext4 filesystem on top of the volume, so I ssh’d in, ran resize2fs /dev/nbd0, and it worked!


Original post:

This is really cool! It’s perfect for a specific use case of mine: running ArchiveBox in a way that lets me capture large amounts of data, but without needing a huge volume allocated + attached all the time.

Some feedback: Tried it out by loading up the drive until it was full, initially 3gb worth. Then went and tried to increase the size by updating the FLY_LSVD_DEVICE_SIZE secret, and now it’s having trouble booting. Here are the logs:

2023-10-11T00:27:15.917 app[3d8d7669b1dd68] lax [info] [ 0.044189] Spectre V2 : WARNING: Unprivileged eBPF is enabled with eIBRS on, data leaks possible via Spectre v2 BHB attacks!
2023-10-11T00:27:15.975 app[3d8d7669b1dd68] lax [info] [ 0.069450] PCI: Fatal: No config space access function found
2023-10-11T00:27:16.173 app[3d8d7669b1dd68] lax [info] INFO Starting init (commit: 5d9c42f)...
2023-10-11T00:27:16.200 app[3d8d7669b1dd68] lax [info] INFO [fly api proxy] listening at /.fly/api
2023-10-11T00:27:16.203 app[3d8d7669b1dd68] lax [info] 2023/10/11 00:27:16 listening on [fdaa:1:53b:a7b:1e4:e173:d8c1:2]:22 (DNS: [fdaa::3]:53)
2023-10-11T00:27:16.208 app[3d8d7669b1dd68] lax [info] lsvd: Starting (commit 58f00b1).
2023-10-11T00:27:16.228 app[3d8d7669b1dd68] lax [info] lsvd: Starting recovery from the archive-zim-dev bucket.
2023-10-11T00:27:18.347 app[3d8d7669b1dd68] lax [info] lsvd: Recovery complete (processed 448 objects).
2023-10-11T00:27:18.449 app[3d8d7669b1dd68] lax [info] lsvd: Archived 0 objects after the recovery point.
2023-10-11T00:27:18.450 app[3d8d7669b1dd68] lax [info] lsvd: Connecting the kernel to the device.
2023-10-11T00:27:18.462 app[3d8d7669b1dd68] lax [info] lsvd: Device is now operational.
2023-10-11T00:27:18.462 app[3d8d7669b1dd68] lax [info] lsvd: Mounting /dev/nbd0 on /data.
2023-10-11T00:27:19.238 app[3d8d7669b1dd68] lax [info] INFO Preparing to run: `dumb-init -- /app/bin/docker_entrypoint.sh archivebox server --quick-init 0.0.0.0:8000` as root
2023-10-11T00:27:20.369 app[3d8d7669b1dd68] lax [info] find: ‘/home/archivebox/.config/chromium/Crash Reports/pending/’: No such file or directory
2023-10-11T00:27:21.470 app[3d8d7669b1dd68] lax [info] [i] [2023-10-11 00:27:21] ArchiveBox v0.6.3: archivebox server --quick-init 0.0.0.0:8000
2023-10-11T00:27:21.470 app[3d8d7669b1dd68] lax [info] > /data
2023-10-11T00:27:22.085 app[3d8d7669b1dd68] lax [info] find: ‘/home/archivebox/.config/chromium/Crash Reports/pending/’: No such file or directory
2023-10-11T00:27:22.327 app[3d8d7669b1dd68] lax [info] OSError: [Errno 28] No space left on device
2023-10-11T00:27:22.327 app[3d8d7669b1dd68] lax [info] During handling of the above exception, another exception occurred:
2023-10-11T00:27:22.327 app[3d8d7669b1dd68] lax [info] Traceback (most recent call last):
2023-10-11T00:27:22.327 app[3d8d7669b1dd68] lax [info] File "/usr/local/bin/archivebox", line 33, in <module>
2023-10-11T00:27:22.327 app[3d8d7669b1dd68] lax [info] sys.exit(load_entry_point('archivebox', 'console_scripts', 'archivebox')())
2023-10-11T00:27:22.329 app[3d8d7669b1dd68] lax [info] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2023-10-11T00:27:22.329 app[3d8d7669b1dd68] lax [info] File "/app/archivebox/cli/__init__.py", line 140, in main
2023-10-11T00:27:22.329 app[3d8d7669b1dd68] lax [info] run_subcommand(
2023-10-11T00:27:22.329 app[3d8d7669b1dd68] lax [info] File "/app/archivebox/cli/__init__.py", line 74, in run_subcommand
2023-10-11T00:27:22.329 app[3d8d7669b1dd68] lax [info] setup_django(in_memory_db=subcommand in fake_db, check_db=cmd_requires_db and not init_pending)
2023-10-11T00:27:22.329 app[3d8d7669b1dd68] lax [info] File "/app/archivebox/config.py", line 1249, in setup_django
2023-10-11T00:27:22.329 app[3d8d7669b1dd68] lax [info] with open(settings.ERROR_LOG, "a", encoding='utf-8') as f:
2023-10-11T00:27:22.329 app[3d8d7669b1dd68] lax [info] OSError: [Errno 28] No space left on device
2023-10-11T00:27:23.245 app[3d8d7669b1dd68] lax [info] INFO Main child exited normally with code: 1
2023-10-11T00:27:23.245 app[3d8d7669b1dd68] lax [info] INFO Starting clean up.
2023-10-11T00:27:23.246 app[3d8d7669b1dd68] lax [info] lsvd: Received signal. Starting to shut down gracefully.
2023-10-11T00:27:23.246 app[3d8d7669b1dd68] lax [info] lsvd: Looking for mounted filesystems on the device.
2023-10-11T00:27:23.247 app[3d8d7669b1dd68] lax [info] lsvd: /dev/nbd0 is mounted on /data; unmounting.
2023-10-11T00:27:23.248 app[3d8d7669b1dd68] lax [info] WARN hallpass exited, pid: 305, status: signal: 15 (SIGTERM)
2023-10-11T00:27:23.249 app[3d8d7669b1dd68] lax [info] lsvd: Disconnecting the kernel from the device.
2023-10-11T00:27:23.253 app[3d8d7669b1dd68] lax [info] 2023/10/11 00:27:23 listening on [fdaa:1:53b:a7b:1e4:e173:d8c1:2]:22 (DNS: [fdaa::3]:53)
2023-10-11T00:27:23.262 app[3d8d7669b1dd68] lax [info] lsvd: Flushing remaining writes to object storage.
2023-10-11T00:27:23.779 app[3d8d7669b1dd68] lax [info] lsvd: Device has shut down.
2023-10-11T00:27:23.784 app[3d8d7669b1dd68] lax [info] WARN lsvd exited, pid: 304, status: exit status: 0
2023-10-11T00:27:24.786 app[3d8d7669b1dd68] lax [info] [ 8.877815] reboot: Restarting system
2023-10-11T00:27:25.012 runner[3d8d7669b1dd68] lax [info] machine restart policy set to 'no', not restarting
1 Like

@kot @pier at this (experimental) stage there’s no special pricing for this feature, so for billing purposes the lsvd process in your Machine is indistinguishable from your app itself. Therefore, outbound data transfer from the lsvd process is billed as usual. Likewise, if you increase your Machine’s CPU or RAM to accommodate the lsvd process, that will be billed as usual. There’s no cost for simply enabling the lsvd process though!

It’s hard to say this early what might be different pricing-wise for a future GA version.

@mileszim it’s awesome to see you using this, and I’m glad that you figured it out! (Thanks for posting the solution too!)
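
For anyone else who wants to grow an S3-backed volume later: the rough sequence (assuming your volume was auto-formatted as ext4 on /dev/nbd0, as in this case, and with <new-size> as a placeholder) is to update the size secret, restart the Machine so it picks up the new value, and then grow the filesystem from inside the Machine:

fly secrets set -a <app-name> FLY_LSVD_DEVICE_SIZE=<new-size>
fly ssh console -a <app-name> -C "resize2fs /dev/nbd0"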

4 Likes

I tried to use it with Cloudflare R2 and it just failed with this error. Maybe because it needs to use the region value “global”?

thread 'main' panicked at 'removal index (is 0) should be < len (is 0)', init/src/main.rs:548:25

By the way, if anyone wants to use this feature as a shared volume, I’d recommend Cloudflare R2, because it works like the earth region. Of course, it may depend on your technical limitations :man_shrugging:

@smorimoto I believe that that happens when no command is set for your machine to run (either explicitly through flyctl or the Machines API, or with a CMD or ENTRYPOINT in the Dockerfile). If you’re still encountering this, can you double-check that?

(Also, when I get the chance, I’ll see about replacing that crash with a proper error message!)

@MatthewIngwersen Indeed, that fixed the thing! But I’m encountering a new one… I will come back here if it doesn’t get fixed with some more debugging.

Anyway, this is really great and I can’t wait to see a stable release soon!

Is there anything interesting in the works here now with Tigris being integrated on Fly? @MatthewIngwersen: I’m guessing you already know about https://objectivefs.com/ and https://cuno.io/? Those are proprietary software but perhaps can provide some inspiration for either architecture or performance benchmarks?

I think the killer feature would be if you could provision a smallish volume per machine that could act as a cache that’s backed by Tigris, allowing each machine to see the same, globally (eventually?) consistent, bottomless filesystem, but with 0 latency access (and fewer billed Tigris requests) to frequently read files :slight_smile:

5 Likes

Bottomless storage baked into Fly Machines would open up many interesting use cases for sure.

2 Likes

Sorry for the (very) late reply here! We haven’t done anything specific with S3-backed volumes + Tigris yet, but I’m hoping to make some time to test-drive the combination soon.

Caching on smaller local volumes is something on our radar too. @Alex21, if you’re willing, I’d love to hear if there are any specific use-cases or problems that made you think about this feature. (It helps us figure out what we should prioritize building!) And thanks for the links too!

Also just wanted to note that I pushed out a very small update yesterday—the lsvd binary that runs in your Machines used to depend on the GNU C library, so it wouldn’t work with some images (particularly anything Alpine-based). This has been fixed. You should now be able to run S3-backed volumes on Alpine images, images built FROM scratch, even registry.k8s.io/pause if you are so inclined :laughing:.

2 Likes