Fly for data pipelines?

Is Fly a good fit for running data ELT pipelines?

I want to self-host Dagster for orchestration. Then I pull data from multiple sources and store it in a DuckLake on object storage.

Once a month, I need to parse a large XML file into Parquet. This is CPU-bound. Would it be possible to spin up a Fly Machine for an hour once a month and do the calculations?

Can I consider Fly to be basically serverless compute that I can spin up for bursty workloads?

Does anyone have experience with using Fly for data engineering work?

You’ll need to use the performance CPU, which can get expensive. I believe AWS is the cheapest solution.

Yeah, this is a good use of Fly Machines! I recommend creating a new machine for each run and setting auto_destroy so the machine is removed automatically when it’s done with the monthly job. See the API documentation.
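
For anyone who wants to see roughly what that looks like, here is a minimal sketch of creating a one-off machine through the Machines REST API from Python. The image name, CPU count, and memory size are placeholders I made up, and the helper names (`build_machine_request`, `create_machine`) are hypothetical; check the field names against the current API documentation before relying on this.

```python
import json
import urllib.request

# Public Machines API endpoint (assumption: you are calling from outside Fly's private network).
API_BASE = "https://api.machines.dev/v1"


def build_machine_request(image, cpus=4, memory_mb=8192):
    """Build the JSON body for a one-off, self-cleaning machine.

    The sizes are illustrative defaults, not a recommendation.
    """
    return {
        "config": {
            "image": image,
            # Remove the machine once its main process exits.
            "auto_destroy": True,
            # Do not restart the job after it finishes.
            "restart": {"policy": "no"},
            # Dedicated ("performance") CPUs for the CPU-bound parse.
            "guest": {
                "cpu_kind": "performance",
                "cpus": cpus,
                "memory_mb": memory_mb,
            },
        }
    }


def create_machine(app_name, token, payload):
    """POST the payload to the Machines API and return the decoded response."""
    request = urllib.request.Request(
        f"{API_BASE}/apps/{app_name}/machines",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Hypothetical image name; nothing is sent until create_machine() is called.
payload = build_machine_request("registry.fly.io/my-pipeline:latest")
```

Calling `create_machine("my-app", token, payload)` with a real app name and API token would start the run; the machine should then disappear on its own once the job's process exits.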

AWS also has shared, or “burstable”, CPUs. I quickly looked up per-hour pricing for c5/m5 dedicated cpu instances and they seem comparable to our performance CPU pricing.

… although I would hope that one chooses Fly because it’s a better fit for their workload rather than because it’s cheaper than $competitor :)

Yes. Following @lillian’s advice, I’d build your pipeline as a Docker container so that when your process ends, the container dies naturally. You could trigger it from the scheduler in GitHub Actions (or other hosted CI).
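
To make the "container dies naturally" part concrete, here is a rough sketch of a Python entrypoint for such a container. It assumes the whole monthly job is one function; `convert_xml_to_parquet` is a hypothetical placeholder, not real conversion code. The point is only that the process returns an exit code and stops, so nothing keeps the container alive.

```python
import logging

logging.basicConfig(level=logging.INFO)


def run_job(job):
    """Run the job once and translate the outcome into a process exit code."""
    try:
        job()
    except Exception:
        # Log the traceback so the failure is visible in the machine's logs.
        logging.exception("monthly job failed")
        return 1
    logging.info("monthly job finished")
    return 0


def convert_xml_to_parquet():
    """Hypothetical placeholder for the real XML-to-Parquet conversion."""
    pass


# In the container's entrypoint script, the last line would be something like:
#     sys.exit(run_job(convert_xml_to_parquet))
# (with `import sys` at the top), so the container stops as soon as the job ends.
```

A non-zero exit code also gives whatever triggered the run (a CI job, for example) a simple way to notice that the month's run failed.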

So there is a little config to write here, but not much.