I want to self-host Dagster for orchestration. I then pull data from multiple sources and store it in a DuckLake on object storage.
Once a month, I need to parse a large XML file to Parquet. This is CPU-bound. Would it be possible to spin up a Fly machine for an hour once a month and do the calculations?
Can I basically consider Fly to be serverless compute that I can spin up for bursty workloads?
Does anyone have experience with using Fly for data engineering work?
Yeah, this is a good use of Fly Machines! I recommend creating a new machine for each run and using auto_destroy so the machine is automatically removed when it’s done with the monthly job. See the API documentation.
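For illustration, here is a rough sketch of what "a new machine per run with auto_destroy" could look like against the Machines REST API, using only the Python standard library. The app name, image name, and machine size are placeholders I made up, and the payload fields reflect my reading of the API docs, so please check them against the current documentation before relying on this.

```python
import json
import urllib.request

# Public Machines API base URL (verify against the current Fly docs).
API_BASE = "https://api.machines.dev/v1"


def build_machine_request(image, cpus=4, memory_mb=8192):
    """Build the JSON body for a one-off machine.

    The sizes are illustrative assumptions, not a recommendation.
    """
    return {
        "config": {
            "image": image,
            # Remove the machine once its main process exits.
            "auto_destroy": True,
            # Do not restart the job after it finishes or fails.
            "restart": {"policy": "no"},
            "guest": {
                "cpu_kind": "performance",
                "cpus": cpus,
                "memory_mb": memory_mb,
            },
        }
    }


def create_machine(app_name, payload, token):
    """POST the payload to the Machines API and return the decoded response."""
    req = urllib.request.Request(
        f"{API_BASE}/apps/{app_name}/machines",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


# Hypothetical usage (app and image names are placeholders):
# payload = build_machine_request("registry.fly.io/xml-job:latest")
# machine = create_machine("xml-job", payload, token="<your Fly API token>")
```

Since the original poster is already running Dagster, a call like this could presumably live inside a monthly scheduled job there instead of a separate scheduler.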
AWS also has shared, or “burstable”, CPUs. I quickly looked up per-hour pricing for c5/m5 dedicated-CPU instances and they seem comparable to our performance CPU pricing.
… although I would hope that one chooses Fly because it’s a better fit for their workload rather than because it’s cheaper than $competitor :)
Yes. Following @lillian’s advice, I’d build your pipeline as a Docker container so that when your process ends, the container dies naturally. You could trigger it from the scheduler in GitHub Actions (or other hosted CI).
So there is a little config to write here, but not much.
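To make the "process ends, container dies" idea concrete, here is a minimal sketch of what the container's entrypoint script might look like. It streams the XML so the whole file never has to fit in memory, writes Parquet in batches, and then simply returns. Everything specific here is an assumption on my part: the record tag, the flat one-level record layout, the batch size, and the use of pyarrow (a third-party package) for the Parquet side.

```python
import xml.etree.ElementTree as ET


def iter_records(source, record_tag):
    """Yield one dict per <record_tag> element, streaming through the file.

    Assumes flat records whose children hold plain text values.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == record_tag:
            yield {child.tag: child.text for child in elem}
            # Drop the element's children and text so memory use stays low.
            elem.clear()


def batched(records, size):
    """Group an iterable of records into lists of at most `size` items."""
    batch = []
    for record in records:
        batch.append(record)
        if len(batch) >= size:
            yield batch
            batch = []
    if batch:
        yield batch


def write_parquet(source, out_path, record_tag, batch_size=50_000):
    """Stream XML records into one Parquet file, one row group per batch."""
    # Imported lazily so the parsing helpers work without pyarrow installed.
    import pyarrow as pa
    import pyarrow.parquet as pq

    writer = None
    schema = None
    try:
        for batch in batched(iter_records(source, record_tag), batch_size):
            # The first batch infers the schema; later batches reuse it.
            table = pa.Table.from_pylist(batch, schema=schema)
            if writer is None:
                schema = table.schema
                writer = pq.ParquetWriter(out_path, schema)
            writer.write_table(table)
    finally:
        if writer is not None:
            writer.close()


# Hypothetical entrypoint call (paths and tag name are placeholders):
# write_parquet("/data/input.xml", "/data/output.parquet", "row")
```

When the script returns, the process exits, the container stops, and with auto_destroy set the machine goes away with it.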