GPU reliability improvements and better integration testing

Hey folks!

Have you been hitting occasional weird issues with GPU deployments? I sure have. Sometimes they’d hang for 5 minutes, time out, then mysteriously work on the second try. Sometimes I’d bring up a GPU image locally, see that my entrypoint failed, and wonder why flyctl didn’t catch it before the timeout. It’s been damned mysterious, and I can’t speak for you, but I prefer to keep the mystery away from my hosting, thank you very much.

Now, let me tell you about the magic of integration tests. Our trusty suite of tests, powered by flyctl, has been diligently hammering away at our production environment, making sure everything runs smoothly, and we have lots of those tests for flaps and flyd, our codenames for the daemons that power the Machines API and the platform. But there was a bit of a blind spot: our GPU machines weren't getting the same end-to-end love as their Firecracker-powered CPU counterparts, particularly where interactions with flyctl were involved. So I rolled up my sleeves and spent the past week tweaking those tests to give our GPU workers the attention they deserve. And wow, did it pay off big time!
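If you're curious what "powered by flyctl" means in practice, here's a minimal sketch of the pattern (not our actual test suite): a Go test that shells out to the real flyctl binary against a dedicated test app, so it exercises the same path a user would. The app name is made up for illustration.

```go
package integration

import (
	"os/exec"
	"testing"
)

// A minimal sketch of the flyctl-driven testing pattern: shell out to the
// real CLI against a dedicated test app, so the test exercises the same
// code path a user would. "gpu-integration-test" is a made-up app name.
func TestMachinesListSmoke(t *testing.T) {
	out, err := exec.Command("flyctl", "machines", "list",
		"--app", "gpu-integration-test").CombinedOutput()
	if err != nil {
		t.Fatalf("flyctl machines list failed: %v\n%s", err, out)
	}
	t.Logf("machines in test app:\n%s", out)
}
```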

In a bit of a facepalm moment, I discovered that we were ignoring exit codes for our Cloud Hypervisor VMs. That's also why a failed entrypoint could look like a hang: with the exit code ignored, flyctl had no signal that the machine had already died, so it just waited out the timeout. Given that we're all about efficiency and automatically shut down unused GPU machines without waiting for them to crash or exit on their own, it kinda makes sense how this slipped through the cracks. Luckily, that's all fixed now.
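To make that concrete, here's a rough sketch of the kind of check that catches this class of bug. It isn't our actual test code, and the app name, the GPU flag value, and the exact way flyctl reports the failure are assumptions on my part, but the shape is the idea: start a Machine whose entrypoint fails immediately, and insist that flyctl surfaces the failure instead of hanging until a timeout.

```go
package integration

import (
	"context"
	"os/exec"
	"testing"
	"time"
)

// A sketch of the kind of end-to-end check described above (not our real
// test): boot a Machine whose entrypoint exits non-zero and require flyctl
// to report the failure well before the multi-minute hang I used to see.
func TestFailedEntrypointIsReportedQuickly(t *testing.T) {
	// Hard deadline well under the old "hang for 5 minutes" behavior.
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Minute)
	defer cancel()

	cmd := exec.CommandContext(ctx, "flyctl", "machine", "run", "alpine",
		"--app", "gpu-integration-test", // assumed pre-created test app
		"--vm-gpu-kind", "a100-pcie-40gb", // assumed flag/value for picking a GPU size
		"false") // an entrypoint that exits non-zero on purpose
	out, err := cmd.CombinedOutput()

	if ctx.Err() == context.DeadlineExceeded {
		t.Fatalf("flyctl hung past the deadline; the exit code was probably swallowed:\n%s", out)
	}
	if err == nil {
		t.Errorf("expected flyctl to surface the failed entrypoint, but it reported success:\n%s", out)
	}
}
```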

But the hunt for glitches isn’t over yet. I’m still tracking down a few more hiccups, so expect even more fixes and reliability boosts for our GPU setup in the coming weeks! Stay tuned for more updates.
