Sprites become unusable when agents checkpoint

Checkpointing seems to be buggy. Many editing sessions end like this:

● Bash(git diff --stat)
⎿ apps/bosco/src/app.tea.rs | 18 ++++++----
apps/bosco/src/orchestrator_provision.tea.rs | 52 +++-------------------------
apps/bosco/src/tf_provider.tea.rs | 35 ++++++++++++++++---
3 files changed, 48 insertions(+), 57 deletions(-)

● Let me checkpoint this progress and rebuild:

At this point the console freezes. All sprite commands against this sprite, like list and checkpoints, hang or return 502/503 errors.

Claude decides to checkpoint and the console drops. If I’m lucky, the sprite is restored with no data. If I’m unlucky, I have to destroy the sprite and wait half a day to use sprites again. Either way I lose data.

This entire exercise is easy to replicate after about 45 minutes of intensive editing and compiling. What can I do to help the Sprite team make this service awesome?

FWIW, I went into Claude and removed the auto-checkpoint hooks, and now my sprite has been able to operate without crashing. I’m wondering if my sprite’s data writes are causing issues with the sprite’s volume snapshots? FYI, the activity I’m doing is lots of Rust compiling, building OCI images, etc. I rely on a bunch of caching.

And we’re locked out again..

Following up on this: I found a memory leak in the process I was building, which was leading to the sprite crashing. The main issue I’m having is that when a sprite gets to this state, there’s no way to restart it without destroying it, as all the APIs hang.

it’s real bad. thanks for the tip, @srobertson

i can’t offer much in the way of recovery, besides the obvious suggestion of going back to the most recent checkpoint.

do you get a reply when running `sprite session list`? from here you might be able to close previous sessions, and get a clean console session.
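Because these commands tend to hang rather than fail, it may help to probe them with a hard timeout so a script or agent hook does not block forever. A minimal Python sketch, assuming the `sprite` CLI is on PATH; `run_with_timeout` is a hypothetical helper, and `sprite session list` is the only command taken from this thread:

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s=10.0):
    """Run cmd; return its stdout, or None if it hangs, fails, or is missing."""
    try:
        result = subprocess.run(
            cmd,
            capture_output=True,
            text=True,
            timeout=timeout_s,  # the child is killed if this expires
        )
    except subprocess.TimeoutExpired:
        return None  # the command hung
    except FileNotFoundError:
        return None  # the binary is not installed / not on PATH
    if result.returncode != 0:
        return None  # the command answered, but with an error
    return result.stdout


if __name__ == "__main__":
    # Assumption: `sprite` is installed; the 10 s budget is arbitrary.
    out = run_with_timeout(["sprite", "session", "list"], timeout_s=10.0)
    if out is None:
        print("sprite did not answer in time (or errored)", file=sys.stderr)
    else:
        print(out)
```

A `None` result here only says the sprite is unresponsive from the client side; it does not distinguish a hung API from a 502/503.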

in the future, or at least until you can fix the memory leak, consider using earlyoom or similar. this will kill the runaway process before it causes the system to become unstable.
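To illustrate the idea behind an early OOM killer (this is not earlyoom's actual implementation), here is a rough Python sketch of the check such a tool performs: read `/proc/meminfo` and flag memory pressure once available RAM drops below a threshold. The function names and the 10% threshold are assumptions for illustration, and the sketch ignores swap and does not kill anything:

```python
from pathlib import Path


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: integer value}."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def available_percent(fields):
    """Share of RAM the kernel estimates is still available, in percent."""
    return 100.0 * fields["MemAvailable"] / fields["MemTotal"]


def under_pressure(fields, min_available_percent=10.0):
    """True when available memory is below the (assumed) threshold."""
    return available_percent(fields) < min_available_percent


if __name__ == "__main__":
    meminfo = Path("/proc/meminfo")  # Linux only
    if meminfo.exists():
        info = parse_meminfo(meminfo.read_text())
        print(f"{available_percent(info):.1f}% available, "
              f"under pressure: {under_pressure(info)}")
    else:
        print("/proc/meminfo not found; this sketch only works on Linux")
```

A real tool would run this check in a loop and terminate the largest offender; for a leaking build process, running earlyoom itself is likely the simpler option.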

The issue is that when you get to this state, you cannot list or restore checkpoints. Destroy is the only option.

perhaps you can use these ideas when you rebuild the sprite?

No replies from session list, console, or checkpoint. Destroy is the only option when it gets to that state. Ironically, the service I’m building makes spinning up a sprite dead simple and replicable. It’s the data loss that’s a concern. So I’m mostly sharing so the Fly team is aware and perhaps gives us some sort of harder reset operation that gives us a chance to recover a sprite at the latest checkpoint. Right now I can’t recommend using sprites for consequential use.

Thanks for the tip, pr1. I’ll look into earlyoom; that sounds like a good stop-gap measure.

As a small tip, opting the thread into the Questions / Help category improves the odds of that significantly:

https://community.fly.io/t/your-posts-are-now-getting-sent-directly-to-our-desks/27767

(I’m not sure whether their automation still works with categories that have been set retroactively, though…)

I don’t work for Fly, so not sure why you’re dumping this on my thread about a memory/freezing issue.

And I don’t take kindly to someone barging into a public forum and trashing a service they clearly didn’t bother to understand.

Fly only charges for what you use. I run hundreds of services and spend like $12/month. If you run things, you pay. Pretty simple. Most cloud services don’t offer this, and I for one am very appreciative.

There’s literally a “delete organization” button sitting right in settings.

You managed to set it up, but couldn’t take 30 seconds to figure out how to turn it off? Didn’t check docs, didn’t ask anything—just went straight to blasting it publicly. I’d hate to be around you when your shoes come untied.

Take a selfie to the bank and tell them “this guy is incompetent and is costing me money”

It should work retroactively! I’ll edit the main post to include that info.