Phoenix.new Feedback

I want to start by saying that I love the idea of phoenix.new and I think it works very well at creating working mock-ups quickly. It obviously understands LiveView and Presence and other Phoenix-framework-specific concepts better than other models/tools. But what I’m getting for $20/mo feels severely underwhelming compared to what most other AI tools offer for similar price points.

For context, I’ve been developing software across a variety of industries and programming languages for almost 20 years. I’ve worked at AI companies, and I’ve worked at companies that were very AI-positive (i.e. that encouraged and supported developers in using AI tooling). I’ve also been using Claude Code for a while now, with surprisingly good results even within Elixir/Phoenix projects.

The following are some pain points that, as a whole, really stop me from considering phoenix.new as a long-term AI coding tool. I’ll typically use Claude Code as a comparison as that’s probably the best personal experience I’ve recently had for this kind of tooling.

Rewrites vs Edits

If I make changes to a file, phoenix.new seems to always rewrite the entire file when it needs to modify it, rather than editing it in place. This means my changes frequently get erased and overwritten.

If someone is using phoenix.new solely to vibe code from start to finish, that’s not a problem.

If someone is using phoenix.new just to set up a new project and get some working bits and bobs before taking it over manually, that’s probably not a big deal.

If someone is trying to work with phoenix.new to shape the direction of the code and pick up the slack when it fails to materialize a proper solution, then having it throw out those manual changes is a huge issue.

Bad at Validation

I’ve seen some other complaints about this and I want to reiterate them: when phoenix.new validates web components, it frequently fails to do so correctly. In what feels like half of the cases, a validation task will simply hang.

Whatever the cause may be, it appears as though phoenix.new prefers to use command-line tools (like a headless browser) to fetch a web page, inspect the contents, and then interact with it. That’s nice for some things, but why isn’t it generating test cases in test/? Or why isn’t it using Elixir scripts (.exs files) to validate things?

Without any kind of prompting or guidance, even Claude Code defaulted to creating temporary .exs scripts to validate that functions and features were behaving as expected. It even created test cases in test/ to ensure that web components were showing the appropriate elements and that interactive elements were triggering the appropriate events. How is phoenix.new not doing that by default?
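For illustration, this is roughly the kind of test I’d expect to see generated by default. It’s only a sketch: the MyAppWeb.TodoLive module, the /todos route, and the button label are hypothetical stand-ins for whatever the app actually contains.

```elixir
# test/my_app_web/live/todo_live_test.exs (hypothetical example)
defmodule MyAppWeb.TodoLiveTest do
  use MyAppWeb.ConnCase, async: true

  import Phoenix.LiveViewTest

  test "renders the page and reacts to an event", %{conn: conn} do
    # Mount the LiveView and confirm the expected elements are present.
    {:ok, view, html} = live(conn, ~p"/todos")
    assert html =~ "Todos"

    # Interactive elements should trigger the appropriate events, and the
    # re-rendered markup should reflect the change.
    assert view
           |> element("button", "Add todo")
           |> render_click() =~ "New todo"
  end
end
```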

I’ve also had some frustrating cases where phoenix.new will run what it believes is a good validation of something I’ve told it is incorrect, then tell me everything is working as expected now, when it’s wrong. The validation it wrote had a flaw, usually in the premise of what I said the issue was, and that just wastes credits and time. Claude Code very rarely gets the premise of an issue wrong; the problem it usually has is that it will either (a) fail to identify why there’s an issue or (b) get into a fix-debug loop where it literally can’t solve the issue and will stay stuck trying indefinitely if I don’t step in.

But I prefer that to what I frequently experience with validations made by phoenix.new.

Lower Quality, Less Output

In my experience, phoenix.new starts off way better than any competitor. With most AI tools, it’s best to start with at least an “empty” project. With any greenfield project I might use Claude Code on, I’ll always do some initial setup first (e.g. mix phx.new) and then build context around that before engaging with the agent. But with phoenix.new, it’ll do all of that for you. It’ll run other mix commands too, like mix phx.gen.auth or mix phx.gen.live or whatever.

I also love that it starts by generating a plan that it uses to direct itself on what actions it needs to take to complete an MVP (minimum viable product). And by the end of the very first prompt response you will have a working Phoenix app. That’s incredible. You can put in very little effort overall: just describe the kind of application it should be and, boom, here’s something you can immediately start working with.

But after maybe 30 minutes, you can start to see where the quality drops off. And as the quality drops off, you start realizing how little output you are actually receiving for the cost.

In my case, it usually fills the initial app with a lot of static data. This is great for a mock-up, but phoenix.new is a software entity that can do way more than I can in a shorter amount of time. It can easily fill up a database with records to allow displaying live data. There’s no need to resort to static data that breaks functionality when you try to have meaningful interactions with the application.

So eventually I end up needing to direct phoenix.new to remove the static data and replace it with real data in the database. And while I could maybe get around this by explicitly requesting that it not use static data in my initial prompt, that’s extra work on my end that I don’t need to worry about with Claude Code or other tools I’ve tried so far. Claude Code frequently creates data in the database, sometimes even just temporarily, when it wants to validate something or prepare a particular view for further testing and development.
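For example, rather than hard-coding rows into templates, the agent could seed the database just as quickly. A rough sketch, assuming a hypothetical Catalog context with a create_product/1 function (both names made up for illustration):

```elixir
# priv/repo/seeds.exs -- run with `mix run priv/repo/seeds.exs`
# Seed a handful of realistic records so LiveViews can render live data
# instead of static placeholders.
alias MyApp.Catalog

for {name, price_cents} <- [{"Widget", 999}, {"Gadget", 2_499}, {"Gizmo", 4_999}] do
  {:ok, _product} = Catalog.create_product(%{name: name, price_cents: price_cents})
end
```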

In short, the longer I spend with phoenix.new, the lower the quality of the output seems to be, and therefore the less output I feel as though I’m getting for the cost. Add in frustrations like the above Rewrites vs Edits, where phoenix.new will typically nuke my own changes and ignore the direction I wanted for the application, and it’s not hard to see how one could spend a large amount of time and money just trying to get the little details correct.

Significantly Higher Cost

Although $20/mo is a relatively low expense for a tool of this kind, it’s also about how much you might pay Cursor for access to their custom LLM, or Anthropic for a Claude Pro plan, which gets you access to Claude Code in your terminal. And I get way more output, usually of higher quality as well, from those tools.

With Claude Code specifically, you have a rate limit that resets every 5 hours. Even just on the Pro plan (at $20/mo), I’ve never hit the limit once. But if I did, I wouldn’t have to wait days or weeks to get back into my workflow; I’d have to wait a few hours at most.

With phoenix.new, I’m hyper-aware that I’m running out of credits. Not only because the credits reset monthly and running out of them has a larger impact on my workflow, but because the quality of the output forces me to engage more often in ways I wish I didn’t have to. I’m spending more money than I feel I should be just trying to course-correct phoenix.new onto a path that other tools don’t seem to have nearly as much trouble with.

TL;DR

Overall I loved the initial experience with phoenix.new, but it really started to fall off a cliff after about an hour or so. I think there’s a ton of potential here, but I’m also sad to see things like the lack of .exs scripts for testing or data generation, or the lack of tests created in test/ without being explicitly told to. I really don’t like how files get overwritten and my personal work is thrown out every time it happens, effectively taking one step forward and two steps back. And when all the various issues are taken together, it’s really hard to see a good value proposition over something like Claude Pro + Claude Code.

But I do think the potential is there, and I’m hoping it’ll see a lot of love in the coming days as it improves its capabilities.


I also liked the idea a lot, but I ended up just losing time and money.
First, the agent got stuck writing a session_live.ex file. No matter what I did, it kept rewriting the module by appending its definition at the end of the file. This resulted in a 5,000-line file for a simple LiveView. That alone cost $10, and it kept repeating the same mistake no matter what I prompted or which changes I made to the file itself in the editor.

Then, with $0 in credits left, I was unable to clone the repository because of a remote error:

Cloning into 'agent_observer'...
fatal: the remote end hung up unexpectedly

Not only is it a waste of $20, but it’s also a huge waste of time given that I can’t even get the code onto my machine.
Really disappointed.

Sorry to hear about your less-than-stellar experience. We have seen some corrupted file system issues that can cause beamfile errors, which the agent will repeatedly attempt to fix because every compile that should work gives a beamfile error. I’m not sure if this is what happened in your case, but I’ve thrown some credits your way if you want to give things another go.

Hey @the-eon, thanks for the candid feedback. I also threw some credits your way if you want to try out a feature that just shipped, which will help with some of the behavior you want:

Testing/validation:
We now support a root /workspace/AGENTS.md, where you can specify guidelines you want the agent to follow. For example, we don’t instruct the agent to write test files unless the user asks for it. So if you prefer less web “testing” in favor of test files, you can either ask explicitly, or now with AGENTS.md, just specify that you always want test files and that it should avoid calling the web browser to interact with and try out the app.

Rewrites vs Edits:
The agent has a surgical edit tool that it can choose for minor edits, but it’s currently a balance between a successful surgical modification and a bad edit that requires a rewrite anyway. We’ve tuned it to a happy balance at the moment, but if you’d like to have the agent lean on the surgical modify front, your AGENTS.md can say something like “Prefer surgical modify for any file edits that can reasonably be accomplished with standard sed operations or similar.” It may take some trial and error.
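For example, an AGENTS.md combining both of those guidelines might look something like this (just a sketch of the idea; the exact wording is up to you):

```markdown
# /workspace/AGENTS.md (example guidelines)

- Always write ExUnit tests under test/ for new features and bug fixes;
  avoid using the headless browser to try out the app unless asked.
- Prefer surgical modify for any file edits that can reasonably be
  accomplished with standard sed operations or similar.
```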

Strong start, tapered off quality:

In my experience, phoenix.new starts off way better than any competitor.

Glad to hear we’re giving you a strong start! At the moment we’ve only focused on and optimized for the starting point. Once you’re ready to iterate on an established (or existing) app, the agent takes a lot more hand-holding and direction to get the more measured behavior that you want. AGENTS.md can help here, but really we need to focus our internal guidelines for these different modes. Currently it’s full vibe mode, but we’re working on a pair mode to make this continued-development story much better. So stay tuned for updates in this regard. Thanks!

–Chris


Thanks a lot for the quick reply and the credits. It was really frustrating to see it stuck appending the same module over and over again while the credits were sinking :sweat_smile:.

I think there’s a lot of potential there, but so far it’s too expensive compared to Claude Code, and I ended up having to use an SSH key to push the repository manually to my GitHub, which worked well. Maybe an integration with GitHub would prevent this sort of issue.

Oh wow, thanks for jumping into the conversation so quickly @chrismccord. Yeah, I think the main takeaway, if you’re contemplating what would bring the most value to phoenix.new, is the ability to support more “pair programming” style workflows. I haven’t tried using AGENTS.md to ask it to use surgical modifications instead of the current “hammer solves everything” approach, so I’ll try that today and see what happens.

I think if you distilled my feedback down to its primary issue, it has two parts:

  1. For a purely vibe coding tool, it consumes a lot of credits and progressively gets “more expensive” the further along you get, i.e. you get less bang for your buck as it stops making these huge leaps in progress for relatively few credits.
  2. When moving beyond pure vibe-coding, it has to be a lot better at working with the user to create solutions. Even if it’s going to completely rewrite a file, it should at least read in any recent edits made to understand what the user has done and try to use that as additional context when rewriting the file.

Thanks again for the quick feedback, I’m looking forward to seeing how surgical edits work out and I’ll be keeping an eye on future updates to the tool!