I Built The Same App with Every LLM
The Signal
A DIY benchmark comparing top AI coding agents reveals a divergence between model speed and final build quality. By tasking models with building a Mario Kart-style game under constrained settings, the creator found that models which quickly emit large volumes of code often fail to produce a functional result. While Claude won the overall verdict for build quality, it was significantly slower than alternatives like GPT 5.5, which delivered the clearest usable first-pass build in less time.
The Case
- Claude delivered the creator's top-ranked overall result but took 23 minutes and 18 seconds to finish, including roughly seven minutes of waiting before it wrote any code.
- GPT 5.5 emerged as the clearest usable build during the first pass, offering playable movement and collision despite failing to incorporate any of the provided project assets.
- Opus 4.8 failed completely on the first prompt, producing a blank green screen, yet it became the most feature-complete result after one vague follow-up prompt.
- Kimi 2.5 proved remarkably fast and prolific in terms of code volume, yet it remained largely unresponsive to user input, rendering the massive output functionally worthless.
- Grok was presented as the worst value proposition, taking 21 minutes to generate a broken game that allowed the player to drive through walls.
- The creator admits the benchmark is intentionally constrained and non-scientific, explicitly stating the results apply only to this specific task rather than representing universal model capabilities.
The 1 Minute Signal Take
This benchmark is a useful reality check on AI coding agents: high output volume and fast runtime often say little about whether the resulting build actually works. The creator’s final verdict favors the slow, meticulous build over the quick-and-dirty approach, but the disparity in asset handling suggests these models are still brittle when dealing with specific project requirements. Skip the video if you only need the ranking; watch it if you want to see the specific failure modes, such as the 'green screen' initialization errors, that separate a truly functional build from a speed-optimized one.
Time saved:
