
Instant LLM Code Edits: 'Fast Apply' & Speculative Edits Outpace GPT-4o
PlayWithAI
Mia: If you're a developer who uses AI, you know the feeling. You ask your coding assistant for what seems like a simple change—refactor a function, maybe add a new parameter. And what you get back is… close. But it also deleted all your carefully written comments, or decided to reformat the entire file for no reason. It’s slow, it’s a bit lazy, and frankly, it often breaks your flow more than it helps.
Mia: This isn't just a minor annoyance. For complex edits, it can turn a five-minute task into a thirty-minute debugging session. Well, it turns out this is a fundamental problem with even the most powerful frontier models like GPT-4. But a new, more specialized approach is emerging, one that focuses on doing one thing, and one thing only, exceptionally well: editing your code with speed and precision.
Mia: The big, general-purpose AI models we use every day, like GPT-4, often struggle with large-scale code edits. They can be surprisingly inaccurate, have high latency, and exhibit this strange habit of making unrelated changes. This is a huge bottleneck for programmers, especially when you're trying to use AI agents that need to make multiple changes to get a task done. Each call to the model is slow and potentially error-prone. A company called Cursor has been working on this problem, and they've developed a specialized model and a unique method they call fast apply to tackle it head-on. Their new approach is hitting speeds of around 1000 tokens per second on a massive 70 billion parameter model. That’s a huge leap forward, and it’s pushing what we thought was possible in terms of both accuracy and speed.
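The core idea behind "fast apply" is to have the model rewrite the entire file with the edit applied, rather than emit a fragile diff. Here is a minimal sketch of what such a prompt might look like; the function name and prompt wording are illustrative assumptions, not Cursor's actual API or prompt:

```python
# Hypothetical sketch of a "fast apply"-style prompt: instead of asking the
# model for a diff, ask it to rewrite the ENTIRE file with the edit applied.
# The wording and structure here are illustrative, not Cursor's real prompt.

def build_fast_apply_prompt(file_contents: str, edit_instruction: str) -> str:
    """Assemble a full-file rewrite prompt for an edit model."""
    return (
        "Rewrite the following file with the requested change applied.\n"
        "Output the complete updated file and nothing else.\n\n"
        f"## File\n{file_contents}\n\n"
        f"## Requested change\n{edit_instruction}\n"
    )

prompt = build_fast_apply_prompt(
    "def add(a, b):\n    return a + b\n",
    "Add a docstring explaining what the function does.",
)
print(prompt)
```

Rewriting the whole file trades more output tokens for reliability—the model never has to reason about line numbers or patch formats—which is exactly why raw generation speed becomes the bottleneck worth attacking.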
Mia: So, what’s really going on here? The core issue is that these general-purpose models weren't specifically built for the intricate task of full-file code editing. That high latency isn't just a number on a screen; it's the thing that shatters your concentration and pulls you out of the zone. The inaccuracies mean you're spending more time fixing the AI's mistakes than writing new code. By training a specialized model and optimizing the process with new techniques, Cursor is making a strategic shift. They’re moving away from offering a general AI assistant to providing a highly optimized, task-specific tool that delivers a measurable boost to developer productivity. It's about creating a scalpel, not just a smarter Swiss Army knife.
Mia: To really understand how effective this new approach is, it's crucial to look at how these models are actually tested and what's driving their performance.
Mia: The evaluation process itself is pretty clever. You create a dataset of real-world, full-file code edits, and then you use another powerful model, in this case Claude-3 Opus, to act as the judge. The results from this process were fascinating. While models like GPT-4 Turbo and GPT-4o did okay, the Claude models, especially Claude-3 Sonnet, were surprisingly strong performers. And here’s a key reason why: GPT-4 has a tendency to be a little too helpful. It often tries to fix or clean up parts of the code that you never asked it to touch, like deleting commented-out lines. In a strict evaluation, that’s an error. The research also highlighted a custom algorithm called speculative edits, which managed to boost the processing speed by up to nine times compared to the standard method.
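The evaluation loop described above can be sketched in a few lines. This is a minimal, hypothetical version of LLM-as-judge grading: `generate` and `call_judge` stand in for calls to the edit model and to Claude-3 Opus, and the rubric wording and 1–5 scale are assumptions, not the exact prompt Cursor used:

```python
# Minimal LLM-as-judge sketch. `generate` and `call_judge` are placeholders
# for real model calls; the rubric and scoring scale are illustrative.

def build_judge_prompt(original: str, instruction: str, edited: str) -> str:
    """Ask the judge model to grade one edit, penalizing unrequested changes."""
    return (
        "You are grading a code edit. Score it from 1 to 5 for correctness,\n"
        "and penalize ANY unrequested change (e.g. deleted comments).\n\n"
        f"Original file:\n{original}\n\n"
        f"Instruction:\n{instruction}\n\n"
        f"Edited file:\n{edited}\n\n"
        "Reply with only the integer score."
    )

def score_edits(examples, generate, call_judge):
    """Average judge score over a dataset of (file, instruction) pairs."""
    scores = []
    for original, instruction in examples:
        edited = generate(original, instruction)
        reply = call_judge(build_judge_prompt(original, instruction, edited))
        scores.append(int(reply.strip()))
    return sum(scores) / len(scores)
```

Note how the rubric explicitly penalizes unrequested changes—that is precisely the behavior that cost GPT-4 points in the evaluation described above.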
Mia: This tells us a couple of important things. First, the evaluation itself highlights why task-specific benchmarks are so important. It’s not just about getting the right answer; it’s about understanding the nuances of how these models interpret instructions. The fact that Claude models performed better suggests that something in their training makes them better suited for longer, more precise tasks without adding extra, unrequested changes. Second, the success of speculative edits is a huge insight. It proves that you can get massive performance gains not just from bigger models, but from pure algorithmic innovation. It’s about working smarter, not just harder.
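The intuition behind speculative edits is that an edited file mostly matches the original, so the original text itself can serve as the "draft" in a speculative-decoding loop: the model only needs to verify long stretches in bulk, falling back to slow token-by-token generation where the edit diverges. The simulation below is a toy sketch of that idea, assuming the edit preserves token positions (in-place substitutions); real systems need smarter resynchronization after insertions or deletions, and `target_tokens` stands in for what the model would actually produce:

```python
# Toy simulation of speculative edits. The draft is the ORIGINAL file;
# one "model call" verifies a whole chunk, and mismatches fall back to
# single-token generation. Assumes position-preserving edits for simplicity.

def speculative_apply(original_tokens, target_tokens, chunk=8):
    """Return the edited sequence and how many simulated model calls it took."""
    out = []
    i = 0
    model_calls = 0
    while i < len(target_tokens):
        # Speculate: propose the next chunk of the original file as a draft.
        draft = original_tokens[i:i + chunk]
        model_calls += 1  # one verification pass over the whole chunk
        accepted = 0
        for d, t in zip(draft, target_tokens[i:]):
            if d != t:
                break
            accepted += 1
        out.extend(target_tokens[i:i + accepted])
        i += accepted
        if i < len(target_tokens) and accepted < chunk:
            # Mismatch (or short draft): generate one token the slow way.
            out.append(target_tokens[i])
            i += 1
    return out, model_calls

original = "the quick brown fox jumps over the lazy dog".split()
edited = "the quick red fox jumps over the sleepy dog".split()
result, calls = speculative_apply(original, edited)
print(result == edited, calls, len(edited))
```

A naive decoder would pay one model call per output token; here long unchanged runs are accepted in single verification passes, which is where the large speedups come from.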
Mia: While these new techniques are promising, achieving truly superior performance required going a step further: training custom models from the ground up.
Mia: To get that next level of performance, Cursor actually trained their own custom models, building on top of powerful open-source families like DeepSeek Coder and Llama 3. Their training process was fed by a mix of synthetic data and, crucially, real-world examples from their own users. The result? Their fine-tuned models, especially one based on Llama-3-70B, performed as well as, or in some cases even better than, GPT-4 Turbo and GPT-4o. But the real breakthrough came when they combined their custom model with their custom algorithm. Applying speculative edits to their fine-tuned Llama-3 model created a massive speedup, making it four to five times faster than the next-fastest model.
Mia: This really highlights the power of creating tailored, specialized solutions in the world of AI. Instead of just relying on the big, general-purpose models, they invested in creating a system perfectly optimized for one job. The data pipeline, mixing synthetic and real-world inputs, was key to teaching the model the specific feel of code manipulation. And the speculative edits algorithm was like a turbocharger bolted onto that custom-built engine. It’s a perfect blueprint for how to achieve state-of-the-art results in very specific, high-value domains.
Mia: So, to wrap things up, here are the key points to remember from this deep dive into the future of AI code editing.
Mia: First, it’s clear that AI is evolving from a simple assistant into a tool that can handle complex, multi-step jobs like full-file code editing, with a laser focus on both accuracy and speed.
Mia: Second, specialized models and advanced inference techniques, like the speculative edits we talked about, are absolutely essential. They are the key to unlocking top-tier performance in specific fields like software development.
Mia: Third, building this kind of efficient AI isn't just about one thing. It's a combination of custom model training, a really smart data generation pipeline, and genuine algorithmic innovation.
Mia: And finally, the road ahead will likely involve teaching these models to handle even larger codebases, making them more efficient through techniques like knowledge distillation, and continuing to push for even greater accuracy. The era of the slow, clumsy AI coding partner is coming to an end.