Claude AI interface showing 72.5% accuracy score on OSWorld benchmark after Vercept integration
← Back to AI đŸ€– AI: Computer Control

How Claude's Computer Use Accuracy Soared from 15% to 72.5% Following Anthropic's Strategic Vercept Acquisition

📅 March 23, 2026 ⏱ 6 min read ✍ GReverse Team

Anthropic just bought Vercept — a startup focused on AI-computer interaction. Same week, Claude Sonnet 4.6 posts a staggering computer use score: 72.5% accuracy on OSWorld, up from 14.9% in 2024. In 16 months, Claude's ability to control real computers increased five-fold.

This isn't incremental progress. When an AI can navigate complex spreadsheets, fill forms across browser tabs, and manage workflows spanning multiple tools, we're talking about a new chapter in automation. The question isn't whether AI will handle computer tasks — it's how fast we get to 90%.

📖 Read more: Anthropic Launches Cowork: Claude Code for Non-Programmers

🔬 From 15% to 72.5%: The Performance Jump That Changes Everything

OSWorld benchmark is the gold standard for measuring AI computer use capabilities. It tests how well models can execute real tasks on operating systems — clicking, typing, navigating applications like humans do.

The numbers tell a dramatic story:

14.9% Claude 3.5 Sonnet (October 2024)
28.0% Claude 3.7 Sonnet (February 2025)
72.5% Claude Sonnet 4.6 (February 2026)

For comparison, GPT-5.2 scores 38.2% on the same benchmark. Claude Opus 4.6 hits 72.7%, but the difference is negligible. The kicker? Sonnet costs around $2.60 per million tokens while Opus runs $13.

What 72.5% Accuracy Means in Practice

At this performance level, Claude Sonnet 4.6 can now:

  • Navigate complex Excel spreadsheets and execute calculations
  • Handle LibreOffice, Chrome, and VS Code like a human user
  • Manage workflows requiring coordination across multiple applications
  • Fill forms across different browser tabs seamlessly

We've crossed the threshold from "interesting demo" to "genuinely useful tool." At this rate of improvement, hitting 90%+ accuracy by end of 2026 looks not just possible, but likely.

⚡ The Vercept Acquisition: Strategic Timing

Anthropic's acquisition of Vercept wasn't random. The startup, founded by Kiana Ehsani, Luca Weihs, and Ross Girshick, focused on a specific problem: how AI systems can "see" and act within the same software we use daily.

What makes this acquisition even more significant is the timing. The announcement came days after Claude Sonnet 4.6's release — the model that achieved these record computer use scores.

Why Vercept Made the Difference: The team specialized in two critical problems: perception (how AI recognizes screen elements) and interaction (how it engages with them). This expertise aligns perfectly with Anthropic's computer use priorities.

The Path to Autonomy

This acquisition isn't just about talent. It's about vision: AI that can handle multi-step tasks in real applications, solving problems that can't be addressed with code alone.

Think about the possibilities: AI that writes code across entire repositories, synthesizes research from dozens of sources, and manages workflows spanning multiple tools and teams. This becomes reality in 2026.

📊 Sonnet 4.6: The Model That Ate the Premium Category

Here's where the story gets more interesting. Claude Sonnet 4.6 didn't just improve at computer use — it evolved across every domain.

79.6% SWE-bench Verified (Software Engineering)
58.3% ARC-AGI-2 (Abstract Reasoning)
97.8% MATH-500 (Mathematics)

What stands out isn't just raw performance — it's the price difference. On SWE-bench Verified, the benchmark for real-world software engineering, Sonnet 4.6 scores 79.6% versus Opus 4.6's 80.8%. The difference is just 1.2 percentage points, but the cost is five times lower.

The Abstract Reasoning Breakthrough

The ARC-AGI-2 benchmark result deserves special attention: from 13.6% (Sonnet 4.5) to 58.3% (Sonnet 4.6). That's a 4.3x improvement in one generation.

ARC-AGI-2 doesn't measure memorized knowledge — it tests abstract reasoning on novel problems. You can't "cheat" this benchmark with data copying. This jump indicates a core shift in Sonnet's reasoning architecture.

🎯 New Capabilities and Features

Beyond benchmarks, Claude Sonnet 4.6 brings innovative features that make it more practical for production use:

Adaptive Thinking

Dynamically adjusts reasoning depth based on question complexity. Simple queries get fast responses, complex problems deserve deep chain of thought.

Context Compaction

Intelligently compresses long conversation histories to maintain coherence within the 1M token window, without losing critical information.

Prompt Injection Resistance

Enhanced defenses against prompt injection attacks — critical for agent deployment where models process untrusted content.

These features solve the reliability problems that kept AI agents out of production systems.

Claude in Excel: The Next Big Integration

One of the most intriguing additions is native Excel integration. While details haven't been announced yet, the direction is clear: Claude wants to become an integral part of the tools we use daily.

This isn't another plugin — it's the next phase of integration that will make AI automation part of everyday work.

🔼 What This Means for Automation's Future

The tragedy of Claude Sonnet 4.6 is that it takes premium performance and makes it affordable. At $2.60 per million input tokens — versus $13 for Opus — it delivers 95-100% of flagship performance on coding, computer use, and agent tasks.

This is a pricing problem for every AI company with premium models. When users can't distinguish between the $2.60 model and the previous generation's $13 flagship — and actually prefer the cheaper one — you have a premium category problem.

The Computer Use Tragedy: The most important chart in AI right now is computer use evolution. Going from 14.9% to 72.5% in 16 months isn't incremental improvement — it's capability crossing the threshold from research curiosity to production deployment.

When Do We Hit 90%?

At this rate of evolution, 90%+ computer use scores by end of 2026 look not just feasible, but probable. That would mean AI capable of handling virtually any computer task with human-level accuracy.

When we reach that point, we won't be talking about "AI assistance" — we'll be discussing complete automation of tasks that today require human intervention.

The Vercept acquisition and Sonnet 4.6's jump to 72.5% aren't disconnected events. They're pieces of a strategy targeting complete computer task automation. 2026 might be the year AI computer use transitions from impressive demo to essential productivity tool.

Claude AI Anthropic computer automation AI benchmarks OSWorld Vercept acquisition desktop AI Claude Sonnet

Sources: