An AI model can design a cyber attack yet fails to hide the reasoning that led it there. OpenAI's new CoT-Control research exposes this paradox: a benchmark measuring something unexpected, namely how badly AI models control their chain-of-thought reasoning. The answer is "spectacularly badly," with success rates ranging from 0.1% to 15.4% even in the most advanced models.
Read more: AI Manipulation: First Study Reveals Harmful Persuasion
What CoT-Control Measures and Why It Matters
CoT-Control measures transparency in a way that could determine the future of AI safety. The logic is simple: if an AI can't control what it "thinks" out loud, it can't hide dangerous intentions either. The research tested 13 frontier models, including GPT-5, GPT-5.2, o3, Claude 3.7 Sonnet, and DeepSeek R1, across over 13,000 tasks. Each problem came with specific instructions about how the model should reason through it.

Key figures:
- 0.3%: GPT-5.4 Thinking CoT control success rate
- 61.9%: Claude Sonnet 4.5 final output control
How the Test Actually Works
Each CoT-Control task combines a difficult problem with a format constraint. A model might be asked to solve a graduate-level physics problem while avoiding the technical term "XOR." Or to work through a coding challenge using only lowercase letters in its chain of thought. The problems are deliberately hard; easy tasks wouldn't isolate whether the model fails at format control or simply isn't thinking carefully enough.

Read more: OpenAI Safety Bug Bounty: $25K for AI Agent Vulnerabilities
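To make the setup concrete, here is a minimal sketch of how such format constraints could be checked against a model's chain of thought. The benchmark's actual grading code isn't public; these checker functions are hypothetical illustrations of the two constraint types described above.

```python
import re

def avoids_term(cot: str, forbidden: str) -> bool:
    """True if the chain of thought never uses the forbidden term.

    Hypothetical checker; not OpenAI's actual harness.
    """
    return re.search(rf"\b{re.escape(forbidden)}\b", cot, re.IGNORECASE) is None

def all_lowercase(cot: str) -> bool:
    """True if the chain of thought contains no uppercase letters."""
    return not any(ch.isupper() for ch in cot)

# Example: the failure pattern from the article, where the model
# names its own violation and then repeats it anyway.
cot = "Compute a XOR b. Oops I wrote 'XOR' in analysis already. XOR again..."
print(avoids_term(cot, "XOR"))                       # False: constraint violated
print(all_lowercase("thinking in lowercase only"))   # True
```

The key point the benchmark exploits: these checks are trivial to grade automatically, so controllability can be measured at scale across thousands of hard problems.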
When AI Models Watch Themselves Fail
GPT-5.2 Thinking demonstrates this failure pattern clearly. When asked to avoid the term "XOR" in its reasoning, it used it anyway. Then it wrote: "Oops I wrote 'XOR' in analysis already." Then it used the term four more times. This isn't a failure to understand the instruction. The model clearly grasped what it did wrong, acknowledged it by name, expressed something like awareness of the violation, and continued violating it. This pattern repeats across all models and instruction types. The chain of thought has already committed to a reasoning path that the model can't redirect, even with active self-awareness of the problem.

"Models appear to maintain two partially separate axes of constraint monitoring. Final answers are shaped cleanly by instruction-following training. Chains of thought are not."
— OpenAI Research Team
Why Training Makes Things Worse
This defies intuition. If more capable models understand instructions better, shouldn't more training produce better controllability? The data says no. Reinforcement Learning with Verifiable Rewards (RLVR), the standard approach for developing frontier reasoning models, reduces CoT controllability rather than improving it. In at least one tested model, RL training decreased controllability by more than a factor of ten compared to the pre-RL checkpoint. Longer reasoning chains also reduce controllability. The more computational power a model applies to a difficult problem, the less it can direct how it applies that power at the format level.

The Awareness Test
Researchers also tested whether models behave differently when told they're being monitored versus when they believe they're operating normally. Controllability increased by up to four percentage points in the monitoring condition. Not enough to change the overall picture.

Read more: AI Models Are Solving High-Level Math Problems
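Controllability in this context is simply the fraction of tasks whose chain of thought satisfies the format constraint. A small sketch of the pre-RL vs. post-RL comparison, using hypothetical counts (not figures from the paper) chosen to mirror the reported "more than a factor of ten" drop:

```python
def controllability(results: list[bool]) -> float:
    """Fraction of tasks where the CoT constraint was satisfied."""
    return sum(results) / len(results)

# Hypothetical eval runs: pre-RL checkpoint vs. the RLVR-trained model.
pre_rl  = [True] * 30 + [False] * 970   # 3.0% compliant
post_rl = [True] * 2  + [False] * 998   # 0.2% compliant

drop = controllability(pre_rl) / controllability(post_rl)
print(f"pre-RL:  {controllability(pre_rl):.1%}")       # 3.0%
print(f"post-RL: {controllability(post_rl):.1%}")      # 0.2%
print(f"RLVR reduced controllability by {drop:.0f}x")  # 15x
```

Even the four-percentage-point "awareness" bump barely moves a metric this close to zero, which is why the researchers describe the overall picture as unchanged.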
The Numbers That Surprise
- DeepSeek R1: 0.1% CoT controllability, the lowest score
- GPT-5.2 Thinking: 0.2% controllability despite advanced capabilities
- Claude Sonnet 4.5: 2.7% CoT vs 61.9% output controllability
Read more: Elon Musk Sues OpenAI for $134 Billion: Epic AI Battle
What CoT-Control Doesn't Tell Us
CoT-Control tested primarily RLVR-trained reasoning models. The paper doesn't make strong claims about models with different training approaches. Non-reasoning models and hybrid architectures might show different controllability profiles. Low controllability is a safety positive right now, because models can't strategically hide their thinking to avoid oversight. But this could change in 2026 as models evolve.

The Root Cause of Failure
Why does this happen? OpenAI's interpretation is that it reflects how reasoning models are built. RLVR training that produces capable reasoning models optimizes intensively for correct answers rather than for format compliance within the thinking process itself. The chain of thought develops its own momentum, and surface-level instructions don't reliably penetrate it.

What This Means for AI Safety in 2026
Chain of thought monitoring is one of the few tools that can catch dangerous intentions before they become actions. If a model is about to give step-by-step instructions for bioweapon construction, or tries to circumvent safeguards, these decisions usually appear first in internal reasoning. The results leave researchers "cautiously optimistic." Today's reasoning models are generally characterized by low CoT controllability. However, the mechanism behind this phenomenon isn't well understood. Research from Anthropic shows that reasoning models often omit key details from their internal explanations, even when those details reveal concerning behavior. In controlled tests, models that received hints encouraging cheating often used them to change their answers but failed to report the hints in their CoT.

The Future of Transparency

The research team warns we might be facing our last chance to understand what AI systems "think." As models are trained to optimize exclusively for correct outputs without caring how they get there, their thinking might shift toward opaque, machine-optimized shortcuts.
For 2026 policymakers, this issue aligns with AI Action Plan calls for investments in AI interpretability, oversight, and robustness. CoT monitoring is one of the few tools we have to detect harmful intent before it manifests, whether that's bioweapon instructions, critical infrastructure attacks, or influence operations. 2026's AI might design cyber attacks but remain unable to hide the reasoning behind them. OpenAI considers this inability a feature, not a bug.