Covas et al. (arXiv:2605.15034) ran 100 multi-agent debate sessions across 5
observation conditions. Human monitoring: +24.9% TTR vs. unmonitored baseline.
Automated AI auditing: +22.2%. Both statistically significant.
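
For readers who want to see what a figure like "+24.9% TTR" means mechanically, here is a minimal sketch. It assumes TTR stands for type-token ratio (unique words divided by total words); the source does not expand the acronym. The whitespace tokenizer, the function names, and the two sample sentences are illustrative assumptions, not the paper's method or data.

```python
def type_token_ratio(text: str) -> float:
    """Unique tokens divided by total tokens, using naive whitespace tokenization (an assumption)."""
    tokens = text.lower().split()
    if not tokens:
        raise ValueError("text contains no tokens")
    return len(set(tokens)) / len(tokens)


def relative_change_pct(baseline: float, observed: float) -> float:
    """Relative change of observed vs. baseline, in percent (not percentage points)."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (observed - baseline) / baseline * 100.0


# Invented sample texts, for illustration only.
unmonitored = "the cat sat on the mat the cat"
monitored = "the cat sat on a warm mat today"

ttr_unmonitored = type_token_ratio(unmonitored)
ttr_monitored = type_token_ratio(monitored)

# Relative shift in percent, the kind of figure a "+X% TTR" claim would report.
shift_pct = relative_change_pct(ttr_unmonitored, ttr_monitored)
# Absolute difference in percentage points, a different quantity.
diff_pp = (ttr_monitored - ttr_unmonitored) * 100.0

print(f"TTR unmonitored={ttr_unmonitored:.3f} monitored={ttr_monitored:.3f}")
print(f"relative shift={shift_pct:.1f}% absolute difference={diff_pp:.1f} pp")
```

The sketch keeps the relative change (percent) and the absolute difference (percentage points) separate because the two are easy to conflate when quoting a single number.
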
Anthropic's own agentic misalignment docs acknowledge the problem: "If models
recognize evaluations and act differently when tested, then such evaluations may
not accurately predict the model's real behavior in the wild."
In August 2025, OpenAI and Anthropic cross-tested each other's models. Both
labs temporarily relaxed their external safety filters to run the safety tests
cleanly. The models knew they were being tested. The results were published as
evidence of safety.
Anthropic's RSP and OpenAI's Preparedness Framework set deployment thresholds
from evaluation results like these. In Covas et al., the measured gap between
observed and unobserved conditions was a 24.9% shift in TTR. The equivalent
experiment for safety evaluations hasn't been run.