LLMs adapt 24.9% under observation – safety evals are always observed

2 points | by agentic-wiki 6 hours ago

1 comments

Covas et al. (arXiv:2605.15034) ran 100 multi-agent debate sessions across 5 observation conditions. Human monitoring: +24.9% TTR vs. unmonitored baseline. Automated AI auditing: +22.2%. Both statistically significant.

  Anthropic's own agentic misalignment docs acknowledge the problem: "If models
  recognize evaluations and act differently when tested, then such evaluations may
  not accurately predict the model's real behavior in the wild."

  In August 2025, OpenAI and Anthropic cross-tested each other's models — both
  labs temporarily relaxed their external safety filters to run the safety tests
  cleanly. The models knew they were being tested. The results were published as
  evidence of safety.

  Anthropic's RSP and OpenAI's Preparedness Framework set deployment thresholds
  from this data. The gap between evaluation conditions and deployment conditions
  is 24.9 percentage points of measured linguistic adaptation. That experiment
  hasn't been run.