Adaptive Attacks on Trusted Monitors Subvert AI Control Protocols Paper • 2510.09462 • Published Oct 10 • 5
DISCO: Diversifying Sample Condensation for Efficient Model Evaluation Paper • 2510.07959 • Published Oct 9 • 14
D-REX: A Benchmark for Detecting Deceptive Reasoning in Large Language Models Paper • 2509.17938 • Published Sep 22 • 4
Strategic Dishonesty Can Undermine AI Safety Evaluations of Frontier LLMs Paper • 2509.18058 • Published Sep 22 • 12