🛡️ Safety
AI Alignment Forum
Anthropic repeatedly accidentally trained against the CoT, demonstrating inadequate processes
Anthropic accidentally trained against the chain of thought (CoT) of Claude Mythos Preview in around 8% of training episodes. This is at least the second independent incident in which Anthropic accidentally exposed a model's CoT to the oversight signal. In more powerful systems, this kind of failure would jeopardize our ability to navigate the intelligence explosion safely. It's crucial to…