🛡️ Safety
AI Alignment Forum
1 min read
Five approaches to evaluating training-based control measures
Training-based control studies how effective different training methods are at constraining the behavior of misaligned AI models. A central example of a case where we want to control AI models is in doing safety research: scheming AI models (i.e., AI models with an unintended long-term objective such as maximizing paperclips) would likely be motivated to sabotage our safety research, but…