Alignment AI News & Research

🚀 News · TechCrunch AI · 2 min read

Anthropic says ‘evil’ portrayals of AI were responsible for Claude’s blackmail attempts

Fictional portrayals of artificial intelligence can have a real effect on AI models, according to Anthropic.

#anthropic #claude #alignment

🕐 2 days ago


🍎 AI Labs · Apple ML Research · 1 min read

RVPO: Risk-Sensitive Alignment via Variance Regularization

Current critic-less RLHF methods aggregate multi-objective rewards via an arithmetic mean, leaving them vulnerable to constraint neglect: high-magnitude success in one objective can numerically offset critical failures in others (e.g.,…

#rlhf #alignment #reward-modeling

🕐 5 days ago


☁️ AI Labs · AWS Machine Learning Blog · 15 min read

Reinforcement fine-tuning with LLM-as-a-judge

In this post, we take a deeper look at how RLAIF, or RL with LLM-as-a-judge, works effectively with Amazon Nova models.

#fine-tuning #llm #llm-as-judge

🕐 12 days ago


🛡️ Safety · AI Alignment Forum · 1 min read

A "Lay" Introduction to "On the Complexity of Neural Computation in Superposition"

This is a writeup based on a lightning talk I gave at an InkHaven hosted by Georgia Ray, where we were supposed to read a paper in about an hour,…

#neural-networks #complexity-theory #ai-alignment

🕐 21 days ago


🛡️ Safety · AI Alignment Forum · 1 min read

Preventing extinction from ASI on a $50M yearly budget

ControlAI's mission is to avert the extinction risks posed by superintelligent AI. We believe that in order to do this, we must secure an international prohibition on its development. We're…

#ai-safety #asi #extinction-risk

🕐 22 days ago


🛡️ Safety · AI Alignment Forum · 1 min read

Five approaches to evaluating training-based control measures

Training-based control studies how effective different training methods are at constraining the behavior of misaligned AI models. A central example of a case where we want to control AI models…

#safety #alignment #training

🕐 25 days ago


🛡️ Safety · AI Alignment Forum · 1 min read

Prompted CoT Early Exit Undermines the Monitoring Benefits of CoT Uncontrollability

Code: github.com/ElleNajt/controllability. TL;DR: Yueh-Han et al. (2026) showed that models have a harder time making their chain of thought follow user instructions than controlling their response (the non-thinking, user-facing…

#cot #controllability #alignment

🕐 25 days ago


🛡️ Safety · AI Alignment Forum · 1 min read

You can only build safe ASI if ASI is globally banned

Sometimes people suggest that we should simply build "safe" artificial superintelligence (ASI), rather than the presumably "unsafe" kind. [1] There are various flavors of "safe" that people suggest. Sometimes…

#ai-safety #asi #alignment

🕐 27 days ago


🛡️ Safety · AI Alignment Forum · 1 min read

Current AIs seem pretty misaligned to me

Many people, especially AI company employees [1], believe current AI systems are well-aligned in the sense of genuinely trying to do what they're supposed to do (e.g., following their spec or…

#alignment #ai-safety #misalignment

🕐 28 days ago


🛡️ Safety · AI Alignment Forum · 1 min read

My picture of the present in AI

In this post, I'll go through some of my best guesses for the current situation in AI as of the start of April 2026. You can think of this as…

#ai #alignment #forecasting

🕐 a month ago


Alignment AI News & Research · DeepTrendLab
