🤗 AI Labs
Hugging Face Blog
6 min read
#benchmarking
8 articles
🤗 AI Labs
Hugging Face Blog
19 min read
AI evals are becoming the new compute bottleneck
🤗 AI Labs
Hugging Face Blog
8 min read
QIMMA قِمّة ⛰: A Quality-First Arabic LLM Leaderboard
🤗 AI Labs
Hugging Face Blog
15 min read
Inside VAKRA: Reasoning, Tool Use, and Failure Modes of Agents
🤗 AI Labs
Hugging Face Blog
10 min read
A New Framework for Evaluating Voice Agents (EVA)
🐍 Newsletters
AI Snake Oil
10 min read
New Paper: Towards a science of AI agent reliability
Quantifying the capability-reliability gap
📥 Newsletters
Import AI
12 min read
Import AI 446: Nuclear LLMs; China’s big AI benchmark; measurement and AI policy
Welcome to Import AI, a newsletter about AI research. Import AI runs on arXiv and feedback from readers. If you’d like to support this, please subscribe. Want to…
🐍 Newsletters
AI Snake Oil
6 min read
Can AI automate computational reproducibility?
A new benchmark to measure the impact of AI on improving science