#benchmarking

🤗 AI Labs Hugging Face Blog 6 min read

Adding Benchmaxxer Repellant to the Open ASR Leaderboard

#asr #benchmarking #leaderboard

🕐 a day ago

🤗 AI Labs Hugging Face Blog 19 min read

AI evals are becoming the new compute bottleneck

#ai-evaluation #benchmarking #computational-cost

🕐 7 days ago

🤗 AI Labs Hugging Face Blog 8 min read

QIMMA قِمّة ⛰: A Quality-First Arabic LLM Leaderboard

#arabic #llm #benchmarking

🕐 15 days ago

🤗 AI Labs Hugging Face Blog 15 min read

Inside VAKRA: Reasoning, Tool Use, and Failure Modes of Agents

#ai-agents #benchmarking #tool-use

🕐 21 days ago

🤗 AI Labs Hugging Face Blog 10 min read

A New Framework for Evaluating Voice Agents (EVA)

#voice-agents #evaluation-framework #conversational-ai

🕐 a month ago

🐍 Newsletters AI Snake Oil 10 min read

New Paper: Towards a science of AI agent reliability

Quantifying the capability-reliability gap

#ai-agents #reliability-measurement #ai-safety

🕐 2 months ago

📥 Newsletters Import AI 12 min read

Import AI 446: Nuclear LLMs; China’s big AI benchmark; measurement and AI policy

Welcome to Import AI, a newsletter about AI research. Import AI runs on arXiv and feedback from readers. If you’d like to support this, please subscribe.

#ai-measurement #ai-policy #governance

🕐 2 months ago

🐍 Newsletters AI Snake Oil 6 min read

Can AI automate computational reproducibility?

A new benchmark to measure the impact of AI on improving science

#ai #reproducibility #scientific-research

🕐 1 year, 7 months ago

#benchmarking — AI News & Research · DeepTrendLab