#benchmark

🍎 AI Labs Apple ML Research 2 min read

From Where Things Are to What They’re For: Benchmarking Spatial–Functional Intelligence for Multimodal LLMs

True spatial intelligence for multimodal agents transcends low-level geometric perception, evolving from knowing where things are to understanding what they are for. While existing benchmarks, such as VSI-Bench, effectively evaluate this foundational geometric stage, they fall short of probing the higher-order cognitive abilities essential for grounded intelligence. To bridge this gap, we introduce the Spatial-Functional Intelligence Benchmark (SFI-Bench), a video-based…

#multimodal llms #spatial reasoning #functional understanding

🕐 a day ago

ℹ️ News InfoQ AI 6 min read

Legare Kerrison and Cedric Clyburn on LLM Performance and Evaluations

Effectively measuring the performance of applications that use Large Language Models (LLMs) is critical to the adoption of AI technologies in organizations. Legare Kerrison and Cedric Clyburn from Red Hat…

#llm #performance evaluation #rag

🕐 8 days ago

🍎 AI Labs Apple ML Research 2 min read

Can Large Language Models Understand Context?

Understanding context is key to understanding human language, an ability which Large Language Models (LLMs) have been increasingly seen to demonstrate to an impressive extent. However, though the evaluation of…

#large language models #context understanding #benchmark

🕐 16 days ago

📥 Newsletters Import AI 13 min read

Import AI 453: Breaking AI agents; MirrorCode; and ten views on gradual disempowerment

Welcome to Import AI, a newsletter about AI research. Import AI runs on arXiv and feedback from readers. If you’d like to support this, please subscribe. A shorter issue than…

#ai agents #code generation #benchmark

🕐 23 days ago

📥 Newsletters Import AI 16 min read

Import AI 449: LLMs training other LLMs; 72B distributed training run; computer vision is harder than generative text

Welcome to Import AI, a newsletter about AI research. Import AI runs on arXiv and feedback from readers. If you’d like to support this, please subscribe. Can LLMs…

#llm #post-training #ai-autonomy

🕐 a month ago

#benchmark — AI News & Research · DeepTrendLab
