Explore the latest AI news and research tagged #model-optimization — curated from top sources including OpenAI, Anthropic, Google DeepMind, and more.
#model-optimization
5 articles
🍎 AI Labs
Apple ML Research
2 min read
Serving transformer language models with high throughput requires caching Key-Values (KVs) to avoid redundant computation during autoregressive generation. The memory footprint of KV caching is significant and heavily impacts serving costs. This work proposes to lessen these memory requirements. While recent work has largely addressed KV cache reduction via compression and eviction along the temporal axis, we argue that the…
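The KV-caching pattern this excerpt describes is easy to sketch: during autoregressive decoding, each step appends one key/value pair to a growing cache instead of recomputing attention inputs for the whole prefix, so memory grows linearly with sequence length. A minimal single-head NumPy sketch (the random "projections" and shapes are illustrative stand-ins, not any model's real code):

```python
import numpy as np

def attend(q, K, V):
    """Scaled dot-product attention for one query vector q over cached K, V."""
    scores = K @ q / np.sqrt(q.shape[-1])   # (t,) similarity per cached key
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                # softmax over the prefix
    return weights @ V                      # (d,) weighted value mix

def generate_with_kv_cache(steps, d, rng):
    """Autoregressive loop: each step appends one K/V row to the cache
    rather than re-projecting the entire prefix."""
    K_cache = np.empty((0, d))
    V_cache = np.empty((0, d))
    outputs = []
    for _ in range(steps):
        # Hypothetical per-token projections; a real model computes these
        # from the current hidden state.
        q, k, v = rng.standard_normal((3, d))
        K_cache = np.vstack([K_cache, k])   # cache grows by one row
        V_cache = np.vstack([V_cache, v])   # per generated token
        outputs.append(attend(q, K_cache, V_cache))
    return K_cache, np.stack(outputs)

rng = np.random.default_rng(0)
K_cache, outs = generate_with_kv_cache(steps=8, d=4, rng=rng)
# After 8 steps the cache holds 8 rows per layer/head; this linear growth
# is the memory footprint the article targets.
```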
📈 Newsletters
Towards Data Science
9 min read
Why reasoning models dramatically increase token usage, latency, and infrastructure costs in production systems. From the post "Inference Scaling (Test-Time Compute): Why Reasoning Models Raise Your Compute Bill".
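The cost claim is straightforward to make concrete: with per-token pricing and roughly constant decode latency, a long hidden reasoning trace multiplies both the bill and the wall-clock time for the same visible answer. A back-of-envelope sketch (all prices and latencies below are assumptions for illustration, not any provider's actual figures):

```python
# Illustrative cost model for test-time compute; numbers are assumptions.
PRICE_PER_1K_OUTPUT_TOKENS = 0.01   # assumed $/1k output tokens
SECONDS_PER_TOKEN = 0.02            # assumed decode latency per token

def request_cost(answer_tokens, reasoning_tokens=0):
    """Total tokens, dollar cost, and latency for one request."""
    total = answer_tokens + reasoning_tokens
    dollars = round(total / 1000 * PRICE_PER_1K_OUTPUT_TOKENS, 4)
    seconds = round(total * SECONDS_PER_TOKEN, 2)
    return total, dollars, seconds

standard = request_cost(answer_tokens=200)
reasoning = request_cost(answer_tokens=200, reasoning_tokens=3000)
# A 3,000-token hidden reasoning trace makes the same 200-token answer
# 16x more expensive and 16x slower under this linear model.
```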
🤗 AI Labs
Hugging Face Blog
9 min read
♻️ Tools
Replicate Blog
1 min read
Cache your compiled models for faster boot and inference times
♻️ Tools
Replicate Blog
6 min read
A deep-dive into the Taylor Seer optimization technique
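The core move of Taylor Seer, per the TaylorSeers line of work on diffusion-model acceleration, is to forecast a future feature from finite differences of previously computed features instead of reusing a stale cached copy. A scalar sketch of that extrapolation step on a unit-spaced timestep grid (the function name and setting are illustrative, not Replicate's implementation):

```python
def forecast_next(history, order=2):
    """Extrapolate the next value from backward finite differences of
    past values (oldest first): a discrete Taylor-style forward step
    of size one. Exact for polynomial trends up to `order`."""
    order = min(order, len(history) - 1)
    diffs = list(history)
    prediction = history[-1]
    for _ in range(order):
        # Each pass raises the finite-difference order by one.
        diffs = [b - a for a, b in zip(diffs, diffs[1:])]
        prediction += diffs[-1]
    return prediction

# A quadratic feature trend is forecast exactly at second order:
vals = [t * t for t in range(4)]        # 0, 1, 4, 9
forecast_next(vals, order=2)            # predicts 16 for t = 4
```

The payoff in a diffusion sampler is that expensive feature computations can be skipped on some timesteps and forecast instead, trading a tiny extrapolation error for fewer full forward passes.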