Podcast: Linear Digressions
Episode: Benchmarking AI Models

Category: Technology
Duration: 00:29:55
Publish Date: 2026-03-30 01:29:55
Description: How do you know if a new AI model is actually better than the last one? It turns out answering that question is a lot messier than it sounds. This week we dig into the world of LLM benchmarks — the standardized tests used to compare models — exploring two canonical examples: MMLU, a 14,000-question multiple choice gauntlet spanning medicine, law, and philosophy, and SWE-bench, which throws real GitHub bugs at models to see if they can fix them. Along the way: Goodhart's Law, data contamination, canary strings, and why acing a test isn't always the same as being smart.