	Podcast:		Linear Digressions
	Episode:		Benchmarking AI Models
	Category:		Technology
	Duration:		00:29:55
	Publish Date:		2026-03-30 01:29:55
	Description:		How do you know if a new AI model is actually better than the last one? It turns out answering that question is a lot messier than it sounds. This week we dig into the world of LLM benchmarks (the standardized tests used to compare models), exploring two canonical examples: MMLU, a 14,000-question multiple-choice gauntlet spanning medicine, law, and philosophy, and SWE-bench, which throws real GitHub bugs at models to see if they can fix them. Along the way: Goodhart's Law, data contamination, canary strings, and why acing a test isn't always the same as being smart.