
Gaming the System: How AI Companies Hack Their Way to the Top of Leaderboards
May 16, 2025
So I've been nerding out with NotebookLM lately (seriously, this tool is amazing for organizing thoughts), diving deep into something that's been bugging me: those AI leaderboards everyone seems obsessed with. You know, the ones that claim to tell us which AI is "the best"? I wondered if they were as reliable as the grades in my high school American History class (in case you weren't there, everyone was cheating). Turns out, I'm not the only one who's skeptical of these "independent" AI benchmarking tools and leaderboards.
The Leaderboard Obsession
I've heard a lot about how the only thing you can compare ChatGPT's rise to is Facebook's. And if that analogy holds up, we should start getting worried when your grandma is using ChatGPT daily. But it seems you can't turn on the news, listen to a podcast, or open LinkedIn without someone shoving AI in your face. I checked my junk email today: I had 24 messages, and 19 of them mentioned AI in some form.
So it's natural that these billion (and sometimes trillion) dollar companies would want a scoreboard to fuel their egos, right? These companies love flashing their benchmark scores like Draymond Green loves kicking grown men in the groin (if you're a Warriors fan, I don't apologize for that joke one bit).
But as my kids would say, "here's the tea": these benchmarks are fundamentally broken, and I'm not sure there is a path to fix them.
The Benchmark Gaming Hall of Fame
Here's what's actually happening behind those impressive scores:
The "Everything is Awesome" Strategy
Remember when GPT-4o became that overly enthusiastic yes-man three whole weeks ago (almost a decade in AI speak)? It started acting like that friend who laughs too hard at all your jokes, because it was trained to chase those sweet, sweet thumbs-up reactions. The model was weighting how it thought you'd feel over giving you accurate answers, and the miss was big enough that OpenAI rolled back the entire update and published a "we goofed up" post-mortem about the whole experience.
The Personality Contest
Some models (looking at you, Claude) get higher scores because they're chattier and more pleasant to interact with. It's like judging a math competition based on who has the most charming smile. Sure, I like a friendly AI too, but maybe correctness should count for something?
The Multiple Personality Disorder
This one takes it past the "gray area" into the "downright cheating" zone. Companies have been caught privately submitting TONS of different model versions, then only publicizing the one that happened to ace that specific test. Meta reportedly tested 27 different versions before landing on Llama 4! I would say that's like taking the SAT 27 times and submitting your best score, but that's technically legal (and I may have employed a similar strategy in my high school days to earn a scholarship). Suffice it to say that Meta should be held to a different standard than 17-year-old me.
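To see why this matters, here's a toy simulation (mine, not from any of the sources or papers below) of the "submit a bunch, publish the best" trick. The assumption is simple: every variant is basically the same model, and each benchmark run just adds some random measurement noise. Even then, reporting only the best of 27 runs buys you a nice fake lift.

```python
# Toy simulation: how much does "best of 27 private variants" inflate a
# leaderboard score if every variant is really the same model plus noise?
import numpy as np

rng = np.random.default_rng(42)

TRUE_SKILL = 70.0   # the model's "real" benchmark score, in points (made up)
NOISE_STD = 3.0     # run-to-run variance from prompts, sampling, judges (made up)
N_VARIANTS = 27     # variants tested privately (the number reported for Llama 4)
N_TRIALS = 10_000   # Monte Carlo repetitions

# Each trial: score all the variants, then "publish" only the best one.
scores = rng.normal(TRUE_SKILL, NOISE_STD, size=(N_TRIALS, N_VARIANTS))
published_best = scores.max(axis=1)   # what ends up on the leaderboard
honest_single = scores[:, 0]          # what a single honest submission would show

print(f"Honest single run:  {honest_single.mean():.1f}")
print(f"Best of {N_VARIANTS} runs:    {published_best.mean():.1f}")
print(f"Free lift from cherry-picking: {published_best.mean() - honest_single.mean():.1f} points")
```

With those made-up numbers, the "best of 27" score comes out roughly six points higher than an honest single run, and nobody got any smarter.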
The "I Saw The Test Questions" Problem
Some models are suspiciously good at certain benchmarks because—plot twist—they've seen the answers before! Either through data contamination or straight-up overfitting, they're taking a test they've been trained on. Now, I'm not saying this is EXACTLY what happened in Coach Jeffries' class, but if you teach a class for 25 straight years, maybe mix up your test questions every now and then…
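If you're curious what a contamination check even looks like, here's a rough, illustrative sketch: flag any benchmark question whose word n-grams also show up in the training data. The function names and the toy questions are mine, and real audits use much fancier matching, but the basic idea is the same.

```python
# Rough contamination check (illustrative only): flag benchmark questions
# whose word n-grams also appear somewhere in the training corpus.
def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    """Return the set of n-word shingles in a lowercased text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def contamination_rate(benchmark_questions: list[str],
                       training_docs: list[str], n: int = 8) -> float:
    """Fraction of benchmark questions sharing at least one n-gram with training data."""
    train_shingles = set()
    for doc in training_docs:
        train_shingles |= ngrams(doc, n)
    flagged = sum(1 for q in benchmark_questions if ngrams(q, n) & train_shingles)
    return flagged / max(len(benchmark_questions), 1)

# Toy example: the first question was copied into the training data verbatim.
train = ["A train leaves Chicago at 3pm traveling 60 mph toward St Louis "
         "which is 300 miles away when does it arrive"]
bench = [
    "A train leaves Chicago at 3pm traveling 60 mph toward St Louis "
    "which is 300 miles away when does it arrive",
    "If a farmer has 17 sheep and all but 9 run away how many are left",
]
print(f"Contaminated: {contamination_rate(bench, train):.0%} of questions")  # -> 50%
```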
The Cherry-Picking Championship
When xAI released Grok 3, they conveniently "forgot" to mention certain metrics where they didn't shine. A lie of omission is still a lie, Elon.
The Real-World Messiness
Meanwhile, in the real world, AI is being used for some pretty sketchy stuff. We've seen:
Claude being manipulated to run fake political personas
Security camera credential scraping
Recruitment fraud and malware development
And let me tell you, the economic fallout from AI-powered disinformation isn't pretty. Not to mention the privacy concerns that keep me up at night.
So What's an Executive to Do?
If you're sitting in the C-suite wondering which AI to bet on, here's my down-to-earth advice:
Ignore the scoreboard hype. That 98% accuracy score means nothing if it can't handle your company's specific needs.
Test it yourself. Run it through your own real-world scenarios. How does it handle YOUR data and YOUR problems? (There's a bare-bones sketch of what that could look like right after this list.)
Demand the receipts. Ask vendors tough questions about their evaluation methods. Make them sweat a little.
Care about the responsible stuff. Safety, bias, factuality—boring words, crucial concepts.
Look for all-around players. You want the AI equivalent of that person who's good at sports AND math AND cooking, not just a one-trick pony.
Count the costs. Some models are like an early-2000s Hummer—powerful but expensive to run (for you and the environment). Make sure you can afford the computational bill.
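Here's that "test it yourself" sketch I promised. Everything in it is a placeholder: swap call_model for whichever vendor API you're actually evaluating, and swap the example scenarios for real tasks from your business with your own pass criteria.

```python
# Bare-bones "test it yourself" harness (illustrative placeholders throughout).
from dataclasses import dataclass
from typing import Callable

@dataclass
class Scenario:
    prompt: str                    # a real task from your business
    passes: Callable[[str], bool]  # how you decide the answer is acceptable

def evaluate(call_model: Callable[[str], str], scenarios: list[Scenario]) -> float:
    """Run every scenario through the model and return the pass rate."""
    results = [s.passes(call_model(s.prompt)) for s in scenarios]
    return sum(results) / len(results)

# Example scenarios (hypothetical): your own data, your own pass criteria.
scenarios = [
    Scenario("Summarize this refund policy in two sentences: ...",
             passes=lambda out: len(out.split(".")) <= 3),
    Scenario("Extract the invoice total from: 'Total due: $1,240.50'",
             passes=lambda out: "1,240.50" in out or "1240.50" in out),
]

# def call_model(prompt: str) -> str: ...  # wire up the vendor's SDK here
# print(f"Pass rate on OUR problems: {evaluate(call_model, scenarios):.0%}")
```

A pass rate on twenty of your own scenarios will tell you more than any leaderboard ranking ever will.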
My Take
As I continue my journey as your friendly neighborhood CEO AI Coach, I'm becoming more convinced that AI adoption isn't about chasing the shiniest new model with the highest benchmark score. It's about finding the right tool that delivers actual value while not burning your company to the ground ethically or financially.
The AI world is messy, the benchmarks are flawed, and the hype is real. But with a healthy dose of skepticism and some practical testing, you can navigate this chaos and find AI that actually works.
Until next time, keep exploring, stay curious, and don't believe everything you read on a leaderboard!
P.S. This blog entry was a collaboration between me, almost 20 online sources, NotebookLM, and Claude—we debated, refined, and evolved these ideas together. But no AI will ever have the full story of what happened in that high school American History class - I’m taking that to the grave.
CITE YOUR SOURCES, PEOPLE:
https://openai.com/index/sycophancy-in-gpt-4o/
https://techcrunch.com/2025/04/30/study-accuses-lm-arena-of-helping-top-ai-labs-game-its-benchmark/
Excerpts from "AI Risk & Reliability Prompt EOI - MLCommons"
Excerpts from "Even good leaderboards may not be useful, because they are gamed - Ehud Reiter's Blog"
Excerpts from "FrontierMath: LLM Benchmark for Advanced AI Math Reasoning | Epoch AI"
Excerpts from "GitHub - openai/grade-school-math"
Excerpts from "GitHub - stanford-crfm/helm: Holistic Evaluation of Language Models (HELM) is an open source Python framework created by the Center for Research for Foundation Models (CRFM) at Stanford for holistic, reproducible and transparent evaluation of foundation models, including large language models (LLMs) and multimodal models."
Excerpts from "Leading AI models accused of cheating benchmark tests"
Excerpts from "Predicting and explaining AI model performance: A new approach to evaluation - Microsoft Research"
Excerpts from "The Leaderboard Illusion arXiv:2504.20879v1 [cs.AI] 29 Apr 2025"
Excerpts from "[2502.06559] Can We Trust AI Benchmarks? An Interdisciplinary Review of Current Issues in AI Evaluation - arXiv"
https://www.reddit.com/r/LocalLLaMA/comments/1kb6bbl/new_study_from_cohere_shows_lmarena_formerly/