Cricket, Context and AI Benchmarks

Just as cricket statistics can hide the context of the pitch and conditions, AI benchmarks are tools: signals, not the whole story.

I have always been fascinated by data and statistics. There’s a hidden meaning behind every number, and growing up in India, that curiosity began with cricket. We would watch all kinds of stats flash across the TV (a No. 3 batter with a stack of fifties, a No. 5 with fewer) and intuitively know there were deeper reasons behind those patterns. Context always mattered.

The kind of pitch a player faces matters. Some cricketers perform better on difficult international grounds, helping the team win even if their averages look modest on paper. Others may have eye-catching stats, but earned under very different conditions.

In the end, the way you present the statistics shapes the story people see.

It’s the same with evals and benchmarks. Everyone tests their models using their own strategies, and none of those setups is inherently wrong. Each reflects what you want to highlight. As long as the experiments are reproducible, others can verify the claims, and that’s what ultimately keeps the ecosystem honest.

Claiming state-of-the-art (SOTA) results is a double-edged sword. For some, it’s genuine scientific progress and a validation of their ideas. For others, it becomes a marketing lever, something that generates attention regardless of how meaningful the improvement actually is.

Two people can look at the same stats and reach different conclusions:

  • a coach sees potential,
  • a critic sees inconsistency.

Similarly, researchers may interpret the same results in completely different ways depending on their philosophy.

But at the end of the day, benchmarks are tools. They help you understand where you stand, strengthen your conviction, and reassure you that you are moving in the right direction. They are signals, not the whole story.