Even Pokémon is not safe from the AI benchmarking debate.
Last week, a post on X went viral, claiming that Google’s latest Gemini model had surpassed Anthropic’s flagship Claude model in the original Pokémon video game trilogy. Reportedly, Gemini had reached Lavender Town on a developer’s Twitch stream, while Claude was still stuck at Mount Moon as of late February.
Gemini is literally ahead of Claude atm in Pokémon after reaching Lavender Town
119 live views btw, incredibly underrated stream pic.twitter.com/8avsovai4x
– Jush (@Jush21e8) April 10, 2025
But what the post failed to mention was that Gemini had an advantage: a minimap.
As users on Reddit have pointed out, the developer who maintains the Gemini stream built a custom minimap that helps the model identify “tiles” in the game. This reduces the need for Gemini to analyze screenshots before making gameplay decisions.
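To make the idea concrete, here is a minimal sketch of how a raw screenshot might be reduced to a labeled tile grid that an agent can reason over. The tile size, colour palette, and labels below are assumptions chosen for illustration; they are not details of the actual stream’s tooling.

```python
# Hypothetical sketch: downsample a Game Boy-style screenshot into a coarse tile
# grid so an agent can reason over labeled tiles instead of raw pixels.
from PIL import Image

TILE = 16  # Gen 1 Pokémon maps are built from 16x16-pixel tiles

# Toy colour-to-label mapping (illustrative values only)
LABELS = {
    (104, 168, 88): "grass",
    (88, 120, 216): "water",
    (112, 112, 112): "wall",
}

def to_minimap(screenshot_path: str) -> list[list[str]]:
    img = Image.open(screenshot_path).convert("RGB")
    cols, rows = img.width // TILE, img.height // TILE
    grid = []
    for r in range(rows):
        row = []
        for c in range(cols):
            # Sample the centre pixel of each tile and map it to the nearest label
            px = img.getpixel((c * TILE + TILE // 2, r * TILE + TILE // 2))
            nearest = min(LABELS, key=lambda k: sum((a - b) ** 2 for a, b in zip(k, px)))
            row.append(LABELS[nearest])
        grid.append(row)
    return grid
```

Feeding the model a grid like this, rather than a screenshot, offloads part of the perception problem to the scaffold, which is exactly why the comparison drew scrutiny.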
Now, Pokémon is a semi-serious AI benchmark at best; few people would argue it is a very informative test of a model’s capabilities. But it is an instructive example of how different implementations of the same benchmark can influence the results.
Anthropic, for example, reported two scores for its recent Claude 3.7 Sonnet model on SWE-bench Verified, a benchmark designed to evaluate a model’s coding abilities. Claude 3.7 Sonnet achieved 62.3% accuracy on SWE-bench Verified, but 70.3% with a “custom scaffold” that Anthropic developed.
More recently, Meta tuned a version of one of its newer models, Llama 4 Maverick, to perform well on a particular benchmark, LM Arena. The vanilla version of the model scores significantly worse on the same evaluation.
Given that AI benchmarks, Pokémon included, are imperfect measures to begin with, custom and non-standard implementations threaten to muddy the waters even further. In other words, it seems unlikely to get any easier to compare models as they are released.
