Debates over AI benchmarking have reached Pokémon

By Karla T Vasquez

Even Pokémon is not safe from the AI benchmarking debate.

Last week, a post on X went viral claiming that Google's latest Gemini model had surpassed Anthropic's flagship Claude model in the original Pokémon video game trilogy. Gemini had reportedly reached Lavender Town on a developer's Twitch stream; Claude was stuck at Mount Moon as of late February.

But what the post failed to mention was that Gemini had an advantage: a minimap.

As Reddit users have pointed out, the developer who maintains the Gemini stream built a custom minimap that helps the model identify "tiles" in the game. This reduces the need for Gemini to analyze screenshots before making gameplay decisions.
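To see why that matters, here is a minimal sketch of the idea: instead of asking the model to infer the layout from raw screenshot pixels, a harness can hand it a pre-labeled tile grid as plain text. The tile codes and function names below are hypothetical; the actual implementation behind the Gemini stream has not been published in detail.

```python
# Hypothetical sketch of a minimap harness. Tile IDs, legend, and
# coordinates are illustrative only, not the stream's real implementation.

TILE_LEGEND = {
    0: ".",  # walkable ground
    1: "#",  # wall / impassable
    2: "T",  # tree or other interactable obstacle
    3: "D",  # door / warp
    4: "@",  # player position
}

def render_minimap(tile_grid: list[list[int]]) -> str:
    """Convert a 2D grid of tile IDs into a compact ASCII map.

    Feeding this text to the model skips the harder step of
    inferring tile types from screenshot pixels.
    """
    return "\n".join(
        "".join(TILE_LEGEND.get(tile, "?") for tile in row)
        for row in tile_grid
    )

if __name__ == "__main__":
    grid = [
        [1, 1, 1, 3, 1],
        [1, 0, 0, 0, 1],
        [1, 0, 4, 0, 1],
        [1, 2, 0, 0, 1],
        [1, 1, 1, 1, 1],
    ]
    print(render_minimap(grid))
    # ###D#
    # #...#
    # #.@.#
    # #T..#
    # #####
```

A model reading that grid can plan a route in one step; a model working from screenshots has to solve a vision problem first.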

Now, Pokémon is a semi-serious AI benchmark at best; few would argue it is a very informative test of a model's capabilities. But it is an instructive example of how different implementations of a benchmark can influence the results.

Anthropic, for example, recently reported two scores for its Claude 3.7 Sonnet model on SWE-bench Verified, a benchmark designed to evaluate a model's coding abilities. Claude 3.7 Sonnet achieved 62.3% accuracy on SWE-bench Verified, but 70.3% with a "custom scaffold" that Anthropic developed.
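Anthropic's scaffold details aside, here is a minimal sketch of one common scaffolding pattern, best-of-N sampling, which shows how a harness alone can lift a score: the same model takes several attempts at each task and the harness keeps only the highest-scoring one. All names below are illustrative stand-ins, not Anthropic's actual setup.

```python
import random

# Illustrative sketch of "best-of-N" scaffolding. solve_once and score
# are hypothetical stand-ins; this is not Anthropic's published scaffold.

def solve_once(task: str, seed: int) -> str:
    """Stand-in for one model attempt at a task (e.g., a code patch)."""
    return f"candidate-{seed}-for-{task}"

def score(candidate: str) -> float:
    """Stand-in for a scoring step, e.g., running the task's test suite."""
    return random.random()

def solve_with_scaffold(task: str, n_attempts: int = 8) -> str:
    """Sample several attempts and keep the highest-scoring one.

    A plain single-attempt run reports the first candidate's result;
    a scaffold like this reports the best of N, so the same underlying
    model posts a higher number on the same benchmark.
    """
    candidates = [solve_once(task, seed) for seed in range(n_attempts)]
    return max(candidates, key=score)

if __name__ == "__main__":
    print(solve_with_scaffold("example-task"))
```

Nothing about the underlying model changes between the two numbers; only the evaluation procedure does.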

More recently, Meta fine-tuned a version of one of its new models, Llama 4 Maverick, to perform well on a particular benchmark, LM Arena. The vanilla version of the model scores significantly worse on the same evaluation.

Given that AI benchmarks, Pokémon included, are imperfect measures to begin with, custom and non-standard implementations threaten to muddy the waters even further. That is to say, comparing models seems unlikely to get any easier as new ones are released.


