OpenAI’s o3 AI model scores lower on a benchmark than the company initially implied


By Karla T Vasquez



A difference between first- and third-party benchmark results for OpenAI's o3 AI model is raising questions about the company's transparency and its model-testing practices.

When OpenAI unveiled o3 in December, the company claimed the model could answer more than a fourth of the questions on FrontierMath, a challenging set of math problems. That score blew away the competition: the next-best model could answer only about 2% of FrontierMath problems correctly.

"Today, all offerings out there have less than 2% [on FrontierMath]," said Mark Chen, OpenAI's chief research officer, during a livestream. "We're seeing [internally], with o3 in aggressive test-time compute settings, we're able to get over 25%."

As it turns out, that figure was likely an upper bound, achieved by a version of o3 with more computing power behind it than the model OpenAI publicly launched last week.

On Friday, the research institute Epoch AI released the results of its independent benchmark tests of o3. Epoch found that o3 scored around 10%, well below OpenAI's highest claimed score.

That doesn't mean OpenAI lied, per se. The benchmark results the company published in December show a lower-bound score that matches the score Epoch observed. Epoch also noted that its testing setup likely differs from OpenAI's, and that it used an updated release of FrontierMath for its evaluations.

"The difference between our results and OpenAI's might be due to OpenAI evaluating with a more powerful internal scaffold, using more test-time [computing], or because those results were run on a different subset of FrontierMath (frontiermath-2024-11-26, 180 problems, vs frontiermath-2025-02-28-private)," wrote Epoch.

According to a post on X from the ARC Prize Foundation, an organization that tested a pre-release version of o3, the public o3 model "is a different model […] tuned for chat/product use," corroborating Epoch's report.

"All released o3 compute tiers are smaller than the version we [benchmarked]," ARC Prize wrote. Generally speaking, bigger compute tiers can be expected to achieve better benchmark scores.

Granted, the fact that o3's public release falls short of OpenAI's testing promises is somewhat of a moot point, since the company's o3-mini-high and o4-mini models outperform o3 on FrontierMath, and OpenAI plans to release a more powerful o3 variant, o3-pro.

Still, it's another reminder that AI benchmarks are best not taken at face value — particularly when the source is a company with services to sell.

Benchmarking "controversies" are becoming a common occurrence in the AI industry as vendors race to capture headlines and mindshare with new models.

In January, Epoch was criticized for waiting until after OpenAI announced o3 to disclose funding it had received from the company. Many of the academics who contributed to FrontierMath weren't informed of OpenAI's involvement until it was made public.

More recently, Elon Musk's xAI was accused of publishing misleading benchmark charts for its latest AI model, Grok 3. And just this month, Meta admitted to touting benchmark scores for a version of a model that differed from the one the company made available to developers.


