
The rise of AI ‘reasoning’ models is making benchmarking more expensive


AI labs such as OpenAI claim that their so-called "reasoning" models, which can "think" through problems step by step, are more capable than their non-reasoning counterparts in specific domains, such as physics. But while this generally appears to be the case, reasoning models are also far more expensive to benchmark, making those claims difficult to verify independently.

According to data from Artificial Analysis, a third-party AI testing outfit, it cost $2,767.05 to evaluate OpenAI's o1 reasoning model across a suite of seven popular AI benchmarks, including MMLU-Pro, GPQA Diamond, Humanity's Last Exam, LiveCodeBench, and AIME 2024.

On the same set of tests, benchmarking Anthropic's latest Claude 3.7 Sonnet, a "hybrid" reasoning model, was also costly, while testing OpenAI's o3-mini-high cost $344.59, per Artificial Analysis.

Some reasoning models are cheaper to benchmark than others. Artificial Analysis spent $141.22 evaluating OpenAI's o1-mini, for example. But on average, they remain expensive. All told, Artificial Analysis has spent roughly $5,200 evaluating about a dozen reasoning models, close to twice the $2,400 it spent analyzing over 80 non-reasoning models.

OpenAI's non-reasoning GPT-4o model, released in May 2024, cost Artificial Analysis just $108.85 to evaluate, while Claude 3.6 Sonnet, Claude 3.7 Sonnet's non-reasoning predecessor, cost $81.41.

George Cameron, co-founder of Artificial Analysis, told TechCrunch that the organization plans to increase its benchmarking spending as more AI labs develop reasoning models.

"At Artificial Analysis, we run hundreds of evaluations monthly and dedicate a significant budget to them," Cameron said. "We're planning for this spend to increase as models are released more frequently."

Artificial Analysis isn't the only outfit of its kind grappling with rising AI benchmarking costs.

Ross Taylor, CEO of AI startup General Reasoning, said he recently spent $580 evaluating Claude 3.7 Sonnet on around 1,750 unique prompts. Taylor estimates that a single run-through of MMLU-Pro, a question set designed to benchmark a model's language comprehension, would have cost more than $1,800.

"We're moving to a world where a lab reports x% on a benchmark where they spent y amount of compute, but where the resources available to academics are far less than y," Taylor said in a recent post on X. "[N]o one is going to be able to reproduce the results."

Why are reasoning models so expensive to test? Mainly because they generate a lot of tokens. Tokens represent bits of raw text, such as the word "fantastic" split into the syllables "fan," "tas," and "tic." According to Artificial Analysis, OpenAI's o1 generated over 44 million tokens during the firm's benchmarking tests, roughly eight times the amount GPT-4o generated.
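To make the token idea concrete, here's a minimal sketch using OpenAI's open-source tiktoken tokenizer. This is an illustrative assumption, not how benchmarking firms measure usage: each model family ships its own tokenizer, and a common word like "fantastic" may well map to a single token in practice rather than the syllables above.

```python
# Minimal tokenization sketch using OpenAI's tiktoken library
# (pip install tiktoken). Illustrative only: token boundaries and
# counts vary across providers and model families.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # encoding used by several OpenAI models

text = "Reasoning models think step by step, which is fantastic but costly."
token_ids = enc.encode(text)

# Show how the raw text gets carved into token-sized pieces.
pieces = [enc.decode_single_token_bytes(t).decode("utf-8", errors="replace")
          for t in token_ids]
print(pieces)          # e.g. ['Reason', 'ing', ' models', ' think', ...]
print(len(token_ids))  # API billing is based on counts like this one
```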

Most AI companies charge for model usage by the token, so you can see how these costs add up.
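Because billing is per token, a rough evaluation bill is simply tokens generated multiplied by the provider's per-million-token rate. Here's a back-of-envelope sketch using the 44-million-token figure above; the dollar rates are hypothetical placeholders for illustration, not quoted provider prices.

```python
# Back-of-envelope evaluation cost: tokens generated x per-token price.
# The 44M figure comes from Artificial Analysis (above); the dollar rates
# below are hypothetical placeholders, not published price sheets.
def eval_output_cost(output_tokens: int, usd_per_million: float) -> float:
    return output_tokens / 1_000_000 * usd_per_million

O1_TOKENS = 44_000_000           # tokens o1 generated across the benchmark suite
GPT4O_TOKENS = O1_TOKENS // 8    # GPT-4o produced roughly an eighth as many

print(f"o1 at $60/M output tokens:     ${eval_output_cost(O1_TOKENS, 60.0):>9,.2f}")
print(f"GPT-4o at $10/M output tokens: ${eval_output_cost(GPT4O_TOKENS, 10.0):>9,.2f}")
# Input tokens are billed too (usually at a lower rate), so real invoices run higher.
```

Even at a much lower per-token rate, the roughly eightfold gap in tokens generated is what drives the cost difference between reasoning and non-reasoning models.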

Modern benchmarks also tend to elicit a lot of tokens from models because they contain questions involving complex, multi-step tasks, according to Jean-Stanislas Denain, a senior researcher at Epoch AI, which develops its own model benchmarks.

"[Today's] benchmarks are more complex, [even though] the overall number of questions per benchmark has decreased," Denain told TechCrunch.

Denain added that the most expensive models have also gotten more expensive per token over time. For example, Anthropic's Claude 3 Opus was the priciest model when it launched in March 2024, costing $75 per million output tokens. OpenAI's GPT-4.5 and o1-pro, both launched earlier this year, cost $150 per million output tokens and $600 per million output tokens, respectively.

"[S]ince models have gotten better over time, it's still true that the cost to reach a given level of performance has dropped a lot over time," said Denain. "But if you want to evaluate the biggest, best models at any point in time, you're still paying more."

Many AI labs, including OpenAI, give benchmarking organizations free or subsidized access to their models for testing purposes. But some experts say this colors the results; even if there's no evidence of manipulation, the mere suggestion of an AI lab's involvement threatens to harm the integrity of the evaluation scores.

"From [a] scientific point of view, if you publish a result that no one can replicate with the same model, is it even science anymore?" Taylor wrote in a follow-up post on X. "Was it ever science?"
