
Crowdsourced AI benchmarks have serious flaws, some experts say


AI labs are increasingly relying on crowdsourced benchmarking platforms such as Chatbot Arena to probe the strengths and weaknesses of their latest models. But some experts say there are serious problems with this approach from an ethical and academic perspective.

Over the past few years, labs including OpenAI, Google, and Meta have turned to platforms that recruit users to help evaluate the capabilities of upcoming models. When a model scores favorably, the lab behind it often touts that score as evidence of a meaningful improvement.

It’s a flawed approach, however, according to Emily Bender, a linguistics professor at the University of Washington and co-author of the book “The AI Con.” Bender takes particular issue with Chatbot Arena, which asks volunteers to prompt two anonymous models and pick the response they prefer.

“To be valid, a benchmark needs to measure something specific, and it needs construct validity: that is, evidence that the construct of interest is well defined and that the measurements actually relate to that construct,” Bender said. “Chatbot Arena hasn’t shown that voting for one output over another actually correlates with preferences, however those may be defined.”
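For context on the mechanism Bender is critiquing: arena-style leaderboards typically aggregate these head-to-head votes into Elo-style ratings (Chatbot Arena has used Elo and, later, Bradley-Terry scoring). The minimal Python sketch below shows how that aggregation works; the model names, 1,000-point starting rating, and K-factor of 32 are illustrative assumptions, not Chatbot Arena’s actual parameters.

    # Minimal sketch: turning pairwise preference votes into Elo-style ratings.
    # Model names, starting rating, and K-factor are illustrative assumptions.
    def expected_score(r_a: float, r_b: float) -> float:
        # Predicted probability that the model rated r_a beats the one rated r_b.
        return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

    def record_vote(ratings: dict, winner: str, loser: str, k: float = 32.0) -> None:
        # Nudge both ratings toward the observed outcome of a single vote.
        e = expected_score(ratings[winner], ratings[loser])
        ratings[winner] += k * (1.0 - e)
        ratings[loser] -= k * (1.0 - e)

    ratings = {"model-a": 1000.0, "model-b": 1000.0}
    for winner, loser in [("model-a", "model-b"), ("model-a", "model-b")]:
        record_vote(ratings, winner, loser)
    print(ratings)  # model-a drifts above 1000, model-b below

Bender’s objection is that this arithmetic is only as meaningful as the votes feeding it: if a click does not reflect a well-defined preference, the resulting ratings inherit that ambiguity.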

Asmelash Teka Hadgu, the co-founder of AI firm Lesan and a fellow at the Distributed AI Research Institute, said he thinks benchmarks like Chatbot Arena are being “co-opted” by AI labs to “promote exaggerated claims.” Hadgu pointed to a recent controversy involving Meta’s Llama 4 Maverick model: Meta fine-tuned a version of Maverick to score well on Chatbot Arena, only to withhold that model in favor of releasing a worse-performing version.

“Benchmarks should be dynamic rather than static data sets,” Hadgu said, “distributed across multiple independent entities, such as organizations or universities, and tailored to distinct use cases like education and healthcare, by the practitioners who use these [models] for work.”

Hadgu and Kristine Gloria, who formerly led the Aspen Institute’s Emergent and Intelligent Technologies Initiative, also made the case that model evaluators should be compensated for their work. Gloria said that AI labs should learn from the mistakes of the data labeling industry, which is notorious for its exploitative practices. (Some labs have been accused of the same.)

“In general, the crowdsourced benchmarking process is valuable and reminds me of citizen science initiatives,” Gloria said. “Ideally, it helps bring in additional perspectives to provide some depth in both the evaluation and fine-tuning of data.”

Matt Fredrikson, the CEO of Gray Swan AI, which runs crowdsourced red teaming campaigns for models, said that volunteers are drawn to Gray Swan’s platform for a range of reasons, including “learning and practicing new skills.” (Gray Swan also awards cash prizes for some tests.) Still, he acknowledged that public benchmarks are not a substitute for “private” evaluations.

“[D]evelopers also need to rely on internal benchmarks, algorithmic red teams, and contracted red teamers who can take a more open-ended approach or bring specific domain expertise,” Fredrikson said.

Alex Atallah, the CEO of model marketplace OpenRouter, which recently partnered with OpenAI to give users early access to OpenAI’s GPT-4.1 models, likewise said that open testing and benchmarking of models alone “isn’t sufficient.” So did Wei-Lin Chiang, an AI doctoral student at UC Berkeley and one of the founders of LM Arena, which maintains Chatbot Arena.

“We certainly support the use of other tests,” Chiang said. “Our goal is to create a trustworthy, open space that measures our community’s preferences about different AI models.”

Chiang said that incidents like the Maverick benchmark discrepancy are not the result of a flaw in Chatbot Arena’s design, but of labs misinterpreting its policy. LM Arena has taken steps to prevent future discrepancies, Chiang said, including updating its policies to “reinforce our commitment to fair, reproducible evaluations.”

“Our community isn’t here as volunteers or model testers,” Chiang said. “People use LM Arena because we give them an open, transparent place to engage with AI and give collective feedback. As long as the leaderboard faithfully reflects the community’s voice, we welcome it being shared.”
