On Monday, September 16, a team of tech experts called for the most challenging questions to test artificial intelligence systems, which have increasingly been breezing through popular benchmark tests.
The initiative, named 'Humanity's Last Exam,' aims to assess when AI reaches an expert level and to remain relevant as AI capabilities evolve, according to the non-profit Center for AI Safety (CAIS) and the startup Scale AI.
This announcement follows the release of a new model by OpenAI, called OpenAI o1, which has achieved top results on the most well-known reasoning benchmarks, as noted by Dan Hendrycks, executive director of CAIS and advisor to Elon Musk's xAI.
Hendrycks, who co-authored two influential 2021 papers on AI testing—one on undergraduate-level knowledge and another on advanced math reasoning—remarked that AI previously gave almost random answers on these exams but now excels at them.
For instance, the Claude models from Anthropic improved from a 77 percent score on the undergraduate test in 2023 to nearly 89 percent the following year, as per a capabilities leaderboard. This progress has diminished the relevance of common benchmarks.
AI systems have struggled with less common tests involving planning and visual pattern-recognition, as reported by Stanford University's AI Index Report in April. For example, OpenAI o1 scored only around 21 percent on a version of the pattern-recognition ARC-AGI test.
Some researchers believe that tests involving planning and abstract reasoning are better indicators of intelligence, though Hendrycks argues that the visual component of the ARC test makes it less suitable for evaluating language models. 'Humanity’s Last Exam' will focus on abstract reasoning.
To prevent AI systems from memorizing answers, some questions in the exam will remain confidential. The exam will feature at least 1,000 crowd-sourced questions, to be finalized by November 1, which will be difficult for non-experts. Winning submissions will receive peer review and may earn co-authorship and up to $5,000 in prizes from Scale AI.
"We urgently need more challenging tests for expert-level models to keep up with rapid AI advancements," said Alexandr Wang, CEO of Scale.
One condition of the exam is that it will exclude questions related to weapons, a topic deemed too risky for AI to handle.