Anne-Marie Kermarrec, professor of large-scale distributed systems at EPFL. — © Alain Herzog

The tank is almost empty. It contained university entrance exams, complex mathematical problems and, of course, the famous Turing test. Yet week after week, ChatGPT and its competitors pass all these exams with ever-higher scores. Hence the idea, launched this week, of finding other ways to judge the progress of these ultra-high-performance artificial intelligence (AI) services. The race is gathering pace, as calls to regulate the technology have multiplied in recent days.

This autumn the Center for AI Safety (CAIS), whose mission is to minimize the risks posed by AI, launched an appeal in association with the start-up Scale AI. “Humanity needs to maintain a good understanding of the capabilities of AI systems. Existing tests have become too easy, and we can no longer properly track the evolution of AI, nor know what it lacks to reach expert level,” according to CAIS, which has therefore proposed a competition called ‘Humanity’s Last Exam.’ Its aim: to create “the most difficult AI test in the world.”

$5,000 for the winners

Anyone can submit a test. The questions must be difficult for non-experts and must not be solvable by a quick online search, says CAIS, which warns: “Do not submit questions relating to chemical, biological, radiological or nuclear weapons, cyberweapons, or virology.” The authors of the 50 questions deemed most interesting will receive $5,000 each, with the next 500 questions rewarded with $500.

What should we make of this search for new tests to measure AI progress? “Ensuring the robustness and correctness of AI models is vital for adoption, since we know that today we can’t trust them blindly. So, yes, the search for effective tests is highly relevant. That is what Turing was trying to do with the test that bears his name,” says EcoCloud’s Anne-Marie Kermarrec, professor of Scalable Computing Systems at EPFL.

According to the expert, “AI is progressing at a very high speed to ‘imitate’ human reasoning by performing increasingly complex tasks. As far as creativity is concerned, which can be defined as producing new objects, for example artistic ones, we can consider that AI is indeed capable of producing new works of art such as poetry, songs, music or paintings. Human creativity, on the other hand, cannot be judged by the same criteria.”

As proof of this progress, last year OpenAI launched the o1 version of ChatGPT, capable, according to the American company, of “reasoning”. “In a qualifying exam for the International Mathematical Olympiad, GPT-4o solved only 13% of the problems correctly, while the reasoning model scored 83%,” boasted OpenAI. Success stories of this kind have been multiplying in recent months. In January 2023, an earlier version of ChatGPT passed exams at a US law school after writing essays on topics ranging from constitutional law to taxation to torts.

Turing test passed

And in June of this year, researchers at the University of California, San Diego claimed to have put ChatGPT through the Turing test, mentioned above by Anne-Marie Kermarrec, for the first time. In a nutshell, the test, proposed by British mathematician Alan Turing in 1950, asks whether a machine can imitate human intelligence to the point where a person cannot distinguish it from another human being during a conversation. In the San Diego study, more than half of the 500 participants mistakenly believed they were conversing with a human.

Are we moving towards AIs capable of “reasoning”, as OpenAI claims for the o1 version of ChatGPT?

“This version is based on a more complex model, which takes more time to respond – which, incidentally, is proof not of more thinking but of more computation,” asserts Anne-Marie Kermarrec. The professor continues: “I wouldn’t go so far as to say that these models do ‘reasoning’. They exploit data even more, and in different ways. Moreover, this increased complexity risks undermining explainability and transparency. For the time being, the notion of reasoning seems to me to be a marketing question, even if I have no doubt that these models can handle more complex tasks than their predecessors. It’s more a matter of the ability to mimic reasoning well.”

Article by Anouch Seydtaghia, originally published in French on 21 September 2024