Recent advances in artificial intelligence have made large language models widespread in our society, in areas such as education, science, medicine, art and finance, among many others. These models are increasingly present in our daily lives. However, they are not as reliable as users expect. That is the conclusion of a recent study.
The study has been led by a team from the VRAIN Institute of the Polytechnic University of Valencia (UPV) and the Valencian Graduate School and Research Network in Artificial Intelligence (ValgrAI), in Spain, together with the University of Cambridge in the United Kingdom.
The work reveals an alarming trend: in certain respects, reliability has worsened in the most recent models compared with the earliest ones (ChatGPT-4 compared with ChatGPT-3, for example).
According to José Hernández-Orallo, researcher at the Valencian University Institute for Research in Artificial Intelligence (VRAIN) of the UPV and at ValgrAI, one of the main concerns about the reliability of language models is that their behavior does not align with human perception of task difficulty. In other words, there is a mismatch between where humans expect models to fail, based on the perceived difficulty of a task, and where the models actually fail. “Models can solve certain tasks that are complex by human standards, yet fail at simple tasks in the same domain. For example, they can solve several doctoral-level mathematical problems, but make a mistake on a simple addition,” says Hernández-Orallo.
In 2022, Ilya Sutskever, the scientist behind some of the biggest advances in artificial intelligence in recent years (from solving ImageNet to AlphaGo) and co-founder of OpenAI, predicted that “maybe over time that discrepancy will decrease.”
However, the study by the UPV team, ValgrAI and the University of Cambridge shows that this has not been the case. To demonstrate this, they investigated three key aspects that affect the reliability of language models from a human perspective.
It is not rare for an artificial intelligence to solve several doctoral-level mathematical problems yet make a mistake on a simple sum. (Illustration: Amazings/NCYT)
There is no “safe zone” in which models work perfectly
The study confirms a mismatch with human perceptions of difficulty. “Do models fail where people expect them to fail? Our work concludes that models are often less accurate on tasks that humans consider difficult, but they are not 100% accurate even on simple tasks. This means that there is no ‘safe zone’ in which the models can be trusted to work perfectly,” says Yael Moros Daval, researcher at the VRAIN Institute of the UPV.
In fact, the team from the VRAIN Institute, ValgrAI and the University of Cambridge reports that the most recent models mainly improve their performance on high-difficulty tasks, but not on low-difficulty ones, “which aggravates the mismatch between model performance and human expectations of difficulty,” adds Fernando Martínez Plumed, also a VRAIN researcher.
More likely to provide incorrect answers
The study also finds that the most recent language models are much more likely to give incorrect answers than to avoid answering tasks they are unsure about. “This can lead users who initially trust the models too much to become disappointed later. Moreover, unlike in people, the models’ tendency to avoid giving answers does not increase with difficulty; humans, for example, tend to avoid giving their opinion on problems that exceed their capabilities. This leaves users responsible for detecting failures throughout all their interactions with the models,” adds Lexin Zhou, a member of the VRAIN team who also participated in this work.
Sensitivity to the problem statement
Is the effectiveness of how questions are phrased affected by task difficulty? This is another of the issues analyzed in the UPV, ValgrAI and Cambridge study, which concludes that the current trend of progress in language models, including their greater ability to understand a wide variety of prompts, may still not free users from the need to craft effective prompts. “We have found that users can be swayed by prompts that work well on complex tasks but, at the same time, produce incorrect answers on simple tasks,” adds Cèsar Ferri, also a co-author of the study and a researcher at VRAIN and ValgrAI.
Human supervision unable to compensate for these problems
In addition to these findings about the unreliability of language models, the researchers found that human supervision is unable to compensate for these problems. For example, people can recognize high-difficulty tasks, yet they still frequently judge incorrect results as correct on those tasks, even when they are allowed to say “I’m not sure,” which indicates overconfidence.
From ChatGPT to LLaMA and BLOOM
The results were similar across multiple families of language models, including OpenAI’s GPT family, Meta’s open-weight LLaMA models, and BLOOM, a fully open initiative from the scientific community.
Researchers have also found that issues of difficulty mismatch, lack of appropriate abstention, and prompt sensitivity continue to be a problem for new versions of popular families.
“In short, large language models are becoming less and less reliable from a human point of view, and relying on user supervision to correct errors is not the solution, since we tend to trust the models too much and are unable to recognize incorrect results at different levels of difficulty. A fundamental change is therefore needed in the design and development of general-purpose artificial intelligence, especially for high-risk applications, where predicting the performance of language models and detecting their errors are essential,” concludes Wout Schellaert, researcher at the VRAIN Institute.
The study, titled “Larger and more instructable language models become less reliable,” has been published in the academic journal Nature. (Source: Polytechnic University of Valencia)