ChatGPT Got 52% of Software Engineering Questions Wrong, Study Finds
OpenAI’s ChatGPT answered about 52 percent of software engineering questions incorrectly, according to a new study, raising questions about the accuracy of popular large language models.
Despite ChatGPT’s popularity, there has been no in-depth investigation into the quality and usability of its responses to software engineering questions, said researchers at Purdue University in the US.
To address this gap, the team conducted a comprehensive analysis of ChatGPT’s responses to 517 questions posted on Stack Overflow (SO).
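The paper itself spells out how those questions were selected; purely as an illustrative sketch, public Stack Overflow question data can be pulled through the Stack Exchange API. The tag, sort order, and page size below are hypothetical choices for demonstration, not the study’s sampling method.

```python
import requests

# Illustrative only: fetch popular Stack Overflow questions from the
# public Stack Exchange API. Parameter values here are arbitrary
# demonstration choices, not the study's methodology.
resp = requests.get(
    "https://api.stackexchange.com/2.3/questions",
    params={
        "site": "stackoverflow",
        "tagged": "python",   # hypothetical tag filter
        "sort": "votes",
        "order": "desc",
        "pagesize": 10,
    },
    timeout=30,
)
resp.raise_for_status()

for item in resp.json()["items"]:
    print(item["question_id"], item["title"])
```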
“Our study revealed that 52 percent of ChatGPT responses contain inaccuracies and 77 percent are verbose,” the researchers wrote in a non-peer-reviewed paper posted on the preprint server arXiv.
Importantly, the team found that 54 percent of the errors occurred because ChatGPT did not understand the concepts underlying the questions.
They said that even when ChatGPT understood the question, it often showed no grasp of how to solve the problem, which contributed to the high number of conceptual errors.
In addition, the researchers observed that ChatGPT is limited in its ability to reason.
“In many cases, we saw ChatGPT provide a solution, code, or formula without foresight or thinking about the outcome,” they said.
“Prompt engineering and human-in-the-loop fine-tuning can help ChatGPT understand the problem to some extent, but they still fall short when it comes to adding reasoning to LLMs. Therefore, it is essential to understand the causes of conceptual errors as well as to fix the errors that stem from this limited reasoning,” they added.
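To make the notion of a conceptual error concrete, here is a hypothetical example of the failure mode the researchers describe, invented for illustration rather than taken from the study: an answer that runs without errors yet ignores what the question actually asked.

```python
# Hypothetical question: "How do I remove duplicates from a list
# while preserving the original order?"

# Conceptually flawed answer: set() does deduplicate, but it throws
# away the ordering the question explicitly asked to preserve.
def dedupe_wrong(items):
    return list(set(items))

# Answer that addresses the actual requirement: dict keys keep
# insertion order (Python 3.7+), so this deduplicates in order.
def dedupe_ordered(items):
    return list(dict.fromkeys(items))

print(dedupe_wrong([3, 1, 3, 2, 1]))    # order not guaranteed
print(dedupe_ordered([3, 1, 3, 2, 1]))  # [3, 1, 2]
```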
ChatGPT also suffers from other quality issues, such as verbosity and inconsistency. An in-depth manual analysis revealed a large number of conceptual and logical errors in ChatGPT’s responses, while a linguistic analysis showed that the responses are highly formal and rarely convey negative sentiment.
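The paper describes its own linguistic-analysis pipeline; as a generic sketch of how sentiment in answer text can be scored, one assumed approach uses NLTK’s VADER analyzer (the example answer string is invented).

```python
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

# One-time download of the VADER lexicon the analyzer relies on.
nltk.download("vader_lexicon", quiet=True)

sia = SentimentIntensityAnalyzer()
answer = "You should use a dictionary here; it is the idiomatic approach."
print(sia.polarity_scores(answer))
# e.g. {'neg': 0.0, 'neu': ..., 'pos': ..., 'compound': ...}
```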
Nevertheless, users preferred ChatGPT’s answers 39.34 percent of the time, owing to their comprehensiveness and well-articulated language style.
“These findings highlight the need for careful debugging of ChatGPT while raising user awareness of the potential risks associated with seemingly accurate responses,” the researchers said.