Not So Smart: Study Finds 52 Percent of ChatGPT Answers to Programming Questions Are Wrong

Introduction

In recent years, many programmers have turned to chatbots like OpenAI’s ChatGPT for coding help, contributing to a decline in traffic to question-and-answer platforms like Stack Overflow, which laid off nearly 30 percent of its staff last year. However, new research from Purdue University, presented at the Computer-Human Interaction (CHI) conference, reveals a serious flaw in that reliance: 52 percent of ChatGPT’s answers to programming questions contain incorrect information.

Key Findings

The Purdue researchers analyzed ChatGPT’s answers to 517 Stack Overflow questions, comparing them against the human answers posted on the platform. The findings were concerning:

  • Misinformation: 52 percent of ChatGPT’s answers contained misinformation.
  • Verbosity: 77 percent of ChatGPT’s answers were more verbose than those provided by humans.
  • Inconsistency: 78 percent of the answers showed varying degrees of inconsistency with the human answers.

These statistics highlight the risks of relying on ChatGPT for coding advice, echoing accuracy problems already reported by users in other fields, such as writers and teachers.
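
Much of the misinformation flagged in studies like this is subtle rather than obviously wrong. As a purely hypothetical illustration (not an example drawn from the study’s data), consider a plausible-sounding Python answer to a common question, deduplicating a list while preserving order:

    # Question: "How do I remove duplicates from a list while keeping order?"
    items = ["b", "a", "b", "c", "a"]

    # A confident-sounding but wrong suggestion: set() does not preserve
    # order, so the result can come back in any arrangement.
    wrong = list(set(items))            # e.g. ['a', 'c', 'b'] -- order lost

    # A correct approach: dict keys keep insertion order (Python 3.7+),
    # so this keeps the first occurrence of each item.
    right = list(dict.fromkeys(items))  # ['b', 'a', 'c']

    print(wrong, right)

Answers like the first one often pass a casual reading, and can even appear to work on small inputs, which is precisely how fluent but flawed responses slip past reviewers.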

Linguistic Analysis

Further analysis of 2,000 randomly selected ChatGPT answers revealed distinct linguistic characteristics (a sketch of how such analysis is commonly automated follows the list):

  • Tone: ChatGPT’s responses were generally more formal and analytical.
  • Sentiment: The answers exhibited less negative sentiment, often presenting a bland, cheery tone typical of AI-generated content.
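
The paper’s own tooling is not reproduced here, but sentiment comparisons of this kind are commonly automated with off-the-shelf analyzers. Below is a minimal sketch using NLTK’s VADER sentiment scorer; the example answers are invented for illustration and this is not the study’s actual pipeline:

    # Minimal sketch: sentiment-scoring answer text with NLTK's VADER.
    # Assumes `pip install nltk`; not the study's actual tooling.
    import nltk
    from nltk.sentiment import SentimentIntensityAnalyzer

    nltk.download("vader_lexicon", quiet=True)  # one-time lexicon download
    sia = SentimentIntensityAnalyzer()

    answers = [
        "Great question! You can simply use a dictionary here.",     # cheery, AI-like
        "Don't do that. It breaks on empty input, as noted above.",  # blunter, human-like
    ]

    for text in answers:
        scores = sia.polarity_scores(text)  # keys: neg, neu, pos, compound
        print(f"{scores['compound']:+.2f}  {text}")

Averaging the compound score across thousands of answers from each source could surface the kind of positivity gap the researchers describe.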

Programmer Preferences

The study also investigated human programmers’ preferences and their ability to detect errors in ChatGPT’s answers. In a small sample of 12 programmers, the researchers found:

  • Preference Rate: 35 percent preferred ChatGPT’s answers over human answers.
  • Error Detection: Participants failed to catch the mistakes in AI-generated answers 39 percent of the time.

Reasons for AI Preference

Interviews with the participants revealed why some programmers preferred ChatGPT’s responses despite the inaccuracies:

  • Politeness: ChatGPT’s polite language made responses more appealing.
  • Articulation: The answers were well-articulated and resembled textbook explanations.
  • Comprehensiveness: The thoroughness of ChatGPT’s answers often gave a misleading impression of accuracy.

Implications and Conclusion

This study highlights significant flaws in ChatGPT’s ability to provide accurate programming assistance. These findings pose a dilemma for programmers who may rely on AI-generated content and for those affected by the shift away from platforms like Stack Overflow. While ChatGPT’s polite and comprehensive responses might seem attractive, the high rate of misinformation necessitates caution. As AI continues to evolve, understanding and mitigating these shortcomings will be crucial for developers and users alike.
