OpenAI’s New o3 and o4-mini Models Show Increased Hallucination Rates


OpenAI’s recently launched o3 and o4-mini AI models — part of its new generation of reasoning systems — are generating more hallucinations than previous models, according to the company’s internal benchmarks. Despite their strong performance in areas like coding and math, both models show a worrying trend: an increase in inaccurate or fabricated responses.

These models are designed to enhance logical reasoning and problem-solving. Yet internal tests show that o3 hallucinated on 33% of questions in OpenAI's PersonQA benchmark, roughly twice the rate of older models such as o1 and o3-mini. The smaller o4-mini fared even worse, hallucinating in 48% of its responses.

What’s more concerning is that OpenAI currently doesn’t fully understand why hallucinations are on the rise in these advanced models. The technical report accompanying the launch suggests that as reasoning capabilities scale, the number of both correct and incorrect claims also increases — a double-edged sword for AI performance.

External research by the nonprofit lab Transluce supports these findings. In one example, o3 falsely claimed to have executed code on a MacBook Pro, something the model cannot actually do. Former OpenAI employee Neil Chowdhury has suggested that the reinforcement learning techniques used to train these models may be contributing to the rise in hallucinations.

While some users, like those at Workera, find the o3 model useful in coding workflows, they’ve also flagged recurring hallucinations like broken or non-existent website links. This inconsistency raises concerns for industries like law or finance, where factual precision is critical.

To address accuracy issues, OpenAI has explored integrating web search: GPT-4o with search achieves 90% accuracy on SimpleQA, another of the company's benchmarks. Still, whether this approach can reduce hallucinations in reasoning models remains unclear.

With reasoning models increasingly central to AI development, OpenAI faces an urgent challenge. As the demand for reliable and accurate AI continues to grow, balancing intelligence with truthfulness has never been more critical.
