Google AI Overviews Wrong 10% of the Time—at Massive Scale

Daily Neural — Latest Artificial Intelligence News Today

Google's AI Overviews Are Getting Better and Still Getting Millions of Things Wrong

Google's AI Overviews has quietly become one of the most consequential AI deployments on the planet. It sits atop billions of searches daily, summarizing reality for people who just want a quick answer. And according to a new analysis commissioned by The New York Times and executed by AI startup Oumi, it delivers the wrong answer roughly one out of every ten times.

That 91 percent accuracy figure, achieved after the Gemini 3 upgrade, sounds like progress — and it is. When the same test ran with Gemini 2 last October, the correct-answer rate sat at 85 percent. Six percentage points of improvement over a few months is not nothing. But at Google's scale, that remaining 9 percent becomes an almost incomprehensible volume of misinformation flowing through the world's most-used information tool. Estimates suggest AI Overviews generates hundreds of thousands of wrong answers per minute.
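The scale argument is simple arithmetic, and a minimal sketch makes it concrete. The daily volume used here (2 billion AI Overview responses) is a hypothetical placeholder, not a figure from the Oumi analysis or from Google; only the 9 percent error rate comes from the study described above.

```python
MINUTES_PER_DAY = 24 * 60


def wrong_answers_per_minute(daily_overviews: float, error_rate: float) -> float:
    """Average number of incorrect AI Overviews served per minute."""
    if not 0.0 <= error_rate <= 1.0:
        raise ValueError("error_rate must be between 0 and 1")
    return daily_overviews * error_rate / MINUTES_PER_DAY


# Hypothetical volume, chosen only for illustration.
HYPOTHETICAL_DAILY_OVERVIEWS = 2_000_000_000

# 9 percent error rate, as reported for the Gemini 3 run.
print(round(wrong_answers_per_minute(HYPOTHETICAL_DAILY_OVERVIEWS, 0.09)))  # 125000
```

Under that assumed volume, a 9 percent error rate works out to about 125,000 wrong answers every minute. The result scales linearly, so a larger daily volume pushes the figure up proportionally.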

Oumi ran its evaluation using SimpleQA, a factuality benchmark developed by OpenAI consisting of more than 4,000 questions with objectively verifiable answers. The benchmark was designed to be hard — questions were pre-screened to include only those that tripped up at least one AI model — which means it's not a perfect proxy for everyday search behavior. Google was quick to flag this, with spokesperson Ned Adriance calling the study fundamentally flawed and noting that SimpleQA itself contains errors and was built for models operating without internet access. These are fair points, but they don't fully defuse the concern. Google's rebuttal amounts to arguing that the test is too hard, not that the errors don't exist.

The specific failure modes Oumi documented are more revealing than the aggregate number. Asked when Bob Marley's Kingston home became a museum, AI Overviews cited three sources, two of which didn't contain the answer at all. The third — Wikipedia — listed conflicting years, and the AI confidently chose the wrong one. Asked about Yo-Yo Ma's induction into the Classical Music Hall of Fame, the system found the correct organization's website, then stated flatly that no such institution exists. These aren't edge cases of ambiguous information; they're failures to correctly read sources the system itself surfaced.

The accuracy story gets more complicated when you factor in source verifiability. Even as correctness improved with Gemini 3, the percentage of correct answers that could actually be confirmed through Google's cited sources got worse. With Gemini 2, about 37 percent of correct answers were "ungrounded" — meaning the linked pages didn't actually support the claim. With Gemini 3, that figure jumped to 56 percent. The system is getting better at producing right answers, but increasingly those answers aren't traceable back to anything Google shows you. That's a strange and troubling tradeoff: more accurate, less verifiable.
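One way to see the tradeoff is to combine the two reported figures into a single number: the share of all answers that are both correct and supported by the cited pages. The sketch below assumes the "ungrounded" percentages apply to the same set of correct answers that produced the accuracy figures; the function name is illustrative and not part of the study.

```python
def correct_and_grounded(accuracy: float, ungrounded_share_of_correct: float) -> float:
    """Share of all answers that are correct AND supported by the cited sources."""
    return accuracy * (1.0 - ungrounded_share_of_correct)


# (accuracy, share of correct answers that were ungrounded), as reported above.
reported = {
    "Gemini 2": (0.85, 0.37),
    "Gemini 3": (0.91, 0.56),
}

for model, (accuracy, ungrounded) in reported.items():
    print(f"{model}: {correct_and_grounded(accuracy, ungrounded):.0%}")
# Gemini 2: 54%
# Gemini 3: 40%
```

Read this way, the share of answers a user could both trust and check against Google's own citations fell from roughly 54 percent to roughly 40 percent, even as headline accuracy rose.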

The sourcing quality raises additional flags. Of the nearly 5,400 sources Google cited across the study, Facebook and Reddit ranked among the top four most commonly referenced. Facebook appeared in seven percent of incorrect answers. This matters less as a quality-of-the-web problem and more as a signal about what content Google's retrieval system is actually leaning on — and, cynically, which sources are least likely to pursue legal action over content use.

The findings put pressure on Microsoft's Bing and its Copilot integration, which face the same accuracy-versus-scale problem but with far less traffic to expose the failure rate. They also put pressure on Perplexity, which has staked its entire brand identity on being a more trustworthy, citation-forward alternative to Google's summarization approach. If Perplexity can demonstrably outperform Google on verifiability, and not just on raw accuracy, that becomes a genuine competitive wedge.

The deeper issue, though, isn't the error rate in isolation. It's what AI Overviews is doing to the incentive structure of the open web. When a sufficiently accurate AI answer sits above the search results, most users stop clicking through to publishers, news sites, and specialized sources. That traffic is the economic lifeblood of the very ecosystem Google's AI is being trained on. Studies have consistently shown AI Overviews reducing outbound clicks, and Google has consistently declined to share any traffic data of its own — a silence that is itself a kind of answer.

OpenAI, to its credit, at least acknowledged the tension when it launched ChatGPT's web-browsing features, publicly stating its interest in "the overall health of the ecosystem." That concern has since gone quiet as search features scaled. Google has never made even that gesture.

What This Means

The headline number — 91 percent accuracy — will be used by Google to argue that AI Overviews is reliable enough to be the primary interface between users and information. That framing deserves scrutiny.

  • For developers building on top of search APIs or LLM pipelines: Treat AI-generated summaries as drafts, not sources. The 9 percent error rate compounds downstream: if your product chains multiple AI-generated outputs, the per-step accuracies multiply, so the chance that at least one step is wrong rises with every link in the chain.
  • For founders in the search or information-retrieval space: The verifiability gap is the real opportunity. With Gemini 3, 56 percent of correct AI Overview answers can't be confirmed through linked sources. A product that answers with genuine citation integrity, not just a list of links, has a clear differentiation story right now.
  • For tech enthusiasts and everyday users: The "AI responses may include mistakes" disclaimer at the bottom of every AI Overview is not boilerplate. It's load-bearing text. High-stakes queries — medical, legal, financial, historical dates that actually matter — warrant clicking through to primary sources, not trusting the summary.
  • For the broader ecosystem: The accuracy-versus-verifiability tradeoff revealed here is not a Gemini-specific bug. It's likely a structural feature of how retrieval-augmented generation works at scale. Models get better at producing plausible-sounding correct answers, but the connection between those answers and their supporting evidence gets looser. That's a research problem the whole industry needs to take seriously, not just Google.
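The compounding point in the developer note above can be sketched in a few lines. This is a simplified model that assumes every step in a pipeline is 91 percent accurate and that errors at each step are independent; real pipelines may violate both assumptions, so treat the numbers as illustrative.

```python
def chain_error_rate(step_accuracy: float, steps: int) -> float:
    """Probability that at least one of `steps` independent steps is wrong."""
    if steps < 1:
        raise ValueError("steps must be at least 1")
    # The whole chain is right only if every step is right.
    return 1.0 - step_accuracy ** steps


for steps in (1, 2, 3, 5):
    print(f"{steps} step(s): {chain_error_rate(0.91, steps):.1%} chance of an error")
# 1 step(s): 9.0% chance of an error
# 2 step(s): 17.2% chance of an error
# 3 step(s): 24.6% chance of an error
# 5 step(s): 37.6% chance of an error
```

Under these assumptions, a three-step chain is wrong about one time in four, which is why a summary that looks reliable in isolation should still be verified before it feeds another automated step.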

The real question is one the study itself acknowledged but couldn't answer: are users better informed with AI Overviews than without them? That's the honest benchmark. And until Google is willing to let independent researchers actually measure it, the 91 percent figure tells only part of the story.
