Nobody’s in the Cockpit: The Real Risk of Fully Autonomous AI Security Testing

Key Takeaways

  • 95% of vulnerability submissions to the curl project were invalid.
  • Major bug bounty platforms now report 60–80% of submissions are invalid, overwhelming triage teams with AI-generated false positives.
  • Every leading frontier AI model still crossed a 10% hallucination rate on factual benchmarks in 2026.
  • The most advanced AI-driven cyberattack on record still required human operators at critical decision points, and hallucination was its biggest limitation.
  • Fully autonomous AI testing doesn’t eliminate noise; it amplifies it.
  • Synack pairs Sara AI Pentesting with the Synack Red Team and a dedicated triage function so that zero false positives reach the customer.

The curl project, one of the most important pieces of software on the internet, just shut down its bug bounty program. Not because the project is less important. Not because the community gave up. But because 95% of the vulnerability reports it received were not valid. About a fifth were outright AI-generated noise. Only around 5% turned out to be real.

This is a program that found and fixed 87 genuine vulnerabilities over its lifetime. It didn’t die of neglect. It was effectively DDoS’d by confident-looking, professionally formatted, hollow AI output.

That’s not an edge case. Major bug bounty platforms now report that 60–80% of vulnerability submissions are invalid. Triage teams are drowning in AI-generated noise that looks real until you dig in.

So when you hear “just automate security testing,” this is the story they’re not telling you.

The Hallucination Problem Isn’t Being Fixed

Here’s the truth about large language models (LLMs): when you ask them to find a vulnerability, they produce one, whether or not a vulnerability actually exists. That’s because LLMs have no concept of truth. They assemble security-sounding language into something that resembles a finding. It looks professional and it has CVSS scores. But it’s often wrong.

And this isn’t improving at the pace people assume. In independent 2026 testing, every leading reasoning model, the newest releases from OpenAI, Anthropic, Google, and xAI, still crossed a 10% hallucination rate on factual benchmarks, with some north of 20%. More capability has not meant more truthfulness.

Independent researchers running frontier models against live targets document the same pattern: large volumes of findings that initially appear to warrant high CVSS scores but collapse under inspection, showing no real-world reachability, no exploitability, or outright fabrication. The conclusion from those researchers: effective AI-driven discovery still requires human expert oversight.

To be clear, no one is arguing that a human should monitor every step an AI takes. That doesn’t scale, and it was never the point. Modern aviation runs on autopilot because nobody expects a pilot to hand-fly every second of a transatlantic flight. But you don’t then conclude the cockpit should be empty. The pilot is there for the storm no model predicted, for the failure no checklist covered, for the one-in-a-thousand moment that demands judgment. 

Cutting the human out of AI security entirely isn’t automation. It’s a jet full of passengers with nobody in the cockpit.

The Attackers Know This Too

Look at the most advanced AI-driven cyberattack ever documented. In late 2025, Anthropic disrupted an espionage campaign attributed with high confidence to a Chinese state-sponsored group. The operation hit roughly 30 targets: tech firms, financial institutions, chemical manufacturers, government agencies. The AI ran an estimated 80–90% of the operation.

Sounds like proof that autonomous AI attacks have arrived. Read the rest.

Human operators still selected the targets. The AI ran autonomously against them, with human intervention at the handful of critical decision points that actually steered the campaign. And the thing that limited the machine’s autonomy? Hallucinations. It overstated what it had found. It fabricated credentials. It flagged public information as critical discoveries. Anthropic called that an obstacle to fully autonomous cyberattacks.

The most capable AI attackers on the planet still kept humans in the loop. Not because they wanted to. Because the hallucination problem forced them to.

That cuts both ways. AI misses what skilled humans catch, such as the business-logic flaw or the creative exploit chain. Humans miss what AI surfaces at scale. Neither is complete alone. If the adversaries haven’t solved this, neither has your AI pentesting vendor.

What Fully Autonomous Pentesting Delivers in Practice

Picture an AI pentesting engine turned loose inside your production network. No oversight, no triage, just “find everything.”

What you get isn’t security. You get a firehose of confident, professionally formatted findings—most of them wrong—pointed at a team with no way to drink from it. The one finding that actually matters gets buried under 400 others that don’t.

Volume is not value. It’s noise. And in security, noise is how the real thing gets missed.
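To put rough numbers on that, here is a minimal back-of-the-envelope sketch. It takes the one-real-finding-in-401 picture above and assumes, purely for illustration, that a human needs 30 minutes to validate each finding. The function name `triage_load` and the 30-minute figure are hypothetical, not measured data.

```python
def triage_load(real: int, false: int, minutes_per_finding: float) -> dict:
    """Estimate how much review effort a batch of findings costs.

    All inputs are illustrative assumptions, not measurements.
    """
    total = real + false
    if total == 0:
        raise ValueError("need at least one finding")

    # Share of findings that are real (precision of the report).
    precision = real / total
    # Total human review time if every finding must be checked.
    hours_total = total * minutes_per_finding / 60
    # Review time spent per real finding; infinite if nothing is real.
    hours_per_real = hours_total / real if real else float("inf")

    return {
        "total": total,
        "precision": precision,
        "hours_total": hours_total,
        "hours_per_real": hours_per_real,
    }


# One real finding buried under 400 others, 30 minutes of review each (assumed).
load = triage_load(real=1, false=400, minutes_per_finding=30)
print(f"precision: {load['precision']:.2%}")
print(f"review hours: {load['hours_total']}")
```

Under those assumptions, roughly a quarter of one percent of the report is real, and the team spends about 200 hours of review to surface a single finding. Change the assumed review time and the totals move, but the ratio does not.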

The AI + Human Pentesting Model That Actually Works

At Synack, we’ve spent 13 years building toward a different answer. Sara AI Pentesting brings scale, speed, and coverage, running continuously across a broader attack surface than any manual team can match on its own. That’s real. We lean into it.

The Synack Red Team, our vetted community of ethical hackers, brings what AI can’t: adversarial creativity, business context, and the judgment to tell a theoretical vulnerability from an exploitable one. And between every finding and every customer sits a dedicated triage team plus AI triage agents with exactly one mandate: zero false positives reach you.

A finding is only worth something if it’s real, complete, and a qualified human stands behind it. Quality over volume. Completeness over speed. Human judgment on top of machine scale. That’s what’s next for security.

Learn more about Sara AI Pentesting, or request a demo.
