What Mythos Means for Penetration Testing as a Service
When Anthropic announced the Claude Mythos Preview, the reaction from the security community was immediate. This isn’t just the next incrementally better model. Mythos is such a leap forward, and so capable at finding and exploiting vulnerabilities, that Anthropic deemed it too dangerous to release publicly. It autonomously discovered thousands of zero-day flaws, including bugs that survived decades of human review across every major operating system and browser. The UK’s AI Security Institute found it could complete expert-level hacking tasks 73% of the time; no AI model could complete those tasks at all before Mythos.
And Anthropic isn’t alone. Days after Mythos was unveiled, OpenAI launched GPT-5.4-Cyber, a variant of its flagship model optimized specifically for cybersecurity use cases. To illustrate the pace: OpenAI’s models went from scoring 27% on capture-the-flag security benchmarks to 76% in just three months. Two of the world’s leading AI labs are in a full sprint toward cyber-capable AI, and the competitive pressure between them is accelerating capabilities faster than most security teams can track.
This isn’t a trend. It’s an acceleration in access to offensive capability, and it changes what penetration testing as a service (PTaaS) must deliver.
Our GigaOm Webinar Predicted This Moment
Since hearing about these models, I’ve been thinking a lot about a webinar conversation I had a few weeks ago with Chris Ray of GigaOm, where we discussed their latest Radar for PTaaS. Our focus was on how AI is reshaping penetration testing and, specifically, how to spot the difference between genuine agentic capability and “marketing-badge” AI.
Looking back at that conversation, we weren’t being alarmists. If anything, we were underestimating the speed of the curve. Here are the three themes that feel especially urgent right now.
AI-Powered Adversaries Aren’t Coming…They’re Here
Even before these announcements, threat actors were already using LLMs to craft spear-phishing at scale and automate vulnerability discovery. Mythos and the broader frontier model race crystallize what “at scale” truly means: systems that autonomously reason about code, chain vulnerabilities, and develop working exploits without a human in the loop.
The clock on your untested attack surface is ticking faster than you think.
This is why continuous security validation—not point-in-time engagements—is no longer optional. If AI can discover zero-days that survived decades of human review, last October’s pentest report is already outdated.
The “AI-Washing” Problem Is Now a Security Risk
Not all AI is created equal, and the gap between vendors matters more than ever. During the webinar, Chris Ray put it bluntly:
“Put a vendor on the spot and say, describe to me the difference between your AI and a really well put together Python script. If they have a difficult time relaying how it’s different, that’s a pretty good indication that there’s a lot of fluff involved.”
That test cuts through almost every “AI-powered” claim on the market right now. Chris and I broke down three distinct categories of AI.
Three Tiers of AI in Penetration Testing Tools
- Pattern matching: The foundation of scanners like Tenable Nessus and OpenVAS. Mature and useful, but not intelligent.
- Generative AI: Useful for summarization, report drafting, and explaining findings in plain language. It can describe a vulnerability, but it can’t decide to look for one.
- True agentic capability: This is where Mythos and OpenAI’s cyber-specific models land. The AI reasons, plans, adapts, and chains findings without waiting for a human prompt.
Vendors still in categories one and two aren’t just behind on features. They’re leaving you exposed to adversaries operating in the third. The sketch below shows the gap Chris’s test is designed to expose.
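To make Chris’s “Python script” test concrete, here’s a minimal sketch of the difference between tier one and tier three. This is an illustration of the pattern, not any vendor’s actual implementation; the `execute`, `plan`, and `is_finding` callables are hypothetical stand-ins (in a real agentic system, `plan` would wrap an LLM-backed reasoning step).

```python
import re
from typing import Callable, Optional

# Tier one, the "well put together Python script": static signatures either
# match or they don't. Nothing here decides what to try next.
SIGNATURES = {
    "sql_error": re.compile(r"SQL syntax.*MySQL", re.IGNORECASE),
    "stack_trace": re.compile(r"Traceback \(most recent call last\)"),
}

def pattern_match_scan(response_body: str) -> list[str]:
    return [name for name, sig in SIGNATURES.items() if sig.search(response_body)]

# Tier three, the agentic loop: plan, act, observe, adapt. The planner sees
# the full history of what was tried and what came back, then chooses the
# next move or decides it is done. No human retry in the loop.
def agentic_probe(
    execute: Callable[[str], str],                           # runs one action against the target
    plan: Callable[[list[tuple[str, str]]], Optional[str]],  # hypothetical LLM-backed planner
    is_finding: Callable[[str], bool],                       # triage check on an observation
    max_steps: int = 10,
) -> list[str]:
    findings: list[str] = []
    history: list[tuple[str, str]] = []   # (action, observation) pairs carried forward
    for _ in range(max_steps):
        action = plan(history)            # adapt based on everything seen so far
        if action is None:
            break                         # the agent decides it is finished
        observation = execute(action)
        history.append((action, observation))
        if is_finding(observation):
            findings.append(observation)  # a confirmed finding feeds later planning too
    return findings
```

The script gives the same answer every run; the loop’s next step depends on the last result. That feedback edge is exactly what the four questions below are designed to surface.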
Four Questions to Ask Any Pentest as a Service Vendor
When evaluating a pentesting partner, don’t stop at “do you use AI?” To separate real agentic capability from rebranded automation, ask:
- Does your AI adapt when its first approach fails, or does it stop and wait for a human to retry?
- Does it learn within an engagement, carrying context from one finding into the next, or does every test start from zero?
- Can it chain across multiple applications and services the way a real attacker would, or is each scan scoped to a single asset?
- Does a vetted human researcher review and validate every finding before it reaches your team, or are you the QA layer for an automated scanner?
Human + AI: The Exoskeleton Model Still Holds
Neither Mythos nor GPT-5.4-Cyber makes human security researchers obsolete. Notably, OpenAI’s own strategy is grounded in pairing frontier models with vulnerability research experts, and Anthropic’s Project Glasswing follows the same principle. The leading AI labs are arriving at the same conclusion Synack reached more than a decade ago: AI handles breadth, humans handle depth, and the platform that connects them wins.
Even today’s top generative models hallucinate 10% of the time or more under ideal lab conditions. In offensive security, a 10% error rate isn’t a quirk. It’s the difference between a real critical and a wasted weekend chasing a phantom finding. That’s why human validation isn’t optional; it’s the layer that turns AI throughput into trustworthy output.
What Agentic AI Handles and What It Still Cannot
AI is transformative for the linear, high-volume work. Scanning, reconnaissance, configuration review, regression testing, and the first pass of triage are all problems agentic AI can move through at machine speed. That work used to consume the majority of a researcher’s hours. Now it doesn’t have to.
What AI still can’t do, and what 13 years of running offensive security at scale has taught us, is the non-linear work.
- Business logic flaws that only make sense if you understand how a customer actually makes money.
- Multi-step abuse chains that cross applications, identity providers, and third parties.
- The judgment call about whether a finding is technically valid but operationally meaningless, or technically minor but catastrophic in context.
That work requires a vetted human researcher, ideally one who has spent years specializing in your stack and your industry. The Synack Red Team is built precisely for that kind of work: a curated community of the world’s top ethical hackers, screened through one of the most rigorous vetting processes in the industry. When we pair them with the Synack Autonomous Red Agent (Sara), agentic AI doesn’t replace them. It amplifies them.
What This Means for the Future of PTaaS Platforms
Mythos is the most vivid argument yet for why continuous testing isn’t optional. If an AI can find vulnerabilities that survived decades of human review, the question isn’t whether last October’s pentest was thorough enough. It’s whether you’ll find the next vulnerability before your adversaries do.
The future of offensive security isn’t a faster PDF report. It’s continuous coverage, real-time findings, and AI-human collaboration at scale. That future is what we’re delivering through Sara Pentest, which deploys hundreds of specialized AI agents across reconnaissance, attack vectors, and vulnerability triage, working collaboratively in a multi-agent model. The result is machine-speed pentesting that doesn’t sacrifice the human expertise and quality control your security team depends on.
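To make the multi-agent model concrete, here’s a deliberately simplified sketch of the pattern: recon agents feed attack agents, attack agents feed triage, and a human validation gate sits at the end. This is a toy illustration of the general architecture, not Sara’s implementation; every function and field here is hypothetical, and each stubbed agent would wrap a model call in a real system.

```python
from dataclasses import dataclass

@dataclass
class Finding:
    asset: str
    detail: str
    validated_by_human: bool = False  # nothing ships until a researcher signs off

def recon_agent(scope: list[str]) -> list[str]:
    # Enumerate reachable attack surface within scope (stubbed for illustration).
    return [f"https://{host}/login" for host in scope]

def attack_agent(surface: str) -> list[Finding]:
    # Probe one piece of surface and emit candidate findings (stubbed).
    return [Finding(asset=surface, detail="possible auth bypass")]

def triage_agent(candidates: list[Finding]) -> list[Finding]:
    # First-pass triage: dedupe and rank before any human sees them (stubbed).
    return candidates

def human_validation(candidates: list[Finding]) -> list[Finding]:
    # The researcher layer: only validated findings reach the customer.
    for finding in candidates:
        finding.validated_by_human = True  # stand-in for real expert review
    return candidates

def run_engagement(scope: list[str]) -> list[Finding]:
    surfaces = recon_agent(scope)                       # breadth, at machine speed
    candidates = [f for s in surfaces for f in attack_agent(s)]
    return human_validation(triage_agent(candidates))   # depth, with human judgment

print(run_engagement(["app.example.com"]))
```

The shape matters more than the stubs: machine-speed agents generate and rank candidates, and the human gate is the last step before anything reaches a customer. That is the exoskeleton model in code.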
Watch the GigaOm PTaaS Radar Webinar
We covered all of this and more before this latest wave of AI announcements made it urgent. If you want to understand where the Agentic PTaaS market is actually heading, and what GigaOm’s Radar says about who’s doing it for real, the full conversation with Chris is worth your time.
We recorded it before Mythos, but I think you’ll find it’s more relevant now than the day we hit record.
Frequently Asked Questions
What is Claude Mythos?
Claude Mythos Preview is Anthropic’s latest AI model, so capable at finding and exploiting vulnerabilities that Anthropic restricted public access. It autonomously discovered thousands of zero-day flaws across every major OS and browser, and is being deployed through Project Glasswing with a limited group of trusted security partners.
What is OpenAI GPT-5.4-Cyber?
GPT-5.4-Cyber is a fine-tuned variant of OpenAI’s GPT-5.4 model designed for defensive cybersecurity work, with lowered refusal thresholds for legitimate security tasks and new capabilities like binary reverse engineering. It’s available through OpenAI’s Trusted Access for Cyber program to vetted security professionals and enterprise teams.
What is Agentic AI, and how is it different from regular AI in pentesting tools?
Most “AI” in security tools today is pattern matching or summarization. Agentic AI reasons, plans, and adapts in a loop: attempting an approach, analyzing the result, and pivoting without human intervention. That’s what makes it genuinely transformative for both attackers and defenders.
Does AI replace human security researchers in penetration testing?
No. AI excels at linear, high-volume tasks. Humans remain essential for business logic flaws, novel vulnerability chaining, and contextual judgment models can’t replicate. The best programs use AI for breadth, humans for depth.
What is PTaaS, and what makes a PTaaS platform agentic?
Penetration Testing as a Service (PTaaS) moved pentesting from a point-in-time engagement to a continuous platform-delivered model. Agentic PTaaS takes the next step: AI that can autonomously discover, validate, and prioritize findings—shifting from continuous access to testers to continuous testing of the attack surface.
How should my organization respond to frontier AI models like Mythos?
Assume your attack surface has vulnerabilities manual testing hasn’t caught. Prioritize continuous coverage over annual engagements, demand real-time findings over PDF reports, and ensure your defensive tooling is evolving at the same pace as offensive capabilities.