GPT-5.5: Mythos-Like Hacking, Open to All

71 points by rs_rs_rs_rs_rs 2 days ago|22 comments

•

JellyYelly 21 hours ago

They say it's Mythos-like without actually comparing it to Mythos (fair enough, it's not public), but the bar for a model to be Mythos-like has to be that it can produce as many novel, high-severity security vulns as those outlined in the Mythos red-team blog. I haven't seen any other lab produce a report like that yet. The proof is in the pudding.

•

cassianoleal 13 hours ago

> The proof is in the pudding.

Funny you say that, when the Mythos team have produced no proof either.

•

subscribed 11 hours ago

Not sure if reports like this count? https://www.theregister.com/2026/04/22/mozilla_firefox_mytho...

I don't have a strong opinion on that.

•

maplethorpe 13 hours ago

I believe they've stated that it would be too dangerous to release.

•

satvikpendem 7 hours ago

Just like OpenAI said GPT 2 was too dangerous to release?

There was just an article on this phenomenon today: https://news.ycombinator.com/item?id=47890235

•

maplethorpe 5 hours ago

They released a system card talking about how powerful it was. I don't think OpenAI did that with GPT 2.

•

satvikpendem 2 hours ago

I mean, that's just part of the marketing too. OpenAI would've absolutely added a system card; they just hadn't been invented back in the GPT 2 era.

•

cedws 9 hours ago

Open to all, except it’s not, because as soon as you try to use it for security purposes it will shut down and silently route you to a worse model. I was trying to use GPT 5.3 for reverse engineering and got an account warning.

•

WhiteDawn 22 hours ago

First you need to get through the safety net. I’ve had many productive GPT-5.4 sessions hit a roadblock of “ethicality” and pollute the context with multiple rounds of trying to convince it to continue.

•

nsingh2 23 hours ago

These plots are terrible. Why is categorical data connected across categories with lines? Why not just use bar plots?

Like in the "Web Vulns in OSS" plot, white box data for Opus 4.7 is not available, but the absurd linear interpolation across categories implies it should be near 60.

•

scottyah 23 hours ago

It's just an ad thinly disguised as useful data.

•

wmf 22 hours ago

I think the x axis is meant to be time but they screwed it up.

•

strange_quark 23 hours ago

Wasn't it already confirmed that small open-weight models were able to detect most of the same headline vulns as Mythos? How is this any different?

•

stanfordkid 22 hours ago

No, they are able to detect errors when pointed at them but they have a lot of false positives... making them functionally useless for a large unknown codebase. They also can't build and run an exploit post-identification. Mythos can find vulnerabilities (purportedly) and actually validate them by building and running exploits. This makes it functional and usable for hacking.

•

adrian_b 10 hours ago

The only significant difference between Mythos and the older open-weights models was that Mythos found all the bugs alone, while with the older models you had to run many of them to find all the bugs, because each model found only some of them.

For the open-weights models, we know the exact prompts that were used to find the bugs. While the prompts had to be rather specific, a good bug-finding harness should be able to generate such prompts automatically, i.e. by repeatedly running a model and asking it to find various classes of bugs.

For Mythos, we do not know what prompts were used, but Anthropic has admitted that the process was nothing like asking "find the bugs in this project". They also ran Mythos many times on each source file, starting with more generic prompts to identify whether a source file is likely to have bugs, and then following with more and more specific prompts, until eventually it became likely that a certain kind of bug exists, at which point Mythos was run one last time with a prompt asking it to confirm that the bug exists and possibly generate an exploit or patch.

So Mythos must also be pointed to an error. Using it naively will not provide any results like those reported.

There is no doubt that both Mythos and GPT 5.5 are superior to older models, because you can use a single model and hope to have adequate bug coverage. But the difference between them and older models has been exaggerated. If you run older models on your own hardware, you can afford to run many models many times on each file. A serious bug search with Mythos or GPT 5.5 is likely to be very expensive, while likely providing the same results in most cases.
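
To make the staged process above concrete, here is a minimal sketch of such a harness: a generic triage pass per file, then one probe per bug class, then a final confirmation pass, with findings merged across several models. Everything in it is an assumption for illustration: the `Model` callable, the `BUG_CLASSES` list, the prompt wording, and the function names are hypothetical, and none of it reflects the actual prompts used with Mythos or in the open-weights experiments.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Set

# Hypothetical interface: a "model" is any function from prompt text to reply text.
Model = Callable[[str], str]

# Illustrative bug classes only; a real harness would use a much longer list.
BUG_CLASSES = ["buffer overflow", "use-after-free", "integer overflow", "injection"]


@dataclass(frozen=True)
class Finding:
    path: str
    bug_class: str


def ask(model: Model, prompt: str) -> bool:
    """Send a YES/NO prompt and interpret the reply."""
    return model(prompt).strip().upper().startswith("YES")


def scan(models: List[Model], files: Dict[str, str]) -> Set[Finding]:
    """Run every model over every file in three stages and merge the findings."""
    findings: Set[Finding] = set()
    for path, source in files.items():
        for model in models:
            # Stage 1: generic prompt, is this file worth a closer look?
            if not ask(model, f"TRIAGE: could {path} contain security bugs? "
                              f"Answer YES or NO.\n{source}"):
                continue
            for bug_class in BUG_CLASSES:
                # Stage 2: specific prompt for one class of bug.
                if not ask(model, f"PROBE: does {path} contain a {bug_class} bug? "
                                  f"Answer YES or NO.\n{source}"):
                    continue
                # Stage 3: final pass asking for confirmation before reporting.
                if ask(model, f"CONFIRM: re-check the suspected {bug_class} in {path}. "
                              f"Answer YES only if it is real.\n{source}"):
                    findings.add(Finding(path, bug_class))
    return findings
```

Because the result is a set, running several weaker models simply unions whatever each one confirms, which is the "each model found only part of the bugs" situation. The cost is roughly models × files × bug classes calls, which is why running this on your own hardware matters.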

•

dlahoda 19 hours ago

i casually asked gemini and codex ($200 subs) to find and verify bugs for weeks. they wrote tests, injected mutations, verified fixes. just prompts.

also i had to proxy remote mainnet with localhost to force them to do penetration and DoS testing.

Mythos is nothing new.

•

nardons 22 hours ago

Do you have a source for this? Not doubting it, but I would like to have something concrete the next time the Mythos horse manure is cited.

•

skirmish 21 hours ago

Probably this: https://aisle.com/blog/ai-cybersecurity-after-mythos-the-jag...

•

WalterGR 15 hours ago

Discussion:

https://news.ycombinator.com/item?id=47732020

“Small models also found the vulnerabilities that Mythos found” (aisle.com)

1,283 points | 12 days ago | 360 comments

•

mertcikla 22 hours ago

why does this read like an openai ad?

•

kibibu 19 hours ago

> GPT-5.5 doesn’t just improve — it pulls away

I think it's also self-aggrandizing.

•

immanuwell 12 hours ago

Those miss-rate numbers are genuinely eye-opening - dropping from 40% to 10% in what sounds like a single generation is no joke - though it's worth taking any vendor-adjacent benchmark with a grain of salt until the broader security community kicks the tires