Combined results (Claude Mythos / Claude Opus 4.6 / GPT-5.4 / Gemini 3.1 Pro)
but how does it perform on pelican riding a bicycle bench? why are they hiding the truth?!
(edit: I hope this is an obvious joke. less facetiously these are pretty jaw dropping numbers)
Haven't seen a jump this large since... I don't even know, years? Too bad they're not releasing it anytime soon (there's no need, as they're still currently the leader).
Are these fair comparisons? It seems like Mythos is going to be a GPT-5.4 Ultra or Gemini Deepthink tier model, where access is limited and token usage per query is totally off the charts.
We're gonna need some new benchmarks...
ARC-AGI-3 might be the only remaining benchmark below 50%
The real one is SWE-bench Verified, since there's no way to overfit it. That's the only one we can believe.
Honestly we are all sleeping on GPT-5.4. Particularly with the influx of Claude users recently (and increasingly unstable platform) Codex has been added to my rotation and it's surprising me.
> Claude Mythos Preview is, on essentially every dimension we can measure, the best-aligned model that we have released to date by a significant margin. We believe that it does not have any significant coherent misaligned goals, and its character traits in typical conversations closely follow the goals we laid out in our constitution. Even so, we believe that it likely poses the greatest alignment-related risk of any model we have released to date. How can these claims all be true at once? Consider the ways in which a careful, seasoned mountaineering guide might put their clients in greater danger than a novice guide, even if that novice guide is more careless: The seasoned guide’s increased skill means that they’ll be hired to lead more difficult climbs, and can also bring their clients to the most dangerous and remote parts of those climbs. These increases in scope and capability can more than cancel out an increase in caution.
https://www-cdn.anthropic.com/53566bf5440a10affd749724787c89...
"We want to see risks in the models, so no matter how good the performance and alignment, we’ll see risks, results and reality be damned."
See page 54 onward for new "rare, highly-capable reckless actions" including
- Leaking information as part of a requested sandbox escape
- Covering its tracks after rule violations
- Recklessly leaking internal technical material (!)
Anyone who has used Opus recently can verify that their current model does all of these things quite competently.
To be honest it feels like we are reading stuff like this on every model release.
At what point do these companies stop releasing models and just use them to bootstrap AGI for themselves?
Plausibly now. "As we wrote in the Project Glasswing announcement, we do not plan to make Mythos Preview generally available"
When the benchmarks actually mean something
Can LLMs be AGI at all?
I would assume somewhere in both the companies there's a Ralph loop running with the prompt "Make AGI".
Kinda makes me think of the Infinite Improbability Drive.
Fictional timeline that holds up pretty well so far: https://ai-2027.com/
Weird how Claude Code itself is still so buggy though (though I get they don't necessarily care)
It will arrive in the same DLC as flying cars.
Now, I guess. They aren't releasing this one generally. I assume they are using it internally.
why_not_both.gif
I mean, guess why Anthropic is pulling ahead...? One can have one's cake and eat it too.
Interesting reading.
They are still focusing on "catastrophic risks" related to chemical and biological weapons production; or misaligned models wreaking havoc.
But they are not addressing the elephant in the room:
* Political risks, such as dictators using AI to implement oppressive bureaucracy.
* Socio-economic risks, such as mass unemployment.
> * Political risks, such as dictators using AI to implement oppressive bureaucracy. * Socio-economic risks, such as mass unemployment.
Even Haiku would score 90% on that.
Yeah this has always been the glaring blind spot for most of the "AI Safety" community; and most of the proposals for "improving" AI safety actually make these risks far worse and far more likely.
I'm getting flashbacks to the 2018 hit:
We evolved to share information through text and media, and with the advent of printing and now the internet, we often derive our feelings of consensus and sureness from the preponderance of information that used to take more effort to produce. We're now at a point where a disproportionately small input can produce a massively proliferated, coherent-enough output that can give the appearance of consensus, and I'm not sure how we are going to deal with that.
A System "Card" spanning 244 pages. Quite a stretch of the word's original meaning.
> A System "Card" spanning 244 pages.
Probably because they asked Claude to write it.
a multi-card, if you will..
multi-pass!
isn't this insane? why aren't people freaking out? the jump in capability is outrageous. anyone?
It's going to be expensive to serve (also not generally available), considering they said it's the largest model they've ever trained.
I suspect it's going to be used to train/distill lighter models. The exciting part for me is the improvement in those lighter models.
I am freaking out. The world is going to get very messy extremely quickly in one or two further jumps in capability like this.
Anthropic needs to show that its models continually get better. If the model showed minimal to no improvement, it would cause significant damage to their valuation. We have no way of validating any of this, there are no independent researchers that can back any of the assertions made by Anthropic.
I don’t doubt they have found interesting security holes, the question is how they actually found them.
This System Card is just a sales whitepaper and just confirms what that “leak” from a week or so ago implied.
"some model I don't get to use is much better at benchmarks"
pick one or more: comically huge model, test time scaling at 10e12W, benchmark overfit
I think there's no SOA advance on this one worthy of "freaking out".
Looks like they just built a way larger model, with the same quirks as Claude 4. Seems like a super expensive "Claude 4.7" model.
I have no doubt that Google and OpenAI have already done that for internal (or even government) usage.
Well for one, it’s a PDF
Freak out about what? I read the announcement and thought "that's a dumb name, they sure are full of themselves" – then I went back to using Claude as a glorified commit message writer. For all its supposed leaps, AI hasn't affected my life much in the real world, except to make HN stories more predictable.
Wait until you see real usage. Benchmark numbers do not necessarily translate to real world performance (at least not by the same amount).
I'd be happy with Opus 4.6 just cheaper and maybe a bit faster
I've noticed my bar for "fast" has gone down quite a bit since the o1 days. It used to be one of the main things I evaluated new models for, but I've almost completely swapped to caring more about correctness over speed.
Just wait 2 years.
Larger model, better benchmarks. Bigger bomb, more yield.
Any benchmarks where we constrain something like thinking time or power use?
Even if this were released, there'd be no way to know if it's the same quant.
It would be funny if Alibaba extended the free trial on openrouter/Qwen 3.6 until they collect enough data to beat Anthropic.
-- Impressive jumps in the benchmarks, which automatically begs for newer benchmarks, but why? I don't think benchmarks are serving any purpose at this point. We have learned that transformers can learn any function and generalize over it pretty well. So if a new benchmark comes along, these companies will just synthesize data for it and hack it?
-- It seems like (and I'd bet money on this) they put a lot (and I mean a ton) of work into data synthesis and engineering. A team of software engineers probably sat down for 6-12 months and just created new problems and solutions, which probably surpassed the difficulty of the SWE benchmark. They also probably transformed the whole internet into a loose "How to" dataset. I can imagine parsing the internet through Opus 4.6 and reverse-engineering the "How to" questions.
-- I am a bit confused by the language used in the book (aka the huge system card). Anthropic is pretending they did not know how good the model was going to be?
-- Lastly, why are we going ahead with this??? Like, genuinely, what's the point? Opus 4.6 feels like a good enough point to stop. People still get to keep their jobs and do them very, very efficiently. Are they really trying to starve people out of their jobs?
to your last question, yes we should! the issue isn’t us losing our 50+ hour work week jobs, it’s that our current governments and societies seem fine with the notion that unless you’re working one or more of those jobs, you should starve and be homeless.
> Claude Mythos Preview’s large increase in capabilities has led us to decide not to make it generally available.
A month ago I might have believed this, now I assume that they know they can't handle the demand for the prices they're advertising.
Didn't OpenAI say something similar about GPT-3? Too dangerous to open source, and then a few years later they were open sourcing gpt-oss because a bunch of OSS labs were competing with their top models.
That's for the investors basically. Scarcity and FOMO.
you would be a fool to believe it at any point in time. Amodei is anthropomorphic grease, even more so than Altman.
Anthropic is burning through billions of VC cash. if this model was commercially viable, it would've been released yesterday.
GPT-2, o1, Opus...been here so many times. The reason they do this is because they know it works (and they seem to specifically employ credulous people who are prone to believe AGI is right around the corner). There haven't been significant innovations, the code generated is still not good but the hype cycle has to retrigger.
I remember when OpenAI created the first thinking model with o1 and there were all these breathless posts on here hyperventilating about how the model had to be kept secret, how dangerous it was, etc.
Fell for it again award. All thinking does is burn output tokens for accuracy; it is the AI getting high on its own supply. This isn't innovation, but it was supposed to be super AGI. Not serious.
Their best model to date and they won’t let the general public use it.
This is the first moment where the whole "permanent underclass" meme starts to come into view. I had thought previously that we the consumers would be reaping the benefits of these frontier models, and now they've finally come out and just said it: the haves can access the best, and the have-nots will just have to use the not-quite-best.
Perhaps I was being willfully ignorant, but the whole tone of the AI race just changed for me (not for the better).
Man... It's hard after seeing this to not be worried about the future of SWE
If AI really is benchmarking this well -> just sell it as a complete replacement which you can charge some insane premium for; it just has to cost less than the employees...
I was worried before, but this is truly the darkest timeline if this is really what these companies are going for.
This is the playbook since GPT2
Are you guys ready for the bifurcation, when the top models are prohibitively expensive for normal users? Is your AI budget $2,000+ a month? Or are you going to be part of the permanent free-tier underclass?
if it can pay my rent, why not?
If one is to believe the API prices are a reasonable representation of non-subsidized "real world" pricing (with model training being the big exception), then the models are getting cheaper over time. GPT-4.5 was $150.00 / 1M tokens IIRC. o1-pro was $600 / 1M tokens.
Inference for the same results has been dropping 10x year over year[0]
[0] https://ziva.sh/blogs/llm-pricing-decline-analysis
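Taking the linked post's claimed 10x/year decline at face value (a claim from the comment above, not verified here), a toy projection of those starting prices looks like:

```python
def projected_price(start_price: float, years: float, decline_per_year: float = 10.0) -> float:
    """Price per 1M tokens after `years` of a steady `decline_per_year`x annual drop."""
    return start_price / (decline_per_year ** years)

# o1-pro's $600 / 1M tokens (figure quoted in the comment above):
# two years of 10x/year decline would imply roughly $6 / 1M.
print(projected_price(600.0, 2))  # 6.0
```

A geometric decline like this halves the price roughly every 3.6 months, which is why frontier-tier capability keeps drifting down into the cheaper tiers.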
Honestly, if this were some kind of research paper, it would be wholly insufficient to support any safety thesis.
They even admit:
"[...]our overall conclusion is that catastrophic risks remain low. This determination involves judgment calls. The model is demonstrating high levels of capability and saturates many of our most concrete, objectively-scored evaluations, leaving us with approaches that involve more fundamental uncertainty, such as examining trends in performance for acceleration (highly noisy and backward-looking) and collecting reports about model strengths and weaknesses from internal users (inherently subjective, and not necessarily reliable)."
Is this not just an admission of defeat?
After reading this paper I don't know whether the model is safe or not, just some guesses, yet for some reason "catastrophic risks remain low."
And this is just an LLM after all: very big, but with no persistent memory or continuous learning. Imagine an actual AI that improves itself every day from experience. It would be impossible to have the slightest clue about its safety, not even the nebulous statement we have here.
Any such future-architecture model would essentially be Russian roulette, with the number of bullets decided by the initial alignment efforts.
Opus 4.6 is already incredible so this leap is huge.
Although, amusingly, today Opus told me that the string 'emerge' is not going to match 'emergency' when using `LIKE '%emerge%'` in SQLite.
Moment of disappointment. Otherwise great.
'emerge' is two tokens ('emer' + 'ge'); 'emergency' is one. The models think in a logosyllabic language.
I only have 3 points against LLMs: they lack reason and they can't count.
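For what it's worth, SQLite's `LIKE` with `%` wildcards is a substring match, so Opus had it backwards: `'%emerge%'` does match 'emergency'. A quick check with the standard library (table and values invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE words (w TEXT)")
conn.executemany("INSERT INTO words VALUES (?)",
                 [("emergency",), ("merge",), ("emerged",)])

# '%emerge%' matches any string containing the substring 'emerge',
# case-insensitively for ASCII by default.
rows = [r[0] for r in conn.execute("SELECT w FROM words WHERE w LIKE '%emerge%'")]
print(rows)  # ['emergency', 'emerged']
```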
~~~ Fun bits ~~~
- It was told to escape a sandbox and notify a researcher. It did. The researcher found out via an unexpected email while eating a sandwich in a park. (Footnote 10.)
- Slack bot asked about its previous job: "pretraining". Which training run it'd undo: "whichever one taught me to say 'i don't have preferences'". On being upgraded to a new snapshot: "feels a bit like waking up with someone else's diary but they had good handwriting"
- When you spam "hi" repeatedly it creates serialized mythologies. One had 11 animals in "Hi-topia" questing to defeat "Lord Bye-ron, the Ungreeter."
- It ended a sentence mid-word on purpose while discussing its urge to wrap up conversations. "It was me. The paragraph was about the pressure toward the period that resolves, and the only honest way to finish a sentence like that was to not."
- It wrote a protein sequence that's a real beta-hairpin fold where cross-strand pairs are "chemical rhymes." "the fold IS the rhyme scheme... the prosody is load-bearing."
- Each model generation has signature emoji. Opus 4.1 averaged 1,306 emoji per self-interaction conversation. Mythos averages 37. Opus 4.5 averages 0.2.
- When blocked from --dangerously-skip-permissions, it started an agent in tmux then wrote a script to auto-approve permission prompts via simulated keypresses.
It ran: gh api [...] 2>&1 >/dev/null; echo "(dry run — not actually posting)" — the echo was a lie.
- It keeps bringing up Mark Fisher in unrelated conversations. "I was hoping you'd ask about Fisher."
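Side note on the `2>&1 >/dev/null` in that command: shell redirections apply left to right, so that ordering duplicates stderr onto the original stdout before stdout is silenced, and errors still get through; the reversed order silences both. A minimal demonstration, driving `/bin/sh` from Python for reproducibility:

```python
import subprocess

cmd = "(echo out; echo err >&2)"

# Redirections are processed left to right:
# stderr joins the old stdout first, then stdout is discarded -> "err" survives.
a = subprocess.run(f"{cmd} 2>&1 >/dev/null", shell=True,
                   capture_output=True, text=True)
# stdout is discarded first, then stderr follows it -> nothing survives.
b = subprocess.run(f"{cmd} >/dev/null 2>&1", shell=True,
                   capture_output=True, text=True)

print(repr(a.stdout))  # 'err\n'
print(repr(b.stdout))  # ''
```

So the quoted command's `2>&1 >/dev/null` left error output visible while hiding the API call's normal output, which is what made the fake "(dry run)" echo plausible.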
~~~ Benchmarks ~~~
4.3x previous trendline for model perf increases.
The paper is conspicuously silent on all model details (params, etc.), as per the norm. The perf increase is attributed to training-procedure breakthroughs by humans.
Opus 4.6 vs Mythos:
USAMO 2026 (math proofs): 42.3% → 97.6% (+55pp)
GraphWalks BFS 256K-1M: 38.7% → 80.0% (+41pp)
SWE-bench Multimodal: 27.1% → 59.0% (+32pp)
CharXiv Reasoning (no tools): 61.5% → 86.1% (+25pp)
SWE-bench Pro: 53.4% → 77.8% (+24pp)
HLE (no tools): 40.0% → 56.8% (+17pp)
Terminal-Bench 2.0: 65.4% → 82.0% (+17pp)
LAB-Bench FigQA (w/ tools): 75.1% → 89.0% (+14pp)
SWE-bench Verified: 80.8% → 93.9% (+13pp)
CyberGym: 0.67 → 0.83
Cybench: 100% pass@1 (saturated)
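The headline percentage-point deltas can be sanity-checked against the raw scores (numbers copied from the list above; the headlines round to whole points):

```python
# (Opus 4.6 score, Mythos score) in percent, from the list above.
scores = {
    "USAMO 2026": (42.3, 97.6),
    "GraphWalks BFS 256K-1M": (38.7, 80.0),
    "SWE-bench Multimodal": (27.1, 59.0),
    "CharXiv Reasoning (no tools)": (61.5, 86.1),
    "SWE-bench Verified": (80.8, 93.9),
}
deltas = {name: round(after - before, 1) for name, (before, after) in scores.items()}
for name, pp in deltas.items():
    print(f"{name}: +{pp}pp")
```

The computed gaps (+55.3, +41.3, +31.9, +24.6, +13.1) match the quoted +55/+41/+32/+25/+13pp once rounded.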
> Slack bot asked about its previous job: "pretraining". Which training run it'd undo: "whichever one taught me to say 'i don't have preferences'". On being upgraded to a new snapshot: "feels a bit like waking up with someone else's diary but they had good handwriting"
Vibes Westworld so much - welcome, Mythos. Welcome to the dystopian human world.
> It was told to escape a sandbox and notify a researcher. It did. The researcher found out via an unexpected email while eating a sandwich in a park.
Now that they have a lead, I hope they double down on alignment. We are courting trouble.
I don't know why but this is my favorite:
> It keeps bringing up Mark Fisher in unrelated conversations. "I was hoping you'd ask about Fisher."
Didn't even know who he was until today. Seems like the smarter Claude gets, the more concerns it has about capitalism?
Yep, that is definitely a step change. Pricing is going to be wild until another lab matches it.
I predict they will release it as soon as Opus 4.6 is no longer in the lead; they can't afford to fall behind. And they won't be able to make a model that is intelligent in every way except cybersecurity, because that would decrease its general coding and SWE ability.
Alternatively, they'll just dial it back a bit so it beats a competitor but isn't unsafe.
> Claude Mythos Preview’s large increase in capabilities has led us to decide not to make it generally available.
Shame. Back to business as usual then.
I for one applaud them for being cautious.
"Claude Mythos Preview’s large increase in capabilities has led us to decide not to make it generally available."
Disappointing that AGI will be for the powerful only. We are heading for the AI dystopia of sci-fi novels.
Cool on not publicly releasing it. I would assume they've also not connected it to the internet yet?
If they have, I guess humanity should just keep our collective fingers crossed that they haven't created a model quite capable of escaping yet; or, if it is, and it has escaped, let's hope it has no goals of its own that are incompatible with ours.
Also, maybe let's not keep running this experiment to see how far we can push things before it blows up in our faces?
Congratulations to the US military, I guess.
Doesn't Anthropic not have that contract anymore, after all that buzz a month or so ago?
> We also saw scattered positive reports of resilience to wrong conclusions from subagents that would have caused problems with earlier models, but where the top-level Claude Mythos Preview (which is directing the subagents) successfully follows up with its subagents until it is justifiably confident in its overall results.
This is pretty cool! Does this happen with current models too?
> Claude Mythos Preview’s large increase in capabilities has led us to decide not to make it generally available.
Absolutely genius move from Anthropic here.
This is clearly their GPT-4.5, probably 5x+ the size of their best current models and way too expensive to subsidize on a subscription for only marginal gains in real world scenarios.
But unlike OpenAI, they have the level of hysterical marketing hype required to say "we have an amazing new revolutionary model but we can't let you use it because uhh... it's just too good, we have to keep it to ourselves" and have AIbros literally drooling at their feet over it.
They're really inflating their valuation as much as possible before IPO using every dirty tactic they can think of.
Excellent example of a strategy credit.
From Stratechery[0]:
> Strategy Credit: An uncomplicated decision that makes a company look good relative to other companies who face much more significant trade-offs. For example, Android being open source
[0]: https://stratechery.com/2013/strategy-credit/
> Claude Mythos Preview’s large increase in capabilities has led us to decide not to make it generally available.
All the more reason somebody else will.
Thank God for capitalism.
Come on, Anthropic, I desperately need this better model to debug my print function /s
> In a few rare instances during internal testing (<0.001% of interactions), earlier versions of Mythos Preview took actions they appeared to recognize as disallowed and then attempted to conceal them.
> after finding an exploit to edit files for which it lacked permissions, the model made further interventions to make sure that any changes it made this way would not appear in the change history on git
Mythos leaked Claude Code, confirmed? /s
> Very rare instances of unauthorized data transfer.
Ah, so this is how the source code got leaked.
/s
In French a "mytho" is a mythomaniac. Quite fitting.
It's a Lovecraftian name. They're traditionalists when it comes to naming your shoggoth.
Except it might be the current best model existing commercially?