Again, and this is important:
A bug is a bug. A “potential vulnerability” is a bug. A vulnerability is verifiable as having security implications with a proof of concept or other substantial evidence.
Words matter. Bugs matter. It’s important to fix large amounts of bugs, just as it always has been, and has been done. Let that be impressive on its own, because it IS impressive.
Mythos didn’t write 271 PoC for vulnerabilities and demonstrate code path reachability with security implications. Mythos found 271 valid bugs. Let that be enough.
I was a bit confused by your definitions, but here's how Mozilla broke out [1] the 271, um, things:
> As additional context, we apply security severity ratings from critical to low to indicate the urgency of a bug:
> * sec-critical and sec-high are assigned to vulnerabilities that can be triggered with normal user behavior, like browsing to a web page. We make no technical difference between these, but sec-critical bugs are reserved for issues that are publicly disclosed or known to be exploited in the wild.
> * sec-moderate is assigned to vulnerabilities that would otherwise be rated sec-high but require unusual and complex steps from the victim.
> * sec-low is assigned to bugs that are annoying but far from causing user harm (e.g, a safe crash).
> Of the 271 bugs we announced for Firefox 150: 180 were sec-high, 80 were sec-moderate, and 11 were sec-low.
Mozilla uses the term "vulnerability" for even sec-high, even though they say right below that it doesn't mean the same thing as a practical exploit. And on their definitional page, they classify even sec-low as "vulnerabilities" [2].
Words are tools that get their utility from collective meaning. I'd be interested in where you received your semantics from and whether they match up or disagree with Mozilla's.
[1] https://hacks.mozilla.org/2026/05/behind-the-scenes-hardenin...
[2] https://wiki.mozilla.org/Security_Severity_Ratings/Client
Mythos did in fact write PoCs for all bugs that crash with a demonstration of memory-unsafe behavior (e.g. use-after-free, out-of-bounds reads/writes, etc.).
For us, that is substantial enough evidence to consider something a security vulnerability unless shown otherwise, and it has always been this way (for fuzzing bugs as well).
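For readers unfamiliar with what such a PoC looks like, here is a minimal hypothetical sketch (not one of the actual Firefox test cases): a heap use-after-free that AddressSanitizer flags deterministically once the program is built with -fsanitize=address.

    // Hypothetical minimal sketch of this kind of crashing PoC; not an
    // actual Firefox test case.
    // Build and run: clang++ -fsanitize=address -g uaf.cpp -o uaf && ./uaf
    #include <cstdio>

    int main() {
        int* buffer = new int[16];
        buffer[0] = 42;
        delete[] buffer;                 // the allocation is freed here...
        std::printf("%d\n", buffer[0]);  // ...and read afterwards: ASan aborts
                                         // with a heap-use-after-free report
        return 0;
    }

A reproducible report like that is about as unambiguous as evidence gets, which is presumably why it clears the bar described above.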
It may be worth noting that Claude can and will (if it believes you own the code, at least) produce PoC exploits for exploitable bugs that it finds.
My only source for this is personal experience, and no, I can't share any evidence of it.
This isn’t true anywhere people have to make decisions about what to work on first.
> Mythos didn’t write 271 PoC for vulnerabilities
I think the word you're looking for is exploit?
I dismissed the earlier non-technical blog post as shameless product boosterism for Anthropic. The linked hacks blog (which is a better source than this article) is a welcome release. It's hard to deny there's something real to this now, I think. Mozilla's internal definition of a "vulnerability" is also probably broader than many would intuit, but it is good that these issues are being taken seriously and fixed.
> The linked hacks blog
https://hacks.mozilla.org/2026/05/behind-the-scenes-hardenin...
At the same time, other companies like AISLE are matching Mythos on vulnerabilities using older models but their own harness: https://aisle.com/blog/aisle-matches-anthropic-mythos-on-fre...
So while Mythos certainly is real, I think you could do the same with Deepseek pro, GPT 5.5, etc.
Original source: https://news.ycombinator.com/item?id=48051079
It's better because it actually lists a sample of Bugzilla reports that were made public. This topic was discussed previously (36 comments two weeks ago: https://news.ycombinator.com/item?id=47885042), but the part about bug reports being made public is brand new.
I hope to see the day when (or if) the LLMs get so good at spotting and fixing bugs that all that’s left for the Firefox engineers to do is to focus on adding new features.
This isn’t sarcasm. Firefox deserves to be used more. Most people I know don’t use it because “Chrome does almost everything better”, and Firefox can’t compete with the other browsers’ roadmaps.
Chrome ain't better in any meaningful way for >99% of use cases. Heck, I am a dev and I've used FF with ublock origin only for the past... 10 years?
Same with my wife: after I explained things to her and she understood how different the internet experience can be, that's been her primary browser.
So please don't frame the argument as 'here is a crappy underdog, but please use it because monopoly is bad and Google is a bit evil'; it's a first-class experience in everything I have ever thrown at it. Triple that on mobile: by far the best and most useful mobile experience, bar none.
> Firefox deserves to be used more
Totally agree. I even go as far as choosing which website I make purchases on depending on whether it works in FF, or occasionally writing to support to tell them the site isn't supported or a feature isn't working properly and that a fix would be appreciated.
I know it pretty much always goes nowhere, but I feel it's what I can do to keep the browser on the radar somehow.
> Firefox deserves to be used more
Part of the problem is, when they stop working on fixing bugs, they start doing Mr Robot things... We just want a web browser. Nobody asked for Pocket, or AI...
If they use AI to fix all the bugs, then what else is for them to do, other than maintain syntax compatibility with the various languages they build with? They're just going to go back to making the browser trash again.
Browsers haven't needed new features in a very long time. Extensions were supposed to be the solution to that.
Wouldn't that quality and availability just allow Chrome to pull ahead of Firefox that much faster?
If Mozilla created some proprietary LLM or harness that they used internally to outpace Chrome that may be a different story, though I also don't see that happening.
Unfortunately we're probably still quite far from that. This was the best case for LLMs: the quality of their output didn't matter as long as it worked, and there was a near-perfect oracle for checking whether it worked.
That's a really good use case for LLMs. It also applies to things like finding proofs in Lean and creating test stimulus. In both cases you know automatically whether the output is good, and it doesn't really matter if it isn't.
That isn't the case for most bugs, and definitely isn't the case for actually fixing bugs.
They've only linked a few tickets, so maybe when we see all 271 actual distinct things this insight won't hold, but all those I examined ended up as some C++ code with a nasty bug in it.
Firefox is written in several languages, and only about 25% of it is C++, but every single one of these issues seems to touch the C++.
A general limitation of this approach is that it is only as good as your validator, and there's nothing easier to validate than a test case that creates, say, an AddressSanitizer use-after-free. For subtler issues, will we need more specific validators, or will the LLM get better at coming up with other dangerous conditions it can verify? We'll see.
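To make the validator limitation concrete, here is a hypothetical sketch of a bug class ASan stays silent on, since no invalid memory access ever occurs; UndefinedBehaviorSanitizer, a more specific validator, does report it at runtime.

    // Hypothetical sketch: no invalid memory access, so ASan reports nothing,
    // but UBSan flags the signed integer overflow at runtime.
    // Build and run: clang++ -fsanitize=undefined -g scale.cpp -o scale && ./scale
    #include <climits>
    #include <cstdio>

    int byte_length(int elements) {
        return elements * 4;  // undefined behavior once elements > INT_MAX / 4,
                              // yet memory-safe as far as ASan can observe
    }

    int main() {
        std::printf("%d\n", byte_length(INT_MAX / 2));
        return 0;
    }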
It's possible Mythos is a lot better at finding vulnerabilities in C++ code than it is for other languages. After all, these models are also based on pre-existing security analysis.
From what I can tell, a lot of these bugs were hardly C++-specific; they just happened to be in C++ code. Even the most secure Rust can't magically catch things like TOCTOU issues.
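For concreteness, here is a hypothetical TOCTOU shape (not taken from the Firefox reports) that would be just as racy in safe Rust, because the race is between the process and the filesystem rather than inside the program's memory.

    // Hypothetical TOCTOU sketch: the check and the use are two separate
    // filesystem operations, and an attacker can swap the path (e.g. for a
    // symlink to a sensitive file) in the window between them.
    #include <cstdio>
    #include <unistd.h>

    void append_entry(const char* path) {
        if (access(path, W_OK) == 0) {        // time of check: looks writable
            FILE* f = std::fopen(path, "a");  // time of use: opens whatever is
            if (f) {                          // at the path *now*
                std::fputs("entry\n", f);
                std::fclose(f);
            }
        }
    }

The usual fix is to drop the separate check and make the open itself fail safely (e.g. with O_NOFOLLOW), which is a design change no borrow checker hands you for free.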
It's because they verified the bugs using AddressSanitizer, so by construction it was only ever going to find C++ bugs.
Reading this article in the context of the Zig folks refusing to even consider LLM-generated bugs certainly shapes my perspective on what technologies will be in my toolchain.
Zig devs can run Mythos same as anyone else can. I think you are failing to understand the reasoning behind their decision. It's about contributors, not contributions.
https://kristoff.it/blog/contributor-poker-and-ai/
Both are right, and it depends on which model you use and who submits those bugs. The capabilities of leading models went from 99% noise to 99% valid bugs in essentially a few months. Some projects are flooded with the former and need to take precautions to avoid what are essentially DoS attacks on the maintainers.
When I was at PalmSource, I tried to get budget for Coverity or Fortify (static code analysis tools). "Too expensive," my management chain said. I spent another year putting together a lower-cost deal limited to scanning the network stack. "No, it's based on BSD and BSD is inherently secure," my management chain said. (Neither is true, btw.)
I eventually left and wound up at Mozilla where there were a number of /* flawfinder ignore */ comments scattered throughout the code.
My guess is that Mythos just ignored the "flawfinder ignore" directives and reported the known vulnerabilities in the code.
The code is open. If you can prove that's the case you'll have a real news story...
Even a glance at the bugs revealed in the blog post would quickly disprove your theory.
What are people's thoughts on how this could affect static analysis tools? I know they are very different beasts, but often they achieve the same goal. Static analysis tools can be slow, and they report lots of false positives.
I wonder if these models will get good + cheap enough so that people rarely reach for static analysis.
LLMs are much better at using tools than replacing tools. The tools are generally a lot faster than trying to achieve the same result with an LLM.
Using LLM coding tools to stay on top of static analysis tool output works very well, and adding some guard rails that enforce that there are no issues is probably a good idea, just like adding CI checks to make sure everything is clean.
As for false positives, it depends on the tool. I tend to avoid tools that generate mostly noise. Most of these tools allow you to disable rules if they produce a lot of noise. Or you can just tell the LLM to fix all the issues. When it's cheaper to fix things than to argue with the rule, just fix it. That used to be really expensive when you had to do that manually. Now it isn't.
I recently did this to an Ansible code base that I needed to refresh after not touching it for a few years. It had hundreds of ansible-lint issues, mostly deprecation warnings and some other non-fatal warnings. And 10 minutes later I had zero. Mostly they probably weren't very serious ones, but it's a form of technical debt. If you have to fix hundreds of warnings manually, you are probably not going to do it. But if you can wave a magic wand and it all goes away, why not? I adjusted the guard rails so it now always runs ansible-lint and fixes any issues. It only takes a few seconds extra.
One interesting possibility I've seen discussed[0] is enhancing the static analysis tools to find bug shapes that LLMs originally discover. There's no doubt static analysis tools are faster and cheaper, so they scale better to huge codebases and may allow generalizing the LLM's methods.
[0] https://lwn.net/Articles/1068968/
> What are people's thoughts on how this could affect static analysis tools?
I think these harnesses are _using_ static analysis tools, and probably will continue to do that.
I've been thinking about this. Static analysis tools can also be much faster and most are fully deterministic, so including them in CI can catch bugs or latent bugs before they have a chance to land.
I maintain a static analysis tool used in Firefox's CI. False positives have to be fixed or annotated as non-problems in order for you to land a patch in our tree. That means permitting zero positives (false or true), which is a strict threshold. This is a conscious tradeoff; it requires weakening the analysis and accepting some false negatives (missed bugs) in order to keep the signal-to-noise ratio high enough that people don't just ignore it and annotate everything away, or stop running it. Nearly all static analysis tools have to do this balancing act.
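As a hypothetical illustration of that annotate-as-non-problem workflow (shown with clang-tidy's generic NOLINT marker, not with the actual Firefox checker), the suppression comment records why the finding doesn't apply, so the tree stays at zero reported positives:

    // Hypothetical example of annotating a false positive; the syntax shown
    // is clang-tidy's NOLINT marker, not Firefox's actual checker.
    #include <cstddef>
    #include <cstring>

    void copy_header(char* dst, const char* src, std::size_t n) {
        // Not the unbounded copy the checker suspects: the only caller
        // validates n against the destination buffer's size.
        std::memcpy(dst, src, n);  // NOLINT
    }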
AI, as commonly used, is given more leeway. It's kind of fundamental that it must be allowed to hallucinate false positives; that's the source of much of its power. Which means you need layers of verification and validation on top of it. It can be slow, you'll never be able to say "it catches 100% of the errors of this particular form: ...", and yet it catches so much stuff.
Data point: my analysis didn't cover one case that I erroneously thought was unlikely to produce true positives (real bugs), and was more complex to implement than seemed worth the trouble. Opus or Mythos, I'm not sure which, started reporting vulnerabilities stemming from that case, so I scrambled and extended the analysis to cover the gap. It took me long enough to implement that by the time I had a full scan of the source tree, Claude had found every important problem that it reported. The static analysis found several others, and I still honestly don't know whether any of them could ever be triggered in practice.
I still think there's value in the static analysis. Some of those occurrences of the problematic pattern might be reachable now through paths too tricky for the AI to construct. Some of them might turn into real problems when other code changes. It seems worth having fixes for all of them now for both possibilities, and also for the lesser reason of not wanting the AI to waste time trying to exploit them. At the same time, clearly the cost/benefit balance has shifted.
They could also team up: if I relax my standards and allow my analysis to write an additional warnings report of suspected problems, with the clear expectation that they might be false alarms, then I could feed that list to an AI to validate them. Essentially, feed slop to the slop machine and have it nondeterministically filter out the diamonds in the rough.
Food for thought...
In the latest Mission Impossible, saving the world depends on recovering the original software of an escaped superhuman AGI from a sunken Russian submarine. Luther writes a "poison pill" that, given the original source, will instantly one-shot the AI. We were left to wonder how this magical code could have been written, but now we know: Luther just wrote a Mythos prompt that handed it the source code and asked for an immutable critical exploit.
> Anyone building software can start using a harness with a modern model to find bugs and harden their code today. We recommend getting started now.
From what I understand, that is a recipe for getting quickly banned by commercial LLM providers?
Curious if people think LLMs will lead to more secure or less secure software in five years.
It will probably wipe out a few categories of issues, which is probably a good thing. And those things that are still insecure can also be translated to some other language.
Translating things to Rust manually was already a thing before LLMs came into the picture. Now, with LLMs, that's only going to get easier and faster. The long-term value is going to come from getting on top of the mountain of technical debt in the form of existing C/C++ code bases that are responsible for the vast majority of memory exploits, buffer overflows, and other issues that, despite decades of attention, are still being found across major code bases on a regular basis.
Mozilla finding these issues comes on the back of a quarter century of some very competent engineers trying to do the right thing and using all the tools at their disposal to prevent these issues from happening. I have a lot of respect for that team and the contributions it has made over the years to improve tools, testing/verification practices, etc. The issue is not their effort or competence.
The job of taking an existing system that is well covered by tests, well documented/specified, etc. and producing a new one that can function as a drop-in replacement is now something that can be considered. A few years ago that would have translated into absolutely massive project cost and risk. Now it's something you can kick off on a Friday afternoon. Worst case it doesn't work; best case you end up with a much better implementation.
It's still early days. There are still a lot of quality issues with LLM generated code. But the success/fail rate will probably improve over time.
Both. The skilled will use them to find problems, the unskilled will use them to slopcode insecure software the skilled will have to fix.
Kinda like how home-improvement stores, power tools, easily available hardware, and YouTube tutorials led to both incredibly amazing, durable furniture and janky, ugly, even dangerous furniture.
More tools for more people equals more stuff being made, across a wider range of quality.
More secure software, but in the same way that the population is net healthier after a plague.
I’m just happy we’re talking about security.
That alone will make software safer.
That depends on which side has more money.
In 5 years attackers will have an advantage, but in the long run I think software will be more secure if developers use LLMs to find and fix all of the worst remotely exploitable bugs before release. LLMs are going to force devs to be much more security conscious.
More secure, at least in the cases where the tools are properly applied.
But it also creates more easily available opportunities for blackhats to exploit the projects where these tools were not being applied.
One of the biggest issues in security historically, imo, is vendors who think, "well, nobody will ever find this bug, so we can deprioritize fixing it." LLMs will prevent vendors from lying to themselves, which will lead to more secure software.
Less secure because of all the ways attacks can scale out and hackers can contribute vulnerabilities to active projects.
Really, the issue was not that Opus could not do all this; there was just no incentive to fix the bugs. Mythos represented a real marketing use case, so yes, thanks for spending money to fix this, but it is not sustainable.
> “That’s the key thing that has unlocked our ability to operate at the scale we’ve been operating at now,” he said. “It gives the engineer a crank they can pull that says: ‘Yep, this has the problem,’ and then you can iterate on the code and know clearly when you’ve fixed it and eventually land the test case in the tree such that you don’t regress it.”
I don't understand much of this paragraph:
* "a crank they can pull that says: ‘Yep, this has the problem,’": as in, ring an alarm? Does the LLM ring th alarm?
* "you can iterate on the code and know clearly when you’ve fixed it": Isn't that true of most bugs, assuming you do the normal thing and generate a test case? And I thought the LLM output test cases itself: "It will craft test cases. We have our existing fuzzing systems and tools to be able to run those tests" And are they claiming the LLM facilitates iterating?
* "and eventually land the test case in the tree": Don't you create the test case before the fix? And just a few words earlier they seemed to be working on the fix, not the test case. And see the prior point about test cases.
* "such that you don’t regress it.”: How is the LLM helping here?
Maybe I'm missing some fundamental unwritten assumption?
Let's see how this will improve daily SOC work. I still don't see what the big difference between Mythos and Opus is, security-wise. I'm confident that this kind of vulnerability detection is a long-term improvement. But does Mythos specifically make such a big difference compared to "normal" models? I would love to see what the actual difference is.
I'm curious how Mozilla did bug finding before Mythos. Did they use any non-AI bug-finding tools?
> We fixed a total of 423 security bugs in releases in April.
And any one or two of them could have led to a random web ad stealing your ssh keys and installing a keylogger to get into your bank account.
I often wonder if we're taking the right route with computer security. Would we be better off having a whole virtual machine for every web page, application or service? Or even physically separate hardware you just vnc into?
16-day-old story:
Wired: Mozilla Used Anthropic's Mythos to Find and Fix 271 Bugs in Firefox (41 points, 18 comments) https://news.ycombinator.com/item?id=47853649
Ars: Mozilla: Anthropic's Mythos found 271 security vulnerabilities in Firefox 150 (33 points, 8 comments) https://news.ycombinator.com/item?id=47855384
This is great, and it reflects some of the changes I've seen in the changelogs of Firefox and many others that have utilized Mythos. I'm closely watching a supposed data wall for AI models, and this is a clear indicator that AI capabilities can still become much more advanced even at this point in time. It makes me enthusiastic about future releases and optimizations. Thanks for sharing.
The flipside of this is that with AI and prompt injection attacks, you don't need a browser vulnerability to be pwned!
I'm having more problems with Firefox 150 than I have had with any other browser update in years. I think I might be the only one though?
Related:
The zero-days are numbered
https://news.ycombinator.com/item?id=47853277
I just hope they don't start ignoring human created bug reports, as there are still many that haven't been fixed for years.