Because you're better able to take over the codebase a local model wrote than one Claude wrote? The original question was about taking over an LLM-written codebase, it doesn't sound to me like the argument was about a codebase that Claude, specifically, wrote.
The number one power move I have is Nix integration. The availability of tooling, secrets, environment and the ability for the agent to modify its own environment is... well, I don't know how people live without it. I guess you guys still install things using commands and hope everything you need is present on the next machine? Developer machine, CI environment, deployment environment: They're all derived from a single source, and compiling and running always works on every machine.
In Claude I use /branch and /rename a lot (context checkpoints, fork, go back)
I use sandboxing almost exclusively: https://github.com/nix-tools/bubblebox -- it's a generalisation of Numtide's claudebox with a few fixes and some feature additions (more coming). This is best compared to always running your Claude in Docker containers, except there's no Docker runtime. Works fine in WSL and nix-darwin, too.
I also prefer giving it a VPS over a Docker container.
On my own machine I just give it a Linux User Namespace, i.e. soft virtualisation via "bubblewrap."
What Docker Compose and Linux User Namespaces provide that a VPS doesn't: You can easily mount extra directories from your developer host machine in read or read+write mode. With the VPS you (most likely) need it to clone all of your resources separately, which requires SSH keys, and now you're slowly building towards an independent agentic environment, which is definitely very nice, but time-consuming, compared to piggybacking on your developer environment. Definitely the direction I'm going.
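The bubblewrap approach can be sketched as a shell function. This is a minimal sketch, not bubblebox or claudebox. It makes several assumptions:

- an FHS-style distro with a merged /usr (NixOS and macOS need different mounts);
- the agent binary lives under /usr (if it is installed elsewhere, e.g. under ~/.local, bind that path too);
- the agent keeps its state in `~/.claude`, which must already exist.

`sandbox_agent`, `SANDBOX_RO_DIR` and `DRY_RUN` are made-up names. The sketch shows the point made above: the project directory is mounted read+write, while an extra host directory can be exposed read-only.

```bash
#!/usr/bin/env bash
# Hypothetical helper: run an agent CLI in a bubblewrap user-namespace sandbox.
# Assumes merged /usr, agent binary under /usr, agent state in ~/.claude.
# DRY_RUN=1 prints the bwrap argv (one word per line) instead of running it.
sandbox_agent() {
  local project=$PWD
  local -a args=(
    --ro-bind /usr /usr                      # system binaries/libs, read-only
    --symlink usr/bin /bin
    --symlink usr/lib /lib
    --symlink usr/lib64 /lib64
    --ro-bind /etc /etc                      # DNS, TLS certs, user lookups
    --proc /proc
    --dev /dev
    --tmpfs /tmp
    --bind "$HOME/.claude" "$HOME/.claude"   # agent state, read+write
    --bind "$project" "$project"             # the repo it may edit
    --chdir "$project"
    --unshare-all --share-net                # fresh namespaces, keep network
    --die-with-parent
  )
  # Optionally expose one more host directory read-only (e.g. reference docs).
  if [[ -n "${SANDBOX_RO_DIR:-}" ]]; then
    args+=(--ro-bind "$SANDBOX_RO_DIR" "$SANDBOX_RO_DIR")
  fi
  if [[ "${DRY_RUN:-0}" == 1 ]]; then
    printf '%s\n' bwrap "${args[@]}" "$@"
  else
    bwrap "${args[@]}" "$@"
  fi
}
```

Usage would be something like `sandbox_agent claude` from the repo root. Keeping the mounts explicit in one array makes it easy to see, and review, exactly what the agent can read and write.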
nix develop ensures your dev env is the same as your build/test/prod env. At least with Python everything is a flurry of requirements.txt, Python versions, poetry, pyproject.toml, perhaps automated with direnvs, a hefty Dockerfile/docker-compose, and perhaps conda (ugh) along the way; lots of moving parts.
I have a project that's mostly Rust sprinkled with C++ libs and Python helpers and it's easier to manage than the average virtualenv. Everything builds with nix build, everything runs with nix run, profiler/debugger works, IDE detects everything on any of my computers, builds and links with CUDA on x86, aarch64, NixOS, MacOS, Ubuntu or Amazon Linux. nix build can even build a Docker image for the odd need of Docker, and I haven't tried but I'm convinced that if I import the flake on my nix-config it will be built into the SD card for my Raspberry Pi just fine.
It's even replaced Ansible for me, colmena all the way.
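For the `nix develop` workflow described above, one low-effort way to make the agent use the pinned environment is to route its commands through a small wrapper. This is a sketch, assuming a flake-enabled Nix install and a `flake.nix` with a default dev shell at the repo root. `in_devshell` is a hypothetical helper name, not part of Nix.

```bash
#!/usr/bin/env bash
# Hypothetical helper: run a command inside the repo's Nix dev shell so the
# agent, CI and every developer machine use the same toolchain.
in_devshell() {
  if [[ -f flake.nix ]]; then
    if ! command -v nix >/dev/null 2>&1; then
      echo "in_devshell: flake.nix found but nix is not installed" >&2
      return 127
    fi
    # Enter the flake's default devShell, then run the given command in it.
    nix develop --command "$@"
  else
    # No flake in this directory: run the command as-is.
    "$@"
  fi
}
```

The agent's instructions can then say "run `in_devshell bun run test`" rather than relying on whatever happens to be installed. Failing loudly when a flake exists but Nix does not is a deliberate choice, so an unpinned toolchain is never used silently.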
+100. I also dig fnox (encrypted-secrets-in-git) and hk (pre-hooks manager that is actually fast and stays out of the way) by the same author, pretty much default for any project I start nowadays.
The reliance on context to drive correct actions just doesn't work well. I am constantly wrestling with AI agents that do not do what you tell them. Every AI agent out there seems to suck in this regard, leaving it up to the user to build in their own guardrails. I have a bad feeling that nobody is working on an improved solution.
I’ve seen no reason to believe it’s even possible to solve this.
The worst thing about LLMs is they can pass the Turing test, leading people to believe they have an Asimov-style robot instead of a very cool statistical model. It feels like they should be able to follow instructions or keep instructions separate from content, but that’s not what’s happening.
```bash
bun run test -- -t "test name" # Single suite
bun run test:file -- "glob" # Specific files
# 4. Lint before committing
bun run lint:file -- "file1.ts"
bun run lint
# 5. Before creating PR
bun run lint:claude && bun run test
```
I have these things in a pre-commit hook; this way the targets always run and the agent is forced to fix them (I ask Claude to commit changes). The agents are erratic and very often skip these steps. Anything that can be deterministic, I keep as scripts.
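Such a hook can be sketched as follows. This assumes Bash and the bun scripts quoted above, and `run_checks` is a made-up helper name. Each check runs in order, and the first failure aborts the commit, so the agent has to fix it before it can commit.

```bash
#!/usr/bin/env bash
# Hypothetical .git/hooks/pre-commit: make the lint/test steps mandatory
# instead of trusting the agent to remember them.
run_checks() {
  local check
  for check in "$@"; do
    echo "pre-commit: running '$check'" >&2
    if ! bash -c "$check"; then
      echo "pre-commit: '$check' failed; commit aborted" >&2
      return 1
    fi
  done
}

# Only run the checks when git invokes this file as the pre-commit hook.
if [[ "$(basename "$0")" == pre-commit ]]; then
  run_checks "bun run lint" "bun run test" || exit 1
fi
```

Save it as `.git/hooks/pre-commit` and make it executable. The agent can still skip hooks with `git commit --no-verify`, so this is a guardrail rather than containment.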
Regarding commits: both Codex and Claude are terrible at writing commit messages. I have this in my user CLAUDE.md:
````
Pattern: `type(scope): message` where type is `fix`, `feat`, `chore`,
`docs`, `refactor`, or `style`; scope marks what is affected; message is a
short lowercased description.
Keep subject and body lines under 72 characters. Always write a body
explaining what, how, and why in continuous human-readable text. For fixes
include the error message being fixed. No first-person speech. Re-read the
actual git diff before writing: the message must describe what changed,
not what was planned.
Use the following command to create a commit:
```bash
git commit -F - <<'EOF'
type(scope): subject line

Body paragraph explaining what, how, and why.
EOF
```
````
Without it, Claude would write the body as a single long unwrapped line; when asked to fix the line lengths, it would just insert literal \n sequences, which were not interpreted as newlines and were instead rendered as characters.
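The mechanically checkable parts of those rules could also live in a `commit-msg` hook, in the same spirit as keeping deterministic steps in scripts. This is a minimal sketch, assuming Bash; git passes the path of the message file as the hook's first argument. `check_commit_msg` is a made-up name. It only checks three things:

- the subject pattern;
- a blank line followed by a required body;
- the under-72-character line limit.

The rest (content of the body, no first person) still depends on the instructions.

```bash
#!/usr/bin/env bash
# Hypothetical .git/hooks/commit-msg: enforce the checkable commit rules.
check_commit_msg() {
  local file=$1 subject
  subject=$(head -n 1 "$file")

  # type(scope): message, with a lowercase (or digit) first character.
  local re='^(fix|feat|chore|docs|refactor|style)\([[:lower:][:digit:]._/-]+\): [[:lower:][:digit:]]'
  if ! [[ $subject =~ $re ]]; then
    echo "commit-msg: subject must look like 'type(scope): message'" >&2
    return 1
  fi

  # Every line must be under 72 characters.
  if awk 'length($0) >= 72 { bad = 1 } END { exit (bad ? 0 : 1) }' "$file"; then
    echo "commit-msg: keep lines under 72 characters" >&2
    return 1
  fi

  # Blank second line, then a non-empty body.
  if [[ -n "$(sed -n '2p' "$file")" || -z "$(sed -n '3p' "$file")" ]]; then
    echo "commit-msg: add a blank line and a body after the subject" >&2
    return 1
  fi
}

# Only enforce when git runs this file as the commit-msg hook.
if [[ "$(basename "$0")" == commit-msg ]]; then
  check_commit_msg "$1" || exit 1
fi
```

Because the hook rejects the commit with a specific message, the agent gets concrete feedback to fix instead of a rule it may or may not have read.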
Another thing I find helpful is a VOCABULARY.md. Very often the agent would assume (connect?) a different thing than what I had in mind; with VOCABULARY.md I make sure that when I say "thing", Claude and I have the same "understanding" (connection?) of what "thing" is.
I mean at this point, you should just write a few deterministic orchestration scripts to automate away the boring parts and write the code yourself. Why are we wasting our time on making the wonder shit-machine work?
To understand a solution you must first understand the problem. If your whole company calls its customers "clients" but Claude finds that confusing, I think it's probably easier to tell Claude that than to get everyone in the company to change how they talk.
This was very difficult to read.
We really need to snap out of letting LLMs write posts. Even if there is some added value in this post, the feeling of chewing sand is just distracting and unnecessary.
In recent weeks, I think the harness/model has reached a point where you can just ask it to do stuff and it just does it. You can use plan mode, you can also use superpowers, or whatever other skill, but given that you'll review something anyway, why not work directly with code instead of silly amounts of md files?
I like having a spec file that is used to generate the code. It's denser, and it's easier to understand what the application is supposed to do. Prior to AI agents, I had a more complex relationship with requirements because not all devs updated them. I was often unsure whether the spec or the code reflected the correct behavior for any given aspect of the application.
If you aren't using AI for code review in 2026, why would you even bother? High quality, error-free, better-than-human code generation AND review is available for cheaper than ever. Why are you wasting your life reading code you didn't even write?
Because it might not have done what I wanted it to do. Also, just as with normal code review, I’m not just looking at the code but the final product. Maybe I realize after that I asked it to do something that was wrong?
In recent weeks I trust Claude less and less. Yes, you can ask it to do stuff and it does stuff. But if you do look at what it did, you will often find corners cut, work based on assumptions rather than verification, and a lot of stuff missed.
Even tests - it is common for it to write tests which in reality test nothing.
I’m getting into the agentic coding (I know, late to the party, and that’s been a good spot for my experience and use case), so I’m reading with interest. The first tip: “give Claude a way to verify its own work”.
So what’s the recommendation for Claude to have a feedback loop?
Because it’s not what follows in the article: _“Explore, then plan, then code.”, “Use plan mode…”, “Reference, do not describe.”_
In my experience, the biggest benefit comes from having good quality integration and unit tests that are easy for the agent to run on its own to verify its work against.
Best Claude Code daily-driver guide I’ve read. Though I’ve only read two. The “let Claude write rules for itself” CLAUDE.md pattern is the highest-ROI habit in there. But here’s the thing. The underlying assumption is that this works when Claude mostly follows CLAUDE.md. Anthropic’s own engineering post from May 25 (https://www.anthropic.com/engineering/how-we-contain-claude) reports their telemetry shows ~93% of permission prompts get clicked through and ~17% of dangerous actions slip past the auto-mode filter.
Their conclusion: environment-layer containment first, then model-layer steering.
CLAUDE.md is the right configuration layer but it is not a containment layer. Worth thinking about whether your worst case is a lost afternoon or a lost database and all backups deleted, too: https://safebots.ai/compromise.html
But the more important point is cost. People are starting to realize just how costly it can be to run agents without precomputing and caching: https://safebots.ai/costs.html and self-orchestrating agents can cost up to 1000x as much: https://safebots.ai/kimi.html
How much time do you lose when doing things like "verify plan with a second clean agent" instead of just reading and fixing it yourself in 5 min? How much understanding do you lose? How do you manage to treat it "as an engineer" where it's clearly not there yet? How much time do you lose when it makes almost the same mistake, invents stuff or tries to gaslight you over and over? What about blood pressure?
I only use Opus 4.7 and am on the $100/mo plan. I usually make sure the context does not grow beyond 30-40% of the 1M-token window. On heavy coding days where I do something pretty similar to this, I would occasionally run into the five-hour limit, but that happens about once per week, and then it wouldn't take too long to reset. Note that I use caveman, but I'm not sure to what extent that really helps.
The post gets to the point. Somehow this must be buried in Anthropic's documentation, but I miss this kind of back-to-basics post. Even if it's LLM-penned.
The author’s claim that Claude is a multiplier for skill is probably true for now, but it also feels like cope inspired by usability issues with Claude. The advice is all good, but none of it is especially clever or impressive or hard to grasp. The multiplier just comes from the fact that Anthropic hasn’t taken this essay and several similar ones and incorporated their feedback into the product. This is a pretty shallow moat of expertise that Anthropic ought to be able to automate in a week.
My complaint with anthropic is actually the opposite. They seem too focused on building this suite of products (because they want lock-in), but they can’t even get the availability and speed on their models in an acceptable state. It really does seem like they’re falling behind google and OpenAI at the moment.
I agree, I find that just telling claude to use the CLIs I would have used anyway in the prompt works just fine. Use gh to do X, use az to do Y, build using Z. The harness handles the rest. All these MCPs, Skills, plugins, etc are just noise
To me, this kind of talk exhibits the very cultish and con side of the whole genAI train. In a way, it backfires, especially when the intent is positive about the technology: it casts the technology in a bad light.
Generally, and more so with paid products, one should expect to get something that is ready to be used, tuned by whoever is selling it to the best of their ability. Instead, this is basically saying that the product is actually not much more than an empty box, and that it is your responsibility to augment it with third-party plugins and markdown texts that make it finally useful. And you better be carefully selecting the skills you install, you don't want to end up with second tier material made by GithubInfluencerA, you definitely need the work of GithubInfluencerB.
In the end, it's what is giving companies fuel to keep the hype running, because it allows them to counter every possible argument or doubt about the technology, especially the ones made in good faith. No matter the problem you're facing, the blame is definitely on you, the user, for not setting up the tool in the right way.
I'm struggling in a lot of ways with accepting LLMs, but if I ever become completely sold on them and take this technology seriously, it won't be before this mood has gone away.
I see this kind of first-gen coding agent a bit like the AI-era Microsoft Excel: you need to be a power user to use it correctly, otherwise you'll end up failing catastrophically. Hence the number of different ways to use it.
Having an "unfinished" product is also a great marketing tool for companies like anthropic: each skill/plugin/guide that you see on the internet is boosting their SEO + social validation metrics.
I understand and sympathize with this point of view.
I would just say this: there is a difference between advice for using a product, and for _optimizing_ your use of a product. Between a user and a power user.
I think devs probably disproportionately like to see themselves as power users of any given tool, and thus with coding agents, there are 1000 "systems" being thrown out on GitHub on any given day. Generally speaking, it is safe to avoid these, especially if you're new to the tool.
But saying the fact that people are into optimizing their setups indicates some fundamental deficiency of the tool misses the point, I think.
Claude Code and Codex CLI (and OpenCode, and I'm sure many others) are _remarkably_ effective right out of the box. The teams behind these tools must make them _generically_ useful so that they are accessible to as many people, and as many use cases, as possible. That is part of why, when you become familiar with the tool, there is typically going to be a level of customization you can apply to it to optimize it for _your_ use cases, beyond the generic out of the box configuration.
Similarly, I don't think it would be fair to critique VS Code simply because most power users augment it with a suite of extensions. In fact, its customizability/extensibility is part of what makes it great.
I absolutely understand the power user perspective. The point is not that, and maybe I wasn't clear enough in pointing it out.
Here, something different is going on instead of the usual "base tool is ok for 90% of use cases, remaining 10% is covered by plugins and extensions". A lot of developers are finding it difficult to commit to agentic coding workflows, feeling stretched in a lot of different ways.
Companies, with the help of a very prominent and vocal part of the web and social media community, are addressing every issue by simply blaming the users, saying it's their fault if they're not keeping up with all the alleged advancements in prompt strategies. See the whole "maybe you haven't tried it in the last two months, everything's changed now". While it's true that things have been moving very fast, the fundamental idea behind the technology is the same, and some concerns about it simply cannot be wiped away by scaling some factors.
> To me, this kind of talk exhibits the very cultish and con side of the whole genAI train ... Generally, and more so with paid products, one should expect to get something that is ready to be used
Right like I bought an AWS EC2 m6a.metal instance expecting to get something that is ready to be used. Now being told to recite arcane "commands" from the cloud computing holy book. They claim their supposedly groundbreaking hypertext protocol isn't even accessible to mere mortals using a $6000/month EC2, the blame is definitely on you, the user, for not setting up the tool in the right way.
This sysadmin cloud cult is basically saying that the EC2 product is actually not much more than an empty box, and that it is your responsibility to augment it with third-party servers and interpreters and application source texts that make it finally useful. And you better be carefully selecting the tools you install.
Oh great! Another AI slop article about "working" with AI (= working for AI). Do you notice how much bloody work you put in the boring parts, only to leave out the most creative aspect of software engineering to a slot-machine?
> Delegate, do not pair-program. Cat Wu (Claude Code team): “The model performs best if you treat it like an engineer you’re delegating to, not a pair programmer you’re guiding line by line.” Write a crisp brief upfront, then let it run.
This is also how you get a slop codebase that you won’t easily understand.
It becomes a labyrinth that only the Agent knows.
It’s not a catastrophe when you’re making prototypes or projects like the ones you see on X.
But if you are expanding your codebase or trying to build something more professional and maintainable, I find it important to explicitly spec things bit by bit so I can understand them and somewhat keep my writing style in this codebase.
But this is only productive when you have a fast model; otherwise it kills your train of thought while you wait for the output.
If the model is slow, delegation is probably the only way.
I agree. In fact, computers in general are for lazy cretins who can't use a pen and paper. We got man into space calculating with a pen and paper, if it was good enough then, it is good enough now. I like your concept, it should go further, cars are for people too lazy to walk. Planes are for people too lazy to flap their arms. Video cameras are for people too lazy to draw each frame by hand in real time then play them in a hand cranked projector.
How many times can I read the same shallow guidance written by AI on using a coding agent? Good god when will it stop
Can't wait to learn more about how to vendor-lock-in myself really hard into not being able to code without the help of a specific corporation!
You took the time to write out this comment. To the benefit of those who read it, please expand upon where the article is shallow and what content you miss.
You’re absolutely right. Not many people see this sort of thing. They’re waiting for the load-bearing parts of the article to land.
/s
I've been using Claude to work on a medium-sized (100+kLoc) codebase, and it's a great productivity multiplier. Putting hours into creating a good AGENTS file has improved results a lot. I find that over time it picks up the codebase quite well. Tedious tasks that would take a day are now a matter of a few prompts.
Still... I'm not ready to give it more autonomy. Even though it handles high-level things quite well, I still look at the code, give feedback, and do 3-4 rounds of tweaks until I'm happy with it, and also happy that I still feel I have a good handle on the codebase.
Try to codify those 3-4 rounds of tweaks into a set of rules to put into your AGENTS file. Instead of iterating, have it start over from the AGENTS file and see if it's correct now.
Ngl, that’s gold right here. I’ve been trying to automate my sessions, and what I’ve found cool is that you can ask Claude about how to improve on how to ask Claude things, and from there ask Claude to iterate on your session cycles
Understandable. You don’t want to lose control of your codebase, and you don’t trust that the LLM is competent enough to handle it fully.
No. Because they still hallucinate at times. Confuse things. Forget things. Or none of the above, as it is anthropomorphizing, but the result is the same. They can make incredible working one shots, you start to trust them, then you trust too much and .. feel the result.
Yes. I am fighting with the LLM's disobedience when working through my pipeline commands. I believe these violations are caused by its hallucinations. So I am still developing a mechanical system to monitor agents’ behavior automatically. I believe these routines and monitors will act as scaffolding that keeps the LLM on the right path at all times.
I'm stuck on the usage "mulle times a week" which shows up twice in the context of the Claude team editing or contributing to a CLAUDE.md file. Is this an AI-generated artifact?
What happens when you have a codebase made with claude using this setup and claude is down for let's say 8 hours? Are you able to efficiently, smoothly and productively take over the codebase?
You could say the same thing about any always online software suite and it would be equally fair as we move into more agentic development workflows.
For example: sure, you could go back to the old ways of using a drafting table for your engineering work if CAD went down, but it would be exponentially slower…
Personally with my workflow I spend 30-60 minutes per Claude feature spec doc when I’m pair planning. If Claude goes down I would just prepare spec docs on my own until it came back online and then rapidly review them before calling the coding workflow.
> You could say the same thing about any always online software suite
Precisely. Every online-only solution is a huge risk I personally do not want to take; I've always done my best to use offline-only tools.
That may restrict me from the latest and greatest, but I prefer not to be left at the mercy of any corpo.
> You could say the same thing about any always online software suite
But this is the reason "serious shops" do not use always online software and tools in critical parts of the SDLC. There is a difference between influencers/people on socials promoting things vs. reality where the expectation is that things don't just stop working because there is an internet outage or some 3rd party disruption
What happens if you get up in the morning and your car won't start? Do you walk to work?
An hour after you asked the question, I am reading the replies and the conclusion is: no, they cannot.
There are dozens of competitive models that can take over the job. It's just simply a matter of getting Claude (while it's running) to generate feature-parallel multi-agent LLM configurations which can hot-swap between LLM providers in the case that Claude has an outage.
Which nobody is doing, especially not people who vibe code products. Saying "just prepare for it" as an answer to "what do you do if", is not really enough when that "prepare for it" is very expensive (time, tokens, effort etc.).
For someone to do this, they would have to think for themselves, which I've also not seen much of in the vibe-coding space.
I assume it will be similar to when a person is out sick or on vacation. Another person on the team likely could take over the work for a day, but realistically it just sits until they're back.
So work stops until Claude is back? What if Claude comes back and costs 10x the amount? The answer is obviously that you'll "bend over" and pay, because the AI vendor who convinced you that Claude is so great owns you, your codebase, and by extension your company now.
Or you point your Claude Code at a different LLM provider. It's not complicated and there are lots of vendors (and in the open-weights space multiple vendors serving the same models competing on price). Sure DeepSeek 4 isn't quite Opus at the moment. But it's plenty good to do the work. We've got different competing front-end tools and different competing back-end providers. No one 'owns' your company. Maybe that will change as the market evolves and one of the frontier tools becomes so much better that one vendor will own the market. But that's not where we are now.
In such a scenario, are you assuming Anthropic has a monopoly? Or are all LLM providers colluding on prices?
For substandard developers, yes, work stops.
I have seen many, many times in microcontroller forums posts from first-timers along the lines of "hello sirs i have problem please show how to do this", followed by their own reply a few hours later asking again because they were still stuck, where "this" was usually something really trivial; you just needed to read the docs, and the rightful answer was "did you really not try anything in 6 hours?"
Or simple economics kicks in, price/demand all that.
If hand coding pays better there will be plenty who can still do that.
AI should enhance your skills. If it's down and your first thought is to buy another sub from a different vendor, this might be a skill issue. (I'm afraid every day that this will happen to me, btw.)
Claude Code CLI is just a software package, if Anthropic API is down you could always connect Deepseek/other provider API to Claude Code CLI...
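Pointing Claude Code at another provider is mostly environment configuration. This sketch assumes the provider exposes an Anthropic-compatible endpoint; check its documentation for the URL and model names.

- `ANTHROPIC_BASE_URL`, `ANTHROPIC_AUTH_TOKEN` and `ANTHROPIC_MODEL` are environment variables Claude Code documents for custom endpoints. Verify them against your version.
- `with_provider` and the placeholder values are hypothetical.

Scoping the variables to a single command leaves your default Anthropic setup untouched.

```bash
#!/usr/bin/env bash
# Hypothetical helper: run one command with Claude Code pointed at another
# Anthropic-compatible endpoint. Nothing is exported, so the current shell's
# default configuration is left alone.
with_provider() {
  local base_url=$1 token=$2 model=$3
  shift 3
  ANTHROPIC_BASE_URL=$base_url \
    ANTHROPIC_AUTH_TOKEN=$token \
    ANTHROPIC_MODEL=$model \
    "$@"
}

# Example (placeholder URL and model; substitute your provider's values):
#   with_provider "https://provider.example/anthropic" "$PROVIDER_API_KEY" "provider-model" claude
```

If you also have `ANTHROPIC_API_KEY` exported, check which credential takes precedence in your Claude Code version.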
The point is that, with a sufficiently complex setup (with skills, MCPs, prompts, etc.) the difference in AI models will impact the quality of work. You might not care now, but you might care when you have 2 million lines of code and zero idea what's going on.
The point is vendor lock-in. The vibe coding community has reinvented vendor lock-in and is bound to repeat every mistake associated with it.
A local model doesn't have downtime. No, you can't be as hands-off with it as with something like Claude, but isn't that a good thing?
Just use a fallback, like Codex CLI. Takes a little effort upfront to ensure your configuration is wired correctly for both harnesses, but it is pretty easy to get them 90% identical (there will almost always be some experimental / edge case features that differ across harnesses, but in my experience those are negligible in practice).
> Takes a little effort upfront to ensure your configuration is wired correctly for both harnesses, but it is pretty easy to get them 90% identical
You don't need to put in any effort, just get Claude (Codex CLI if Claude is down) to generate the multi-harness config for you.
You sound like you might be a beginner so let me help you out with some advice -- You can get your multi-harness configurations completely identical by simply telling Claude to research the Codex spec and eliminate all feature drift between your configs. Hope this helps.
I more meant feature-level differences. For instance, Claude Code has agent teams, and Codex CLI does not. Or for a while, Codex had "/goal" and Claude Code did not (though now Claude Code has it too). To your point, it is usually possible to polyfill these gaps either with custom code/skills/hooks or with third party plugins.
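For the shared-instructions part of a two-harness setup, one common approach is a single instruction file. At the time of writing, Codex CLI reads `AGENTS.md` and Claude Code reads `CLAUDE.md`, so a symlink keeps both in sync. A minimal sketch; the function name is hypothetical, and it does nothing about the feature-level gaps (agent teams, `/goal`) mentioned above.

```bash
#!/usr/bin/env bash
# Hypothetical helper: make CLAUDE.md a symlink to AGENTS.md so Claude Code
# and Codex CLI read the same instructions. Refuses to clobber a real file.
share_agent_instructions() {
  local dir="${1:-.}"
  if [[ ! -f "$dir/AGENTS.md" ]]; then
    echo "share_agent_instructions: no AGENTS.md in $dir" >&2
    return 1
  fi
  if [[ -e "$dir/CLAUDE.md" && ! -L "$dir/CLAUDE.md" ]]; then
    echo "share_agent_instructions: $dir/CLAUDE.md is a regular file;" \
      "merge it into AGENTS.md first" >&2
    return 1
  fi
  # Relative target, so the link survives moving or cloning the repo.
  ln -sfn AGENTS.md "$dir/CLAUDE.md"
}
```

Existing symlinks are replaced, but an existing regular `CLAUDE.md` is left alone so no instructions are silently lost.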
What happens when you have a codebase made with gcc and gcc is down for, let's say, 8 hours? Are you able to efficiently, smoothly and productively take over the assembly code?
1. When and how would gcc go down?
2. How often do you think that happens, compared to Claude?
You can use a local model, which will go down exactly as often as gcc will. We may still have hopeful notions of being able to understand the codebase, but the reality seems to be that the codebases we don't understand will be the ones that will win out in the market, because they'll be cheaper while still only having about as many bugs as they had when people wrote them.
We're explicitly not talking about local models here; we're talking about Claude.
Because you're better able to take over the codebase a local model wrote than one Claude wrote? The original question was about taking over an LLM-written codebase, it doesn't sound to me like the argument was about a codebase that Claude, specifically, wrote.
The original question is:
> What happens when you have a codebase made with claude using this setup and claude is down for let's say 8 hours?
So:
- A codebase made with Claude
- Using this [Claude] setup
- Claude is down
Brother, look at the first comment in the chain you replied to. It very specifically was about Claude.
Well, in that case, it's also very specifically about this guy's codebase, so none of us can really say anything on this.
GCC down? Did the AI rot your brain that much?
How can you come up with such nonsense?
The same thing as happens if I go to sleep for 8 hours.
wat?
Is this really a position you want to take in public with your real name and identity and everything plastered over your profile?
What can I say, we can't all be geniuses.
The number one power move I have is Nix integration. The availability of tooling, secrets, environment and the ability for the agent to modify its own environment is... well, I don't know how people live without it. I guess you guys still install things using commands and hope everything you need is present on the next machine? Developer machine, CI environment, deployment environment: They're all derived from a single source, and compiling and running always works on every machine.
In Claude I use /branch and /rename a lot (context checkpoints, fork, go back)
I use sandboxing almost exclusively: https://github.com/nix-tools/bubblebox -- it's a generalisation of Numtide's claudebox with a few fixes and some feature additions (more coming). This is best compared to always running your Claude in Docker containers, except there's no Docker runtime. Works fine in WSL and nix-darwin, too.
I just gave mine its own VPS. Maybe more expensive than Nix but it was very easy
I also prefer giving it a VPS over a Docker container.
On my own machine I just give it a Linux User Namespace, i.e. soft virtualisation via "bubblewrap."
What Docker Compose and Linux User Namespaces provide that a VPS doesn't: You can easily mount extra directories from your developer host machine in read or read+write mode. With the VPS you (most likely) need it to clone all of your resources separately, which requires SSH keys, and now you're slowly building towards an independent agentic environment, which is definitely very nice, but time-consuming, compared to piggybacking on your developer environment. Definitely the direction I'm going.
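A rough sketch of the bubblewrap approach, showing the read-only versus read+write mounts described above. Assumptions: a merged-/usr Linux distribution with `bwrap` installed; the function names are hypothetical. A real agent setup will likely need more, such as `/nix` on NixOS, `/run` for DNS on hosts using systemd-resolved, and a writable directory for the agent's own config and credentials.

```bash
#!/usr/bin/env bash
# Hypothetical bubblewrap wrapper: host /usr and /etc are read-only, the
# project directory is read+write, optional extra host directories are
# read-only, and the rest of $HOME is not visible inside the sandbox.

# Print the bwrap arguments, one per line.
# Usage: sandbox_args PROJECT_DIR [EXTRA_READ_ONLY_DIR...]
sandbox_args() {
  local project="$1" dir
  shift
  local -a args=(
    --ro-bind /usr /usr --ro-bind /etc /etc
    --symlink usr/bin /bin --symlink usr/lib /lib --symlink usr/lib64 /lib64
    --proc /proc --dev /dev --tmpfs /tmp
    --bind "$project" "$project"
  )
  # Extra host directories, mounted read-only.
  for dir in "$@"; do
    args+=(--ro-bind "$dir" "$dir")
  done
  # Fresh namespaces for everything except the network.
  args+=(--unshare-all --share-net --die-with-parent --chdir "$project")
  printf '%s\n' "${args[@]}"
}

# Run a command (e.g. `agent_sandbox claude`) inside the sandbox.
agent_sandbox() {
  local -a args
  mapfile -t args < <(sandbox_args "$PWD")
  bwrap "${args[@]}" "$@"
}
```

Unlike a container image, nothing here is copied: the sandbox sees the host's own toolchain, which is what makes it feel like the host system with a few variations.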
I just use docker and I don't feel I'm missing anything?
nix develop ensures your dev env is the same as your build/test/prod env. At least with Python everything is a flurry of requirements.txt, Python versions, poetry, pyproject.toml, perhaps automated with direnvs, a hefty Dockerfile/docker-compose, and perhaps conda (ugh) along the way; lots of moving parts.
I have a project that's mostly Rust sprinkled with C++ libs and Python helpers and it's easier to manage than the average virtualenv. Everything builds with nix build, everything runs with nix run, profiler/debugger works, IDE detects everything on any of my computers, builds and links with CUDA on x86, aarch64, NixOS, MacOS, Ubuntu or Amazon Linux. nix build can even build a Docker image for the odd need of Docker, and I haven't tried but I'm convinced that if I import the flake on my nix-config it will be built into the SD card for my Raspberry Pi just fine.
It's even replaced Ansible for me, colmena all the way.
Docker's ability to mount host directories in the container is really nice.
Maybe you have some premade tooling that helps provide persistence between container invocations.
But by default, closing your agent container and opening it again just wipes everything you didn't host-mount.
What I'm advocating is really just the same functionality without the Docker runtime, because Linux has namespaces.
Feels more like you're on your host system with exactly the minor variations you specify.
Making Docker feel like your host system is possible, but I just never felt at home.
yeah, you can use rocker --home --user -- $CI_IMAGE
For those who don't want the complexity of Nix, Mise is a good compromise
For those who don't know: Mise is a version manager (among other things), and is said to be an improvement over its predecessor, asdf:
https://mise.en.dev
https://asdf-vm.com
+100. I also dig fnox (encrypted-secrets-in-git) and hk (pre-hooks manager that is actually fast and stays out of the way) by the same author, pretty much default for any project I start nowadays.
Though I also use nix to manage my machines :-D
Awesome, both fnox and hk look very well-made.
How does fnox compare to sops?
How does hk compare to lefthook?
And does hk and fnox have a similar Nix integration as lefthook-nix and sops-nix?
I'm still hoping I don't need to make a better lefthook.
I kind of like sops-nix, not sure what's missing, really. Maybe fnox is similarly wholesome for non-Nix users.
I see that hk has a flake, so that's a good sign.
https://github.com/sudosubin/lefthook.nix
https://simonshine.dk/articles/lefthook-treefmt-direnv-nix/
The reliance on context to drive correct actions just doesn't work well. I am constantly wrestling with AI agents that do not do what you tell them. Every AI agent out there seems to suck in this regard, leaving it up to the user to build in their own guardrails. I have a bad feeling that nobody is working on an improved solution.
I’ve seen no reason to believe it’s even possible to solve this.
The worst thing about LLMs is they can pass the Turing test, leading people to believe they have an Asimov style robot instead of a very cool statistical model. It feels like they should be able to follow instructions or keep instructions from content separate, but that’s not what’s happening.
Regarding:
```
# Development Workflow

*Always use `bun`, not `npm`.*

# 1. Make changes

# 2. Typecheck (fast)
bun run typecheck

# 3. Run tests
bun run test -- -t "test name" # Single suite
bun run test:file -- "glob" # Specific files

# 4. Lint before committing
bun run lint:file -- "file1.ts"
bun run lint

# 5. Before creating PR
bun run lint:claude && bun run test
```
I have these things in a pre-commit hook; this way the targets always run and the agent is forced to fix any failures (I ask Claude to commit changes). The agents are erratic and very often skip these steps. Anything that can be deterministic I keep as scripts.
Regarding commits: both Codex and Claude are terrible at writing them. I have this in my user CLAUDE.md:
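A minimal sketch of such a hook, assuming the `bun run` scripts from the workflow above exist in `package.json`. Saved as `.git/hooks/pre-commit` and made executable, it runs the checks in order and aborts the commit on the first failure, so the agent cannot skip them. The `run_checks` name is hypothetical; hook managers such as hk or lefthook, mentioned earlier in the thread, handle the same job declaratively.

```bash
#!/usr/bin/env bash
# .git/hooks/pre-commit (hypothetical setup; chmod +x to enable)
# Runs the deterministic checks and blocks the commit if any of them fails.

# Run each command string in order; stop at the first failure.
run_checks() {
  local cmd
  for cmd in "$@"; do
    echo "pre-commit: $cmd"
    # eval so that each entry may carry its own arguments
    if ! eval "$cmd"; then
      echo "pre-commit: '$cmd' failed, commit aborted" >&2
      return 1
    fi
  done
}

# Only run the checks when git invokes this file as the pre-commit hook.
if [[ "$(basename -- "$0")" == pre-commit ]]; then
  run_checks "bun run typecheck" "bun run lint" "bun run test"
  exit $?
fi
```

Because the agent is the one running `git commit`, a failing hook puts the error output straight into its context, which is usually enough for it to go back and fix the problem.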
````
Pattern: `type(scope): message` where type is `fix`, `feat`, `chore`, `docs`, `refactor`, or `style`; scope marks what is affected; message is a short lowercased description.

Keep subject and body lines under 72 characters. Always write a body explaining what, how, and why in continuous human-readable text. For fixes include the error message being fixed. No first-person speech. Re-read the actual git diff before writing — the message must describe what changed, not what was planned.

Use following command to create commit:

```bash
git commit -F - <<'EOF'
type(scope): subject line

Body paragraph explaining what, how, and why.
EOF
```
````
Without it, it would write the body as a single long sentence; when asked to fix the lines, it would just insert \n (newlines), which were not respected and were instead rendered as literal characters.
Another thing I find helpful is VOCABULARY.md. Very often the agent would assume (connect?) a different thing than what I had in mind; with VOCABULARY.md I make sure that when I say "thing", Claude and I have the same "understanding" (connection?) of what "thing" is.
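In the same spirit of keeping deterministic checks in scripts, the commit format rule can also be enforced by git instead of by the prompt. A minimal sketch of a `commit-msg` hook, assuming the conventions quoted above (allowed types, mandatory scope and body, lowercase subject, lines under 72 characters); `check_commit_msg` is a hypothetical name. Git passes the path of the message file as the hook's first argument.

```bash
#!/usr/bin/env bash
# .git/hooks/commit-msg (hypothetical setup; chmod +x to enable)
# Rejects messages that break the `type(scope): message` convention.

check_commit_msg() {
  local msg subject second body
  # Drop comment lines git may add when an editor is used.
  msg=$(grep -v '^#' "$1")
  subject=$(printf '%s\n' "$msg" | head -n 1)
  second=$(printf '%s\n' "$msg" | sed -n 2p)
  body=$(printf '%s\n' "$msg" | tail -n +3)

  local re='^(fix|feat|chore|docs|refactor|style)\([^()]+\): [a-z0-9]'
  if ! [[ "$subject" =~ $re ]]; then
    echo "commit-msg: subject must look like 'type(scope): message'" >&2
    return 1
  fi
  # "Under 72 characters": reject any line of 72 characters or more.
  if ! printf '%s\n' "$msg" | awk 'length($0) >= 72 { exit 1 }'; then
    echo "commit-msg: keep every line under 72 characters" >&2
    return 1
  fi
  # Require a blank second line followed by a non-empty body.
  if [[ -n "$second" || -z "${body//[[:space:]]/}" ]]; then
    echo "commit-msg: add a body after a blank line" >&2
    return 1
  fi
}

if [[ "$(basename -- "$0")" == commit-msg ]]; then
  check_commit_msg "$1"
  exit $?
fi
```

A rejected commit shows the agent exactly which rule it broke, instead of relying on it to remember the CLAUDE.md instruction.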
I mean at this point, you should just write a few deterministic orchestration scripts to automate away the boring parts and write the code yourself. Why are we wasting our time on making the wonder shit-machine work?
Isn't it simpler to use claude's vocabulary? I don't see a good use case for this.
To understand a solution you must first understand the problem. If your whole company calls its customers "clients" but Claude finds that confusing, I think it's probably easier to tell Claude that than to get everyone in the company to change how they talk.
This was very difficult to read. We really need to snap out of letting LLMs write posts. Even if there is some added value in this post, the feeling of chewing sand is just distracting and unnecessary.
Claude Code with skills is undoubtedly powerful and useful, but it doesn't always work as expected.
I always get the best results when I have live feedback with it.
In recent weeks, I think the harness/model has reached the point where you can just ask it to do stuff and it just does it. You can use plan mode, you can also use superpowers, or whatever other skill, but given that you'll review something anyway, why not work directly with code instead of silly amounts of md files?
I like having a spec file that is used to generate the code. It's denser, and it's easier to understand what the application is supposed to do. Prior to AI agents, I had a more complex relationship with requirements because not all devs updated them. I was never sure whether the spec or the code described the correct behavior for any given aspect of the application.
> but given that you'll review something anyway,
If you aren't using AI for code review in 2026, why would you even bother? High quality, error-free, better-than-human code generation AND review is available for cheaper than ever. Why are you wasting your life reading code you didn't even write?
Because it might not have done what I wanted it to do. Also, just as with normal code review, I’m not just looking at the code but the final product. Maybe I realize after that I asked it to do something that was wrong?
In the recent weeks I trust Claude less and less. Yes, you can ask it to do stuff and it does stuff. But if you do look what it did you will often find corners cut, work based on assumptions and not verification, a lot of stuff missed. Even tests - it is common for it to write tests which in reality test nothing.
I’m getting into agentic coding (I know, late to the party, and that’s been a good spot for my experience and use case), so I’m reading with interest. The first tip: “give Claude a way to verify its own work”.
So what’s the recommendation for Claude to have a feedback loop?
Because it’s not what follows in the article: _“Explore, then plan, then code.”, “Use plan mode…”, “Reference, do not describe.”_
Have tests, provide screenshots (or enable it to navigate the UI): https://code.claude.com/docs/en/best-practices#give-claude-a...
In my experience, the biggest benefit comes from having good quality integration and unit tests that are easy for the agent to run on its own to verify its work against.
Typically for most code it's telling claude how to run tests.
For front-end code it's giving Claude a way to 'see' the work; for example, a Playwright MCP server seems common. https://playwright.dev/docs/getting-started-mcp
Best Claude Code daily-driver guide I’ve read. Though I’ve only read two. The “let Claude write rules for itself” CLAUDE.md pattern is the highest-ROI habit in there. But here’s the thing. The assumption underneath: this works when Claude mostly follows CLAUDE.md. Anthropic’s own engineering post from May 25 (https://www.anthropic.com/engineering/how-we-contain-claude) reports that their telemetry shows ~93% of permission prompts get clicked through and ~17% of dangerous actions slip past the auto-mode filter.
Their conclusion: environment-layer containment first, then model-layer steering. CLAUDE.md is the right configuration layer but it is not a containment layer. Worth thinking about whether your worst case is a lost afternoon or a lost database and all backups deleted, too: https://safebots.ai/compromise.html
But the more important point is cost. People are starting to realize just how costly it can be to run agents without precomputing and caching (https://safebots.ai/costs.html), and self-orchestrating agents can go up to 1000x (https://safebots.ai/kimi.html).
What's the standard for a "battle station" interface to manage agents for programming (using isolation with maybe git worktrees and ideally VMs)?
I found this one: do you guys know something else?
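Independent of which interface you pick, the git-worktree part of that isolation is easy to script. A minimal sketch, assuming one branch and one sibling checkout per agent task; the function name and the `agent/` branch prefix are hypothetical conventions. VM-level isolation would sit on top of this.

```bash
#!/usr/bin/env bash
# Hypothetical helper: give each agent task its own branch and checkout,
# so parallel agents never write to the same working tree.
# Usage: new_agent_worktree REPO_DIR TASK_NAME   (prints the new path)
new_agent_worktree() {
  local repo task path
  repo=$(cd "$1" && pwd) || return 1
  task="$2"
  # Sibling directory, e.g. ~/src/app -> ~/src/app-fix-login
  path="$(dirname "$repo")/$(basename "$repo")-$task"
  # New branch agent/TASK, created from the repo's current HEAD.
  git -C "$repo" worktree add -q -b "agent/$task" "$path" || return 1
  echo "$path"
}

# When the task is merged or abandoned:
#   git -C REPO_DIR worktree remove PATH_PRINTED_ABOVE
```

Each agent then gets started inside its own printed path, and the branches can be reviewed and merged like any other PR.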
How much time do you lose when doing things like "verify plan with a second clean agent" instead of just reading and fixing it yourself in 5 min? How much understanding do you lose? How do you manage to treat it "as an engineer" where it's clearly not there yet? How much time do you lose when it makes almost the same mistake, invents stuff or tries to gaslight you over and over? What about blood pressure?
> How much time do you lose when doing things like "verify plan with a second clean agent" instead of just reading and fixing it yourself in 5 min?
The marketing strategy for the AI firms is to get people with poor reading and writing skills socially dependent on their "tools".
The selling point is that you can delay "reading and fixing it yourself in 5 minutes" ad infinitum, consequences be damned.
What we gain from LLMs is avoiding (heaven forbid) having to read and write for another 15 minutes.
Why are there so many flagged comments in here? They all look fairly banal but yet still flagged.
The majority seems AI-generated slop.
Out of curiosity, how much does it cost to daily drive Claude like this?
I only use Opus 4.7 and am on the $100/mo plan. I usually make sure the context does not grow beyond 30-40% of the 1M tokens. On heavy coding days where I do something pretty similar to this, I would occasionally run into the five-hour limit, but that happens like once per week and then it wouldn't take too long to reset. Note that I use caveman, but I'm not sure to what extent that really helps.
About 10-22€/month is the minimum, since you need Claude Code, which means you need either the Pro subscription (22€) or an API key with some credit on it.
isn't it $20/month /s
The post gets to the point. Somehow this must be buried in Anthropic's documentation, but I miss this kind of back-to-basics post. Even if they are LLM-penned.
The author’s claim that Claude is a multiplier for skill is probably true for now, but it also feels like cope inspired by usability issues with Claude. The advice is all good, but none of it is especially clever or impressive or hard to grasp. The multiplier just comes from the fact that Anthropic hasn’t taken this essay and several similar ones and incorporated their feedback into the product. This is a pretty shallow moat of expertise that Anthropic ought to automate in a week.
My complaint with anthropic is actually the opposite. They seem too focused on building this suite of products (because they want lock-in), but they can’t even get the availability and speed on their models in an acceptable state. It really does seem like they’re falling behind google and OpenAI at the moment.
Nerds and their tendency to over-complicate everything. What is wrong with just an IDE with a simple claude integration?
I agree, I find that just telling claude to use the CLIs I would have used anyway in the prompt works just fine. Use gh to do X, use az to do Y, build using Z. The harness handles the rest. All these MCPs, Skills, plugins, etc are just noise
Claude is not a simple technology. Why are you trying to stuff a 4-dimensional peg in a square hole?
"Claude Code as a Daily Driver", which was also used to generate this article.
Also, how is "Explore, then plan, then code" considered "beyond the basics"?
I’ve used Claude for a couple of months now and didn’t know about the specific “plan mode” you can put it into!
To me, this kind of talk exhibits the very cultish and con side of the whole genAI train. In a way, it does a poor job, especially when the intent is positive about the technology: it casts the technology in a bad light.
Generally, and more so with paid products, one should expect to get something that is ready to be used, tuned to the best of their ability by whoever is selling it. Instead, this is basically saying that the product is actually not much more than an empty box, and that it is your responsibility to augment it with third-party plugins and markdown texts that make it finally useful. And you'd better be careful selecting the skills you install: you don't want to end up with second-tier material made by GithubInfluencerA, you definitely need the work of GithubInfluencerB.
In the end, it's what is giving companies fuel to keep the hype running, because it allows them to counter every possible argument or doubt about the technology, especially the ones made in good faith. No matter the problem you're facing, the blame is definitely on you, the user, for not setting up the tool in the right way.
I'm struggling in a lot of ways to accept LLMs, but if I ever become completely sold on them and take this technology seriously, it won't be before this mood has gone away.
I see this first generation of coding agents a bit like an AI-era Microsoft Excel: you need to be a power user to use it correctly, otherwise you'll end up failing catastrophically. Hence the number of different ways to use it.
Having an "unfinished" product is also a great marketing tool for companies like anthropic: each skill/plugin/guide that you see on the internet is boosting their SEO + social validation metrics.
I understand and sympathize with this point of view.
I would just say this: there is a difference between advice for using a product, and for _optimizing_ your use of a product. Between a user and a power user.
I think devs probably disproportionately like to see themselves as power users of any given tool, and thus with coding agents, there are 1000 "systems" being thrown out on GitHub on any given day. Generally speaking, it is safe to avoid these, especially if you're new to the tool.
But saying the fact that people are into optimizing their setups indicates some fundamental deficiency of the tool misses the point, I think.
Claude Code and Codex CLI (and OpenCode, and I'm sure many others) are _remarkably_ effective right out of the box. The teams behind these tools must make them _generically_ useful so that they are accessible to as many people, and as many use cases, as possible. That is part of why, when you become familiar with the tool, there is typically going to be a level of customization you can apply to it to optimize it for _your_ use cases, beyond the generic out of the box configuration.
Similarly, I don't think it would be fair to critique VS Code simply because most power users augment it with a suite of extensions. In fact, its customizability/extensibility is part of what makes it great.
I absolutely understand the power user perspective. The point is not that, and maybe I wasn't clear enough in pointing it out.
Here, something different is going on instead of the usual "base tool is ok for 90% of use cases, remaining 10% is covered by plugins and extensions". A lot of developers are finding it difficult to commit to agentic coding workflows, feeling a stretch on a lot of different aspects.
Companies, with the help of a very prominent and vocal part of the web and social media community, are addressing every issue by simply blaming the users, saying it's their fault if they're not keeping up with all the alleged advancements in prompt strategies. See the whole "maybe you haven't tried it in the last two months, everything's changed now". While it's true that things have been moving very fast, the fundamental idea behind the technology is the same, and some concerns about it simply cannot be wiped away by scaling some factors.
> To me, this kind of talk exhibits the very cultish and con side of the whole genAI train ... Generally, and more so with paid products, one should expect to get something that is ready to be used
Right like I bought an AWS EC2 m6a.metal instance expecting to get something that is ready to be used. Now being told to recite arcane "commands" from the cloud computing holy book. They claim their supposedly groundbreaking hypertext protocol isn't even accessible to mere mortals using a $6000/month EC2, the blame is definitely on you, the user, for not setting up the tool in the right way.
This sysadmin cloud cult is basically saying that the EC2 product is actually not much more than an empty box, and that it is your responsibility to augment it with third-party servers and interpreters and application source texts that make it finally useful. And you better be carefully selecting the tools you install.
Oh great! Another AI slop article about "working" with AI (= working for AI). Do you notice how much bloody work you put in the boring parts, only to leave out the most creative aspect of software engineering to a slot-machine?
Written by an LLM, deployed by an agent to the blog, posted to HN by a bot, upvoted by more bots to market "AI".
> Delegate, do not pair-program. Cat Wu (Claude Code team): “The model performs best if you treat it like an engineer you’re delegating to, not a pair programmer you’re guiding line by line.” Write a crisp brief upfront, then let it run.
This is also how you get a slop codebase that you won’t easily understand.
It becomes a labyrinth that only the Agent knows. It’s not a catastrophe when you’re making prototypes or projects like you see on X.
But if you are expanding your codebase or trying to build something more professional and maintainable, I find it important to explicitly spec things bit by bit so I can understand the code and somewhat keep my writing style in this codebase. But this is only productive when you have a fast model; otherwise it kills your train of thought while you wait for the output.
If the model is slow, delegation is probably the only way.
Honestly, claude code has saved so many hours of finding bugs for developers
For lazy cretins maybe
I agree. In fact, computers in general are for lazy cretins who can't use a pen and paper. We got man into space calculating with a pen and paper, if it was good enough then, it is good enough now. I like your concept, it should go further, cars are for people too lazy to walk. Planes are for people too lazy to flap their arms. Video cameras are for people too lazy to draw each frame by hand in real time then play them in a hand cranked projector.
Please. Don't compare the objectively useful deterministically operating tools with the stochastic shit-generating-machines.
Bro go take a walk really, get some fresh air maybe, get a grip jeez