Throwing this out there, I have a command line driver for LLMs. Lots of little tricks in there to adapt the CLI to make it amenable to LLMs. Like interrupting a long-running process periodically and asking the LLM if it wants to kill it or continue waiting. Also allowing the LLM to use and understand apps that use the alternate screen buffer (to some degree).
Overall I try to keep it as thin a wrapper as I can. The better the model, the less wrapper is needed. It's a good way to measure model competence. The code is here https://github.com/swax/NAISYS and context logs here for examples - https://test.naisys.org/logs/
I have agents built with it that do research on the web for content, run python scripts, update the database, maintain a website, etc., all running through the CLI; if an agent calls APIs, it does so with curl. Example agent instructions here: https://github.com/swax/NAISYS/tree/main/agents/scdb/subagen...
The one thing that I always wonder is how varied those interactions with an agent really are. My workflow is enough of a routine that I just write scripts and create functions and aliases to improve ergonomics. Anything that has to do with interacting with the computer can be automated.
Yea a lot of this is experimental, I basically have plain text instructions per agent all talking to each other, coordinating and running an entire pipeline to do what would typically be hard coded. There’s definite pros and cons, a lot of unpredictability of course, but also resilience and flexibility in the ways they can work around unexpected errors.
> It's a good way to measure model competence.
Can you elaborate?
I do think it’s interesting how Claude Code makes shell and dev automation more important – it also makes testing and code review more important
So there is probably some room for innovation here
But most of these seem like problems with Claude (and maybe fundamental problems with LLMs), not problems with the CLI interface:
This started a game of whack-a-mole where the LLM would also attempt to change the pre-commit hooks! I had to fix it by denying Edit(.git/hooks/pre-commit) to my project’s .claude/settings.json. I look forward to its next lazy innovation.
If you watch Claude Code, you’ll see that it often uses head -n100 to limit the results a priori. It also gets lost about which directory it’s in, and it will frustratingly flail around trying to run commands in different directories until it finds the right one.
Agree with the whack-a-mole effect, where it goes from nailing the problem or bug to absolutely destroying the code. I would offer some of these MCP tools I wrote/had written to solve the problem: https://github.com/kordless/gnosis-evolve. Tools are in contrib-tools.
It has helped tremendously having a dedicated build service that CC can control through MCP vs running Docker itself because it can then restart the container and test. And, the fuzzy search tool and diff editor seem to perform better than the replacement strategy Claude Code uses, most of the time. I continue to work on the editor when I run into issues with it, so happy to help anyone interested in implementing their own file editing (and search) strategy.
You will need to convert these to Claude Code format, but all you need to do is ask CC to do it for you...
Your license is interesting. To meet your intent, I suggest you revisit this definition:
> “Military Entity” means any armed forces branch, defense department, or military organization of any nation or alliance.
As written, this only applies to nation states. It excludes many kinds of human organizations that use force to impose their will on others. The word for this is “Terrorist.”
While that term has been applied to many groups for many reasons, it technically means “the use of violence against non-combatants to achieve political or ideological aims.”
Adding Terrorism + Nation-State Militaries should cover most everyone you intend here, including organized crime and private military contractors. You could add “financial gain” to the definition if you want to ensure those last two are captured.
> The word for this is “Terrorist.”
Or Freedom Fighter or Opposition Forces or Militia etc. The term used correlates significantly with how west-aligned the armed group is geopolitically.
With Terrorist it would probably be interpreted as whatever appears on US/EU terrorist organization lists.
The words “license” and “interesting” should not be in the same sentence.
Unless you are a lawyer practicing IP law, do not attempt to modify or customize licenses, period. Contrary to what pbronez is saying, you have no guarantee that the terms “Terrorism + Nation-State Militaries” will function anything like you expect them to. Not to mention that most mainstream licenses have had to be crafted in special ways to deal with differences in what different countries will even allow you to put in a license (you can’t generally just write “by using this software you agree to name your child after me” and have it actually hold up).
If you don’t want your software to be used by the military industrial complex, I get it, but DO NOT just try to hand-spin a license based on what seems to make sense.
Instead, consider that you’re not the first person to want this, and there are probably like 3 existing licenses crafted by actual lawyers who know what they’re doing and can also explain the possible pitfalls in using the license.
Nation states might actually follow the license. Would terrorists?
I'd argue that many CLI tools output too much log spew by default and rely on making humans take up the burden of parsing through masses of output to find the one useful line.
For another example of where this is a problem, look at any large company that pays to keep logs in Kibana; the amount of over-logging paid for is insane.
Approximately a third of my Claude Code tokens are spent parsing CLI output, which is insane! Often Claude doesn't catch what it needs in the massive log output and I have to go through the logs myself to find the problem and point it out to Claude.
I'm not entirely convinced by the examples given in the article. For starters, there's already a perfectly reasonable way to lazily buffer outputs for CLI commands: pagers. If LLMs aren't able to figure out how to pipe to `less` and send the escape sequence for the down arrow to read until they're done, instead of repeating the same builds to get the error output in small amounts at a time, it seems a bit overconfident to assume that it's worth investing in larger changes to design around the undesirable local optima they seem to be getting stuck in today. Furthermore, there already _is_ a way to get structured output from cargo build today; it has the ability to emit diagnostics as JSON rather than plaintext. I'm actually a bit surprised that this isn't already something they're taking advantage of. It's not clear to me whether this is a limitation of the model or of the agent author's configuration, but either way, I'm not really sure the problem lies with the interface as much as with how it's being used.
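For what it's worth, cargo does expose this today via a real flag; the jq filter below is just one illustrative way to slice the output:

    # emit build diagnostics as newline-delimited JSON instead of plain text
    cargo build --message-format=json

    # one possible way to keep only the rendered compiler messages (assumes jq)
    cargo build --message-format=json \
      | jq -r 'select(.reason == "compiler-message") | .message.rendered'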
As for the git CLI...yeah, that one sucks for humans too. There have been some recent improvements, like the addition of subcommands like `switch` to take over part of the API space previously covered by the `checkout` command, which was semantically overloaded to an absolutely staggering degree. I feel like there's been more than enough pain for the humans using the git CLI to justify completely overhauling the interface for years now, and that hasn't happened yet, so I'm skeptical that the situation for LLMs will realistically improve all that much either. On the other hand, I've found `jj` to be an extremely suitable replacement for interacting with git repos, to the point where I seem to have shed quite a bit of my working knowledge of handling some of the messier rebases that I'd previously been fairly used to doing on a regular basis. As someone who was frequently the one helping teammates deal with messy conflicts when pulling in changes while working on branches that took a long time to be ready to merge, it's a bit surreal to realize how much mental energy I've been able to recoup by just not having to care about doing things directly via the git CLI anymore. Maybe LLMs would have a better time working with git repos through jj as well.
Or you can give your AI agent access to your terminal. I've been using https://github.com/hiraishikentaro/wezterm-mcp/ with gemini-cli and it generally allows it to use the terminal like I would, so stuff like scrolling inside interactive TUIs etc more-or-less just works.
Appreciate the share!
I might give it access to a terminal in a locked-down VM; I don't know about a shell.
At least give it its own login (and no sudo privileges).
Somehow a whole industry is now fine with Heisenbugs being a regular part of the dev workflow.
the salary-raise-and-promo-project industry within large corps is fine with that
there is everyone else who is supposed to deliver software that works, like always, and they are not fine with built-in flakiness
I’ve been building a context-engineering tool for collaborating with LLMs. The CLI is for the human and the MCP is for the LLM, but they all map to the same core commands
https://github.com/jerpint/context-llemur
I’ve actually bootstrapped ctx with ctx and found it very useful!
It basically stops me from having to repeat myself over and over to different agents
I think part of this is that we're in a transition phase. The shell commands we have built (for example) were built for human consumption (e.g. manpages). They were built around the expectation that we learn how to use them through experimentation or are taught by more knowledgeable peers. In the AI world, we basically need to assume that role of the guide / sherpa for the LLM.
Another idea that I've been thinking about is context hierarchy:
Low -> High Utility
Base (AI reads tool desc/manpage,etc.) > General human advice (typically use grep this way, etc.) > Specific advice (for this project / impl this is how you use the tool).
Currently the best interface to provide our insights is via MCPs. At https://toolprint.ai/ we're building a human (or machine) driven way to supplement that knowledge around tool use for Claude/Cursor, etc.
A practical way in which we dogfood our own product is with the Linear MCP. If you connect that and ask an agent to create a new issue, it predictably fails because there are no instructions on which Linear project to select or on the correct way to provide a description around Linear's quirks. When we connect the Linear MCP via the Toolprint MCP, it gets pre-primed context around these edge cases to improve tool use.
The shell is an interface. The computer is the tool. Then we find that we have workflows that are actually routine, and we create scripts to handle them. Then we find that they are contextual, and we create task runners to provide the context. Then our mental capacity is freed, while the computer takes care of the menial stuff. And everything is good.
That is generally how it goes for power users. And people who take the time to RTFM.
But now, I see people who don't want to determine their workflows. It's just ad-hoc use, spending their time on local decisions that don't matter that much instead of grasping the big picture and then solving it. Maybe it helps them look busy.
So I don't want an agent for Linear. What I want is maybe a bash alias "isc" (for "issue create") that pops up nano, where I write the issue in git commit format (title + blank line + description). Upon saving, the rest is done automatically, because it can determine the project based on some .trackerrc I put in the root of the project. Or maybe a "linear-issues" emacs command and a transient interface (again, the correct project can be determined automatically).
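A minimal sketch of that `isc` helper, assuming a hypothetical `linear-cli` command and a `.trackerrc` containing a `project=` line (both illustrative, not real tools):

    # isc: write an issue in git-commit format, then file it automatically
    isc() {
      local f title desc root project
      f=$(mktemp) || return 1
      ${EDITOR:-nano} "$f"
      title=$(head -n1 "$f")     # first line = title
      desc=$(tail -n +3 "$f")    # everything after the blank line = description
      root=$(git rev-parse --show-toplevel) || return 1
      project=$(grep '^project=' "$root/.trackerrc" | cut -d= -f2)
      linear-cli issue create --project "$project" --title "$title" --description "$desc"
      rm -f "$f"
    }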
Agree 100% that CLI interface design needs to be altered to include AI Agents as a new type of user persona, but I don't think it's as drastic of a change as one might expect.
We designed desktop GUIs & web browsers on top of the terminal to allow a type of user to interact without speaking "lower level" commands, but we've also created abstractions to hide complexity for ourselves at this layer. We just so happen to call them CLI apps, scripts, Makefile targets, Taskfile tasks, Justfile recipes, unix tools, etc. Each consists of a pseudo-natural-language short-code name combined with schema-validated options and some context around what each option does (via the --help view). The trick is how to optimize for both human developers and AI agents, giving each access to the same tools through the interface best suited to it.
In an experiment to let my agents share the exact same 'tools' that I do for developing in a repository, I gave it direct access to load and self-modify the local project Justfile via MCP: https://github.com/toolprint/just-mcp
Just as (pun intended) I create tools for myself to repeat common tasks with sane defaults and some parameters, my agents immediately gain the same access, and I can restrict permissions to these instead of ANY bash command (e.g. "Bash(just:*)"). The agent can also assist in creating tools for me or itself to use that save on time and token usage. I'd love to see the paradigm evolve to the point where it feels more like warp.dev, where you don't have to switch between two text boxes to choose whether you're talking in natural language or instructing to run a known 'tool'.
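For context, that permission scoping lives in Claude Code's .claude/settings.json; a minimal example (the specific entries here are just from my setup, echoing the deny rule mentioned earlier in the thread):

    {
      "permissions": {
        "allow": ["Bash(just:*)"],
        "deny": ["Edit(.git/hooks/pre-commit)"]
      }
    }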
Interfaces and tools are orthogonal. It's like a hammer: the head is what is used on the nail, while the handle is shaped to fit the human hand. We can modify one without modifying the other. Another good example is Magit (or Lazygit) and git. Magit is designed to be used interactively, while git is more about the domain of version control.
Workflows are human processes; what we do is name them and identify their parameters. The actual tools that implement those workflows don't matter that much at a human scale, other than for cognitive load. So I don't care much about gcc's various options. What I want is `make debug` or `make release` (or just `make`). And cognitive load is lowered because I only have these to remember, and they are deterministic.
An agent is not a good bridge between humans and tools, because agents increase cognitive load, while interfaces have always been about lowering it. There's no typing `make test` and getting a nice output of all the lines that have been flagged (with some integration, like Vim's quickfix, which can quickly bring you to each line). Instead it's typing a lot and praying that it actually does something good.
I don't think I disagree with you here, but I'm not sure I fully understand your position.
I agree that if the human is "driving", they should be able to use the tool directly (i.e. make test). If you put an agent in the middle and ask it "please run make test" that's just silly and costs extra for no benefit.
Where you get benefit is if you design tools like "just test" as an MCP tool called "mcp__just-mcp__test" and give a fully-autonomous agent instructions like: "Whenever you feel like you've completed a task, run mcp__just-mcp__test and fix errors and warnings until it passes, then you may commit changes locally". LLMs have 'cognitive load' as well, so why not offload the deterministic logic to tools in the same way we do?
My point is that whenever the instructions feel repetitive, it's better to write them down and have a trigger instead. And if you find yourself with a lot of triggers, then there's usually some kind of abstraction that reduces them.
> give a fully-autonomous agent instructions like: "Whenever you feel like you've completed a task, run mcp__just-mcp__test and fix errors and warnings until it passes, then you may commit changes locally"
Why not have some script named `work-on` like:
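    #!/usr/bin/env bash
    # work-on: point the agent at a task, then loop until `make test` passes
    task_file="$1"

    agent --from-file "$task_file"
    until make test > test.log 2>&1; do
        agent --from-file fix-test-errors.txt
    done
    git commit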
And then use one of the git hooks to run and use $message for the commit. Then you will only have to run `work-on task-one.txt`. You can also make it an emacs command that runs on the current buffer and bind it to "C-c C-c".
I think we are both approaching the situation with the same intent, which is: "when I have a repetitive task with a small cardinality of input options, I want to create a deterministic abstraction and an easy-to-invoke trigger".
If the implementation and execution of the script is considered separate, I just want my agent to immediately know "how it's supposed to be used" for any new script I just wrote, and to be given scoped permissions for it. If it's given full Bash access it can certainly invoke it as I would, but unless the documentation in the script is extensive, it might not know all the context I do about how and when to use it properly. Plus, the output may be overly verbose by default and waste tokens, so it should make sure to only call it in a more "quiet" mode.
The original point of this thread was around "how to design CLIs better for AI agents", so the question is whether we can do better from a token-efficiency standpoint than writing the same scripts as before. Perhaps simple hook-driven actions are not good examples of where things may be significantly improved.
> If the implementation and execution of the script is considered separate, I just want my agent to immediately know "how it's supposed to be used" for any new script I just wrote and be given scoped permissions for it.
Then there are only two paths I can think of. Either consistency in the help system, so that the agent can recursively determine a path by asking questions and getting answers (which no one really does, other than children) - I think that's what MCP is all about. Or these kinds of overarching workflows/scripts with agents sharing the same context, but each uniquely suited to a specific task.
But I can't find any pros for an agent over training yourself and having a specialized and deterministic toolset. If you look hard enough at the prompts, you'll find enough similarity between them to build out a script.
I have found just creating a scripts directory and telling Claude Code to run the scripts is pretty effective. MCP seems like overkill for this use-case.
Losing track of the cwd is the reason I append it to the output of each command run in the wcgw MCP [1]
It rarely gets the directory wrong after that.
I won't be surprised if Claude Code does the same soon.
However, they do have an env flag called CLAUDE_BASH_MAINTAIN_PROJECT_WORKING_DIR=1
This should also fix the wrong dir behavior.
[1] https://github.com/rusiaaman/wcgw
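The same trick is easy to bolt onto any shell-driven agent; a hedged sketch (the wrapper name is mine, not wcgw's):

    # run a command on the agent's behalf, always stamping exit status and cwd
    run_for_agent() {
      "$@"
      local status=$?
      printf '\n(exit %d, cwd: %s)\n' "$status" "$PWD"
      return "$status"
    }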
If I can learn how to use the Bulk Rename Utility (it’s actually quite useful once you get to grips with it), then AI should be able to as well. ;)
There’s the saying that computers should adapt to us, rather than the other way around, but now this makes me wonder which side LLMs are on in that picture.
Bulk Rename Utility is excellent and I've used it a lot in the past. Ironically, I've been thinking about replacing it with an LLM-based rename tool that can look at each filename and make decisions about how to rename it (I'm usually trying to rename tens of thousands of PDFs which have the date written in a dozen different formats and languages, and normalize them all to dd MM YYYY).
The --no-verify example is interesting because I can imagine the same hint being useful for junior engineers. In general it's hard to give the right level of advice in CLI docs because you don't always know who the consumer will be, and so what knowledge can be assumed. The thing that makes LLMs different is that there's no problem with being verbose in the docs, because you're not wasting any human's time. It would be cool if you could write docs that provide extra advice like in the example, and then the interface adapted to the user's context - for LLMs, provide everything; for human users, learn what they know and give them just the right level of advice.
I had the same thought when I was dogfooding a CLI tool I've been vibe coding. It's a CLI for deploying Docker apps on any server. And here is the exact PR "I" did.
https://github.com/elitan/lightform/pull/35
One of the advantages of vibe coding CLI tools like this is that it's easy for the AI to debug itself, and it's easy to see where it gets stuck.
And it usually gets stuck because of:
1. Errors
2. Don't know what command to run
So:
1. Errors must be clear and *actionable* like `app_port` ("8080") is a string, expected a number.
2. The command should have simple, actionable and complete help (`--help`) sections for all commands.
Sounds like it applies to humans using CLIs as well.
For sure.
Step 1: a no-excuses, never-fails undo.
Cursor has checkpoints. Good idea.
https://docs.cursor.com/agent/chat/checkpoints
This is a great post, thank you for sharing. I like the idea of giving hints to the LLMs.
To clarify, the example that was provided using `command_not_found_handler`, is that possible to implement in bash? Or perhaps you were saying this would be a nice to have if this functionality existed?
The `command_not_found_handler` function can be added to your .zshrc as is; bash's equivalent hook is named `command_not_found_handle` (no trailing "r").
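A minimal zsh version along the lines of the article's example (the "perhaps you meant" suggestion logic is left out here):

    # ~/.zshrc - zsh runs this whenever a command isn't found
    command_not_found_handler() {
      print -u2 "zsh: command not found: '$1'"
      print -u2 "zsh: current directory is $PWD"
      return 127
    }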
Just yesterday I updated a tool I made in 2020 for parsing and snipping sections of manpages, adding an LLM ingestion feature for fitting partial manpages into tight context windows (https://github.com/day50-dev/Mansnip).
There may be something more generalizable here.
That's pretty cool, man (pun intended).
I just built a library designed to help with part of this: detecting if a tool is being run in one of these environments. That would allow it to, for example, run in non-interactive mode or give extra context in logs.
https://github.com/ascorbic/am-i-vibing
The complicated GUI is simply a visualized version of the CLI utilities of the day, which were no less complicated.
I was thinking about this just the other day, and there was one from the late 80s that had scores of parameters, but I could not remember its name. I think it was an `ls` type utility.
Will this just be solved by agents being multimodal and using a computer in a more human way? Context is a solved UI problem - solved by the GUI. The GUI just lacks power, but an AI could have access to both.
Ironically, I really like Bulk Rename Utility; it's quite nice.
There is something about these power-user-oriented tools: they don't try to hide complexity, and they show the full list of features right away.
I'd rather see improvements in voice control and handwriting as means of communication.
Sorry for the snark, but we couldn't even do this for humans, but let's do it for poor poor LLMs? It's kind of ironic that NOW is the time we worry about usability. What happened to RTFM?
Emacs is the CLI/TUI rethought.
> This started a game of whack-a-mole where the LLM would also attempt to change the pre-commit hooks! I had to fix it by denying […]
When will people acknowledge that LLMs are stochastic text generators?
This whole blog reads like trying to fit a square peg into a round hole. And frankly most of the comments in this thread are jumping right on the wagon, “what water?”-style [1]
By all means use LLMs for what they can be useful for but god damnit when they are not useful please acknowledge this and stop trying to make everything a nail for the LLM-hammer.
LLMs are. not. intelligent. They don’t have a work ethic that says “oh, maybe skipping tests is bad”. If they generate output that skips tests, it’s because a large enough part of the training data contained that kind of text.
[1] fish joke
The whack-a-mole thing is a huge "this thing is not useful" indicator to me, and I am really confused how other people don't see it. Ok, there's an agent and the agent is able to figure out stuff and do stuff on its own. Great. But it's trying to cheat and instead of doing what I'm asking it just tries to go the easiest fastest way to claim "job done". How is that useful? If I had an intern do this I would seriously consider getting rid of them.
This is elementary school stuff. Do the assignment, don't cheat. Does useful software get written by people who don't understand this basic fact?
command line interface interface
You can give AI access to your terminal, dude. I'm fine over here, thanks.
“Rethinking command line interface interfaces with AI” I would have expected nothing less from an airticle.
Language is just too hard a challenge.
The answer isn't "let's tear up and redo our tools in the hope that it will benefit black boxes with non-deterministic output".
The answer is:
- make non-deterministic black boxes more deterministic and less black-box
- improve tools for humans
Fortunately improving tools for humans tends to improve them for the non-deterministic black boxes too
It's a feeling that really hasn't been tested or verified (like any other feeling when it comes to LLMs)
> We need to augment our command line tools and design APIs so they can be better used by LLM Agents.
lol no. The right way to get a program to interact with another program is through an API
The command line tools are also APIs.
We don't necessarily need to replace the versions humans use - though some of the changes might well make tools better for humans too - but most of the tools I add for my coding agent are attempts at coaxing it to avoid doing things like the "head" example in the article.
That’s just evidence that these sophisticated next-token predictors are not good enough yet. The world should not bend over backwards to accommodate a new tool. The new tool needs to adapt to the world, or only be used in the situations where it is appropriate. This is one of the problems of calling LLMs AI: a language model lacks understanding.
Many of us have actual work to get done, rather than conform to purity tests for the sake of it. Nobody will erase the other versions of these tools by making adaptations that are more suitable for this use.
Why rethink tools that have existed since the 70s and function predictably for a landscape that drastically shifts every two months? Seems shortsighted to me.
IDEs have changed a lot in the last 50 years. Just like we shouldn't advocate for hand writing assembly for all code, we shouldn't be stuck using CLI tooling the same way.
I share your apprehension regarding the current AI landscape changing so quickly it causes whiplash but I don't think a mindset of "it's been fine for 50 years" is going to survive the pace of development possible by better LLM integration.
The reason that tools have not changed that much is that our needs haven't changed that much either. Even something like `find` or `ffmpeg`, while complex, is not that complicated to use. They just require you to have a clear idea of what you want. And the latter is what most people advocating for LLMs want to avoid.
IDEs have not changed that much. They've always been an editor supercharged with tools that all share the same context of a "project". And for development, it's always been about navigation (search and goto), compile/build, and run.
Many of the changes that would work for LLMs would also be beneficial to users.
Not at all. The shell already provides ways to get contextual information (PS1, ...), and commands generally provide error messages and error codes.
In one of the examples provided:
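    $ sdfsdf
    zsh: command not found: 'sdfsdf'
    zsh: current directory is /Users/ryan
    zsh: Perhaps you meant to run: cd agent_directory; sdfsdf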
You could just use `pwd`, like most people who put the current directory in the $PS1, to make sure that the agent stays in the correct directory.

Yeah, this example isn't great - you can just tell the LLM to run pwd more frequently or something.
But for the `$command | head -100` example, the usage is a bit different. I run into this myself on the CLI, and often end up using `less` in similar contexts.
Two cases:

1) Sometimes I use head to short-circuit a long-running but streaming-output command, so I can assess whether it is starting to do the right thing without bearing the time/computational cost of full processing.

2) Sometimes the timing doesn't matter but the content is too verbose, and I need to see some subset of the data. But here head is too limited. I need something like wc & head and maybe grep in one command line, with context. Maybe something like:
    $command | contextual-filter -grepn 5 -grep error -head 10

    some data ... (just the first 10 lines)
    ... an error message with 5 lines of context before and after ...
    Summary: 100000 total lines, 15 printed, exited with code 0
You can do all that already with grep and others, but you need to run multiple commands to get all the context.
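A rough bash approximation of that hypothetical contextual-filter (the function, its arguments, and the output format are made up to mirror the example above):

    # contextual-filter in one pass: head, grep-with-context, and a summary
    # usage: some-command | contextual_filter <context-lines> <pattern> <head-n>
    contextual_filter() {
      local ctx="$1" pattern="$2" head_n="$3"
      local tmp; tmp=$(mktemp)
      cat > "$tmp"                             # buffer stdin so we can scan it twice
      head -n "$head_n" "$tmp"                 # the first N lines
      grep -n -C "$ctx" -- "$pattern" "$tmp"   # matches with surrounding context
      printf 'Summary: %s total lines\n' "$(wc -l < "$tmp")"
      rm -f "$tmp"
    }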
1) That's why some tools have a simulate option, or you can just kill -9 the processes you've just launched. Just make sure you've captured their output in a file.

2) Again, logs, if action needs to be taken after the command has stopped. For immediate action, you can use `tee`.
Managing context isn't hard. I see more issues with ensuring the right command.
I'm sorry, but no. The tools work. I don't need "more context" from my `less` or `more` commands. The LLM can train on the man pages just as a human can read the man pages.
What man page? I have never worked on a product with one. We're not teaching the LLM how to use `ls`; we are talking about the code being written today.
edit: Mea culpa.
> I think watching the agents use our existing command line utilities get confused and lost is a strong indicator that the information architecture of our command line utilities is inadequate.
Seems pretty clear the article is talking about teaching LLMs how to use 'ls'.
> The agents may benefit from some training on tools available within their agents. This will certainly help with the majority of general CLI tools, there are bespoke tools that could benefit from adapting to LLMs.
Definitely not just `ls`.
I also feel like command line agents are pretty simple. It's tailor-made for tool use.
    while(true):
        >> User requests something
        << The LLM picks a CLI tool from an index
        << LLM grabs the manual for the tool to get the list of commands
        << Attempts to fulfill the request
I would not be shocked if engineers have already managed to overcomplicate these agents.
You can pretty much obviate that with an alias that catches the user requesting something and then operates deterministically. What is nice about aliases is that you don't need to learn other people's semantic patterns; you craft ones that make sense to you and your use cases, and then they always work and consume virtually no resources.