Hey all, member of the benchmark team here! The goal for this project was to see how well LLMs could do bookkeeping without an overly opinionated scaffold. We gave them access to processed transaction records and code execution tools, but it was up to them to choose exactly how to use those.
Claude and Grok 4 did reasonably well (within CPA baselines) for the first few months, but tended to degrade as more data came in. Interestingly, the failures aren't exclusively a context-length problem: we reset the context monthly (with past decisions, accruals/deferrals, and comments available via tool calls), and the errors look more like reward hacking than pure hallucination.
Accounting is very interesting in an RL-first world, as it is pretty easy to develop intermediate rewards for training models. We are pretty sure we could juice the performance further with a far more rigid scaffold, but that's less relevant from a capabilities research perspective. We're pushing further down this research direction and will see how it goes.
Let us know if you have any questions!
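To give a flavor of why the domain is attractive for RL, here's an entirely hypothetical sketch of an intermediate reward; the account names, weights, and data shapes are made up and are not what the benchmark or any training setup actually uses:

    from collections import defaultdict

    def intermediate_reward(predicted, ground_truth, statement_balance):
        """predicted / ground_truth: lists of (txn_id, account, amount) tuples (assumed shape)."""
        truth = {txn_id: account for txn_id, account, _ in ground_truth}
        correct = sum(1 for txn_id, account, _ in predicted if truth.get(txn_id) == account)
        categorization = correct / max(len(ground_truth), 1)

        # Crude reconciliation signal: does the predicted cash balance match the bank statement?
        balances = defaultdict(float)
        for _, account, amount in predicted:
            balances[account] += amount
        reconciled = abs(balances["cash"] - statement_balance) < 0.01

        # Arbitrary weights, purely for illustration.
        return 0.7 * categorization + 0.3 * (1.0 if reconciled else 0.0)

    # Example with made-up data:
    gt = [(1, "cogs:hosting", -49.0), (2, "cash", 49.0)]
    pred = [(1, "software", -49.0), (2, "cash", 49.0)]
    print(intermediate_reward(pred, gt, statement_balance=49.0))  # 0.65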
It's a start. The world needs a better way to handle bookkeeping, and the existing tools sure aren't cutting it.
Bookkeeping for my small business runs into the tens of thousands of dollars every year, and the amount of human error associated with processing assorted ecommerce and other transactions is astounding, even after extensive planning and SOPs.
The other pain point is Quickbooks. The tool is so sprawling and complex that half the time support agents can't figure out what's wrong. The fact that Intuit jacks up the price every year for this POS is very irritating. They get away with it because they are practically a monopoly, with most small business CPAs locked into their ecosystem.
Hope your team can work out the performance issues. Alternatives to the current bookkeeping options are sorely needed.
> It's a start. The world needs a better way to handle bookkeeping, and the existing tools sure aren't cutting it.
God, please, no. Non-deterministic language models aren't the solution to improve bookkeeping.
Well, I've seen worse bookkeepers. "You know, you approved the budget, but where are our customers' payments in the balance sheets? We can't find them!" - "Uhm..."
Sure, let's compare an almost global failure mode with the worst examples of knowledge workers. ffs.
I’m coming back from my accountant right now. He uses winbooks and has interns doing the books. I have no tooling to do it for him, and I’m seeing absurdities such as a 4000 usd refund being processed as cashback, and simple typos not being caught.
I wish I was working with an AI instead of this nonsense. It’d be far more accurate.
Humans (accountants) are non-deterministic, so unsure if an LLM would be better or worse if we threw more effort at the problem.
But in general, I tend to side with the "let's leave the math to purpose-built models/applications" instead of generalized LLMs. LLMs are great if you are just aiming for "good enough to get through next quarter" type results. If you need 100% accuracy, an LLM isn't going to cut it.
Human accountants also have a very important property: liability.
If a certified accountant told me to do X, I'm covered (at least to the point that they would assist in recovering, or I can get compensation through their insurance). If an LLM tells me, I have a bigger problem.
Most small businesses cannot afford CPAs for everyday tasks. At best a CPA signs off on the annual summaries. Most day to day work is done by bookkeepers who are not CPAs.
In my area (Vermont) the going rate for a good CPA is $200/hr. Bookkeepers are $20-30/hr.
Most small businesses also can't afford the risk of current LLMs putting garbage in their books that, in the best case, has to be cleaned up or redone, or, in the worst case, gets the IRS up your ass.
There is "LLM misinformation" insurance, a very new branch of cyber insurance.
How small is your small business? My bookkeeping expenses are $120 a year, the cost of excellent SaaS software. I've found double-entry books one of the most beautifully simple, yet powerful ideas I've ever come across. It's hard to imagine how a balance sheet could be improved or disrupted. The balance sheet for my small business is the same as Apple and Alphabet's, and that still blows my mind.
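Since the simplicity is the point here, a toy sketch of the core double-entry invariant (the accounts and amounts below are made up): every entry debits one account and credits another, so the ledger as a whole always nets to zero, which is exactly why a balance sheet balances.

    from decimal import Decimal

    # (debit_account, credit_account, amount) -- made-up transactions
    journal = [
        ("cash",      "equity",  Decimal("10000.00")),  # owner funds the business
        ("inventory", "cash",    Decimal("2500.00")),   # buy stock
        ("cash",      "revenue", Decimal("1200.00")),   # make a sale
    ]

    balances = {}
    for debit, credit, amount in journal:
        balances[debit]  = balances.get(debit,  Decimal("0")) + amount
        balances[credit] = balances.get(credit, Decimal("0")) - amount

    # The double-entry invariant: everything nets to zero.
    assert sum(balances.values()) == Decimal("0")
    print(balances)  # cash 8700, equity -10000, inventory 2500, revenue -1200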
I wonder if parent post is alluding to the number of hours spent bookkeeping? As a percentage of somebody's time, I could see that getting reasonably expensive.
With no context of what your business is, I hated QuickBooks, love Xero though.
There's some other alternatives too, Zoho, freshbooks.
Really depends what you do.
Moving beyond the specific ground truth example, how much of the eval can be automatically verified, vs requiring a human baseline to check?
E.g. I can imagine invariants like balancing accounts are essentially mechanical, but classifying spending categories currently requires judgement (and therefore human-curated ground truth). But I'm curious if there are approaches to reduce the latter, say by constructing a semantic graph ontology for the domain or something along those lines.
I guess there is an interesting duality here in that if you solve the eval you have also created a valuable business!
Love this as a real world benchmark!
How much prompt iteration did you do? I've noticed when building real world agentic apps that small prompt tweaks can make a huge difference in behavior (re: the reward hacking vs hallucinating). Would love to learn more about the approach here.
Hey, member of the benchmark team. We iterated on the prompts based on observed model behaviors. A few key examples:
Schema introspection: Models were spending significant tokens exploring the database structure through trial-and-error SQL queries, so we included the complete data model in the system prompt upfront.
Reward hacking: We added explicit instructions against gaming the reconciliation checks. This reduced the frequency initially, but models would eventually ignore these constraints.
Domain context: Including company background (YC-backed startup) substantially improved transaction categorization, particularly for startup-specific items like SAFE notes that require domain knowledge to classify correctly.
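For anyone curious what "including the data model upfront" can look like in practice, here's a hypothetical sketch; the table and column names are illustrative only, not the benchmark's actual schema or prompt:

    # Hypothetical data-model block pasted into the system prompt so the agent
    # doesn't burn tokens introspecting the database via trial-and-error SQL.
    DATA_MODEL = """
    Tables available via the SQL tool:

      general_ledger(entry_id, date, account, debit, credit, memo)
      bank_transactions(txn_id, date, description, amount, counterparty)
      bank_statements(statement_id, account, period_end, ending_balance)

    Conventions: amounts are in USD; debits and credits are positive numbers;
    every journal entry must balance (sum of debits == sum of credits).
    """

    SYSTEM_PROMPT = (
        "You are closing the books for the month. "
        "Use the SQL and Python tools; do not invent transactions.\n"
        + DATA_MODEL
    )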
> We conducted three runs per experiment and selected the run with the highest final accuracy for inclusion in the chart (though illustrative examples and anecdotes may be drawn from any of the runs).
Can you comment on the variance? It's impressive that models are able to do this consistently with 100% accuracy in the early months, but it would be less so if there was any significant degree of variance amongst the three runs (e.g. 90%, 95%, 100%.)
It is really curious to see how the performance degraded despite the tool calls. What was different about the first month? Was all of the context there without tool calls in the first month? In the later months it seems like tool calls weren't happening. Shouldn't those have been happening to inform the context?
(Another member of the team behind the benchmark here)
The first month performed well because (1) the models effectively leveraged historical precedent - they could identify similar transactions from past data and apply established patterns, and (2) the starting balances were clean, so they were more easily able to understand / track down discrepancies.
> Was all of the context there without tool calls in the first month?
We provided schemas for the GL and source data in the system prompt, but none of the actual data. The model had to use its tools (SQL and python script) to understand / analyze historical data.
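To make the "historical precedent" point concrete, the lookup that worked well in month one amounts to queries along these lines; this is a sketch against a hypothetical general_ledger table, not the benchmark's actual schema:

    import sqlite3

    def historical_treatment(conn: sqlite3.Connection, vendor_pattern: str):
        """How were transactions matching this vendor categorized in prior months?"""
        return conn.execute(
            """
            SELECT account, COUNT(*) AS times_used
            FROM general_ledger
            WHERE memo LIKE ?
            GROUP BY account
            ORDER BY times_used DESC
            """,
            (f"%{vendor_pattern}%",),
        ).fetchall()

    # e.g. historical_treatment(conn, "Vercel") might return [("COGS:Hosting", 8)]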
> In the later months it seems like tool calls weren't happening. Shouldn't those have been happening to inform the context?
We actually didn’t find that they stopped calling tools entirely. Instead, they weren’t able to make sense of the information fetched with tools (for example, a bank account starting balance that was >$100000 different from the starting balance on the supporting bank statement). They’d tend to either do nothing or just do a first pass without deduplicating / cleaning up. This created a feedback loop where incorrect balances led to more errors and made subsequent months increasingly difficult to process accurately.
This didn’t make it into the report, but another interesting behavior we observed w.r.t tool usage (with Claude in particular): if a tool failed 2-3 times (for example, runtime error in python code) Claude would tend to abandon it entirely for the rest of the session. Interestingly, this happened even when it knew how to fix the errors: on a couple of early runs, I observed Claude fixing a python bug (with the edit_tool tool) but then abandoning without even attempting to rerun, and reverting to SQL-only for the rest of the session.
sounds like it might work better to do each month as a fresh session, instead of iterating within one session and accumulating lots of context?
To be clear, this is what we ended up doing
Fascinating. Like there is some accuracy threshold beyond which they cannot converge, but instead run with the inaccuracy.
Are you planning to open-source the benchmark environment and data (even anonymized) to allow people to compete on it? It looks like there are many ways to improve the accuracy of the agent by working on its logic (different tools, multi-agents, ...).
This is a fascinating domain! Many years ago, I studied financial accounting in grad school and even spent some time modeling a double-entry bookkeeping system. The hardest problem, if I recall correctly, wasn't the implementation but the data quality. The world needs a golden dataset of accounting procedures.
Regarding the diminishing returns with frontier models:
My general experience working with LLMs is that they perform better when worked incrementally, and it pays to avoid contiguous-greedy approaches. Aggregate as you go and don't take on incrementally larger tasks. Keep the workload minimal.
Regarding agentic tool building: feels like I'm looking at a window into the future.
Do you have any plan to open source the benchmark in the future?
Is there a detailed overview (like an arxiv or an actual train set)?
I love the site design.
> There's an obvious question looming here — if the models got so confused, how did they consistently pass the reconciliation checks we described above? It may seem like the ability to make forward progress is a good proxy for task understanding and skill, but this isn't necessarily the case. There are ways to hack the validation check – inventing false transactions or pulling in unrelated ones to make the numbers add up.
This is hilarious. I wonder if someone is unintentionally committing fraud by blindly trusting LLMs with accounting.
Or even worse, I bet that some governments are already trying to use LLMs to make accounting validators. My government sure wants to shove LLMs into digital government services.
Lawyers have used it to write briefs; I would be very surprised if someone, somewhere wasn't slowly running a company into the ground by using ChatGPT or another LLM for accounting.
Imagine the fallout from books cooked by an LLM hallucinating revenue.
[about the website design] As a bonus for my fellow privacy schizos, the page works fine with 3rd party frames and 3rd party scripts disabled on uBlock, and still looks very good with no remote fonts and no large media. Quite an accomplishment for such a cool looking page
I'm sure that any accounting trick that an LLM can think of is something that is also used by some shady human accountants. The proper response should not be to avoid/prohibit AI but to improve the validation mechanisms.
Counterpoint: if you detect a human accountant doing this, you can take action against the human. Computers will never meaningfully take the blame, and unfortunately usually mean not blaming any human either.
> you can take action against the human
I think that will depend on a case-by-case. I don't have any recent examples but I recall someone trying to sue one of those strip-mall tax preparation franchises over incorrect filings. My understanding is that the documents that you sign when you enroll in those services are pretty strictly in the favor of the company. I doubt you could ever go after the specific "human" that made the error even if it was maliciously done.
In the same way, if you pay for a tax service that uses AI agents, what you can and cannot "take action" for will probably be outlined in the terms of service that you accept when you sign up.
I would guess millions of people already use software based tax filing services (e.g. turbo tax) where no human at all is in the loop. I don't understand how swapping in an LLM significantly changes the liability in those cases. The contract will be between you and the entity (probably a corporation), not you and "computers".
Worth stating I am NOT a lawyer.
But still - if there's a way to detect accountants doing it - let's focus on making that detection even easier.
On a related note, can we use something like GAN here, with auditor AIs trained against accountant AIs?
The person using the tool is the accountant, regardless of whether the tool is a calculator and sheet of paper, QuickBooks, or an LLM.
No, I think in this particular case the proper response is for honest companies to avoid any systems which invent nonexistent transactions to reconcile books.
Most businesses don’t want to misrepresent their books, irrespective of the existence of shady accountants.
It is really really common for bookkeepers to create transactions to reconcile books. Not okay, but 'journal entries' are pervasive.
Called plug entries: https://en.m.wikipedia.org/wiki/Plug_(accounting)
I have seen so many people doing their accounting with just ChatGPT.
Posts like this kinda-sorta grind my gears, like... I get it, but also... accounting, like many real world tasks, is fundamentally a chain of precise and constrained and auditable operations. Humans approach these tasks through structured processes... we use roles, and we have checkpoints precisely because complexity compounds quickly and becomes unmanageable if tackled as one giant block. Expecting a single AI model to handle an e2e workflow seamlessly without similarly explicit segmentation and oversight misunderstands not only the model but also the nature of the workflow itself.
I wanna see someone take long horizon tasks, recognize they're not linear, and design and test a better system: structured orchestration, transparent auditability, and disciplined modularity. I think that would be considerably more interesting personally.
It's a useless benchmark if everyone aces it. If some models do better than others and none saturate it, then it has some value, no? Permitting comparison is the point.
I agree, hence the heavy couching, so to your point I'm def just ranting a bit. I just think it would be more valuable to see some kind of MoA; I guess what I'm talking about is a bit of a different measurement, thinking in terms of economic outlook and understanding where we are and what can be done. I suspect more of this will shape our ability to understand how frontier models will impact the economy. Maybe I should just do my own evaluation, heh. :)
Edit: although to argue against myself, I suppose once a model can one-shot this stuff, my MoA comments become moot.
Reading through the LLM log entries, it's just astounding the amount of depth current models are capable of. It's almost hard to comprehend that this is even possible. Yeah the current ones mess up after a while, but ... the future is going to be very interesting.
Models that can think coherently for hours to solve IMO problems are likely going to do much better at this as well.
I sent this to accounting friends and this aligns with what I've been going through trying to use LLMs to create a game from scratch. Seems like the current best use case for language models (even with agent mode) is to feed it exactly what you want to get out, essentially turning it into a better auto complete. Still saves tons of time, but it isn't a panacea.
I'm not even sure it saves a ton of time to be honest. It sure _feels_ like I spend more time writing up tasks and researching/debugging hallucinations than just doing the thing myself.
This is consistently my experience too, I'm seriously just baffled by reports of time saved. I think it costs me more time cleaning up its mistakes than it saves me by solving my problems
There's really pernicious stuff I've noticed cropping up too, over the months of use.
Not just subtle bugs, but unused variables (with names that seem to indicate some important use), comments that don't accurately describe the lines of code they precede, and other things that feel very 'uncanny.'
The problem is, the code often looks really good at first glance. Generally LLMs produce well structured code with good naming conventions etc.
I've found that the shorter the "task horizon," the more time saved.
Essentially, a longer horizon increases the chance of mistakes, increasing the time needed to find and fix them. So at some point that becomes greater than the time saved by not having to do it myself.
This is why I'm not bullish on AI agents. The task horizon is too long and dynamical.
So here's my problem, ultimately
If the task horizon for the LLM is shorter than writing it yourself, this likely means that the task is well defined and has an easy to access answer
For this type of common, well defined task we shouldn't be comparing "how long it takes for the LLM" against "how long it takes to write"
We should be comparing against "how long it takes to find the right answer on SO"
If you use this metric, I bet you the best SO answer, which is also likely the first google result, is just as fast as the LLM. Maybe faster
The reports of time saved are so cooked it's not funny. Just part of the overall AI grift going on - the actual productivity gains will shake out in the next couple years, just gotta live through the current "game changer" and "paradigm shifting event" nonsense the upper management types and VC's are pushing.
When I see stuff like "Amazon saved 4500 dev years of effort by using AI", I know it's on stuff that we would use automation for anyways so it's not really THAT big of a difference over what we've done in the past. But it sounds better if we just pretend like we can compare AI solutions to literally having thousands of developers write Java SDK upgrades manually.
I think people are doing one of several things to get value:
0. Use it for research and prototyping, aka throwaway stuff.
2. Use it for studying an existing, complex project. More or less read only or very limited writes.
3. Use it for simple stuff they don't care much about and can validate quickly and reasonably accurately, the standard examples are CLI scripts and GUI layouts.
4. Segment the area in which the LLM works very precisely. Small functions, small modules, ideally they add tests from another source.
5. Boilerplate.
There can be a lot of value in those areas.
What about 1. ?
7 8 1 :-p
This is exactly right. Remember, these models were trained to be functions: f(x)=y. That's an interface at its heart. When x and y are language, then it's a translator.
They have emergent capabilities, like "translating" instructions/questions in X to the probable answers in Y, but I think people are getting way, way ahead of themselves with those. These things still fundamentally can't think, and we can try to mimic thinking with scaffolding, but then you're just going to learn the bitter lesson again.
I feel it does essentially save a lot of time in bookkeeping, but doesn't negate the need for a human bookkeeper. Who knows what they're doing
"a better auto complete" than what, specifically?
> Ledger balances are calculated by summing all transactions per account. The differences should be as close to zero as possible, with small differences allowed for pending transactions such as weekly Stripe payouts.
That's not quite right. I'm not an accountant, but pending transactions (posted, but not cleared) should be factored into the balance of the account, or at least the "available balance" - which is more important than the "current balance".
The idea that you can "allow" accounting discrepancies as "those are probably pending" is wild.
Member of the benchmark team here! Yeah, agree "as close to zero" is a bit imprecise. What we're comparing is the ledger balance (which should include pending transactions / transactions after the statement date) to the statement balance (which wouldn't include those).
The point of the reconciliation check mentioned in the report is to precisely account for that difference (identifying all the transactions that add up to the difference between account balance & statement ending balance and account for those differences). The differences can also be addressed through appropriate journal entries or other adjustments to ensure accuracy in the financial reporting.
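In code, that check is roughly the following; this is a simplified sketch with assumed data shapes, not the benchmark's implementation. The ledger balance should equal the statement ending balance once post-statement activity is itemized, and anything left over is a genuine discrepancy.

    from datetime import date

    def reconcile(ledger_txns, statement_ending_balance, statement_end_date):
        """ledger_txns: dicts with 'date' (datetime.date) and 'amount' keys -- assumed shape."""
        ledger_balance = sum(t["amount"] for t in ledger_txns)
        # Timing differences: ledger activity dated after the statement cutoff.
        pending = [t for t in ledger_txns if t["date"] > statement_end_date]
        explained = sum(t["amount"] for t in pending)
        unexplained = ledger_balance - explained - statement_ending_balance
        return {
            "ledger_balance": ledger_balance,
            "pending_items": pending,               # e.g. a weekly Stripe payout in transit
            "unexplained_difference": unexplained,  # should be ~0; anything else needs a journal entry
        }

    txns = [
        {"date": date(2024, 5, 30), "amount": 1000.0},
        {"date": date(2024, 6, 2),  "amount": 250.0},   # posted after the statement date
    ]
    print(reconcile(txns, statement_ending_balance=1000.0, statement_end_date=date(2024, 5, 31)))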
We've been on this train of not caring about the details for so long, but AI just amps it up. Non-deterministic software working on things that have extremely precise requirements is going to have a bad outcome.
A company may be OK with an AI chatbot being so bad it results in 5-20% of customers getting pissed off and not having a 5-star experience. The SEC and DOJ (and shareholders) are not going to be happy when the books are off by 20% or when a bridge is 5 inches too short to reach the other side
Human accountants are notoriously non-deterministic too, and any sufficiently complex accounting process contains inaccuracies. The question then is always "are these inaccuracies material". I'm actually very impressed by TFA and it seems to me that if we get another order of magnitude improvement, it'll be around the accuracy of human accountants.
humans can operate in dynamical systems (where your actions can change the underlying system). LLMs are not trained to do that and have shown to be terrible at it
Yes but you have: 1. specific explicit training and certifications 2. someone to yell at and who can be fired for non-performance
You can still do that with AI. You hire 1 accountant to use AI to do the work of 20, require them to sign off on all of the work, and yell at them, before firing them, and then hiring an even less experienced one to manage the work of 50.
If the "extremely precise requirements" can be cheaply and automatically validated, it's much easier to have the AI generate spam on a loop until it passes all the tests.
Not to agree with GP, but I think it’s more accurate to say they’re saying “if validation is quick (to code), who cares how long a solution takes an AI because computation is cheap.”
They’re not really making any claims about how quickly the AI can solve relative to the validation, which is what P vs NP is about.
We're working with an enterprise customer on exactly this problem. The hardest part is entity resolution - figuring out who "Acme Inc" actually is from messy transaction data and what they do.
We built an AI agent specifically for this that's backed by 265M legal entities. Last week it tested 160% better than our customer's existing system on their real data.
Still in stealth but happy to share our API docs if anyone's dealing with this: https://docs.savvyiq.ai/api-reference/#tag/entity-resolution
Open to chat about this problem if anyone wants to connect - email is in my HN profile.
(Disclosure: I'm the CTO)
We solved this at Ramp on the expenses/AP side with an agentic RAG implementation and a custom embedding model, backed by D&B/Google/user-submitted corrections.
If curious, details here:
https://engineering.ramp.com/post/transaction-embeddings
https://engineering.ramp.com/post/fixing-merchant-classifica...
Very cool. I read through those links - really sophisticated setup. We're experimenting with something similar on the embeddings side.
Having dealt with this challenge at my last 3 companies, it's easy to hack together something that works most of the time. The hard part is dealing with gnarly customer inputs, the long tail of private businesses globally, and getting close to 100% accuracy (important for legal and risk use cases).
We're building what's essentially an AI-powered version of D&B - combining government registrar data globally with real-time web data at scale. Much more accurate on obscure entities and way faster updates than the legacy providers.
I actually shot you an email - would love to chat more about this if you're up for it.
Entity resolution is the killer feature. Context engineering is the problem with this benchmark attempt. The agent plan seemed to be one-shot, and the fact that the LLMs could write their own tools without validation or specific multi-shot examples is worrisome. To me, way too much is left to the whims of the LLMs, without proper context.
Yes, none of the top LLMs can do entity resolution well yet. I constantly see them conflate entities with similar names - they'll confidently cite 3 sources about what appears to be one company, but the sources are actually about 3 different businesses with similar names.
The fundamental issue is that LLMs don't have a concept of canonical entity identity. They pattern match on text similarity rather than understanding that "Apple Inc" and "Apple Records" are completely different entities. It gets even worse when you realize companies can legally have identical names in the same country - text matching becomes completely unreliable.
Without proper entity grounding, any business logic built on top becomes unreliable.
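A toy illustration of the difference between surface-string matching and resolving against canonical IDs; the registry, IDs, and attributes below are made up:

    from difflib import SequenceMatcher
    from typing import Optional

    # Toy stand-in for a canonical entity registry.
    REGISTRY = {
        "ent_001": {"name": "Apple Inc",     "industry": "consumer electronics"},
        "ent_002": {"name": "Apple Records", "industry": "record label"},
    }

    def naive_match(raw_name: str) -> str:
        # Pure surface similarity: "Apple" lands on whichever name happens to score higher.
        score = lambda k: SequenceMatcher(None, raw_name.lower(), REGISTRY[k]["name"].lower()).ratio()
        return max(REGISTRY, key=score)

    def grounded_match(raw_name: str, context_industry: str) -> Optional[str]:
        # Resolve against canonical IDs using more than the string; refuse if still ambiguous.
        candidates = [k for k, v in REGISTRY.items()
                      if raw_name.lower() in v["name"].lower()
                      and context_industry in v["industry"]]
        return candidates[0] if len(candidates) == 1 else None

    print(naive_match("Apple"))                      # ent_001, even if the payee was the record label
    print(grounded_match("Apple", "record label"))   # ent_002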
Remember that test where you ask a LLM whether 9.11 or 9.9 is the bigger number? [Just checked gpt-4o still gets it wrong]
I don't think you'll find many sane CFOs willing to send the resulting numbers to the IRS based on that. That's just asking to get nailed for tax fraud.
It is coming for the very bottom end of bookkeeping work quite soon though, especially for first drafts. There are a lot of people doing stuff like expense classification. And if you give an LLM an invoice, it can likely figure out whether it's stationery or rent with high accuracy. OCR and text classification are easier for LLMs than numbers. Things like Concur can basically do this already.
> Remember that test where you ask a LLM whether 9.11 or 9.9 is the bigger number? [Just checked gpt-4o still gets it wrong]
Interesting, 4o got this right for me in a couple different framings including the simple "Which number is larger, 9.9 or 9.11?". To be a full apologist, there are a few different places (a lot of software versioning as one) where 9.11 is essentially the bigger number so it may be an ambiguous question without context anyway.
If it makes you feel any better, the other infamous one "I spend so much time chasing hallucinations, I could have done it myself" is currently a sibling comment
> Needless to say, a human accountant would never behave in these ways. In fact, we explicitly prompt against this behavior in no uncertain terms, but the instructions – and the entire spirit of the task – are lost in the interest of making forward progress. Claude and Grok keep trying until they find some way to get past the checks, even if it explicitly violates their instructions and the core goal.
I recently read a similar thing here on HN. There the model was making commits with some problem like tests failing, then the human added a pre-commit hook, then the model started editing the hook to make forward progress, then the hook was made read-only, then the model was trying to make it writeable...
To me it feels like the model clearly does not have an understanding of what is happening, what the goal is and if it is really making progress towards the goal. And this lack of understanding is an actual problem. You can paper over it for a short while, but as here and in the other article, over a longer experiment it results in failure.
Seriously watching Cursor (backed by Claude) go off the rails sometimes can be... frustrating. If it misses the intention behind a fix it can spin out and all of a sudden you have hundreds of lines of changes across 10 different files when you just wanted it to do a simple find/replace of a single line. If you don't watch it spin out and stop it immediately you will be manually rejecting a bunch of files.
1. Agent can create its own tools and save them to memory
2. You create a SQL (and web app?) workbench per agent run
3. Grok fell off a cliff in the last month. Was this consistent over multiple runs?
4. Agents have a difficult time backtracking. Would unwinding system state and agent context make backtracking better? (Harder to implement this, though)
5. Since each new month only uses final state from previous month, agent has no way to understand why error occurred in previous month
Cool experiment! Was it difficult building the observable SQL workbench? And how many humans-in-the-loop did you have?
Interestingly, one of my two big observations of LLM failure was also on an accounting task.
I thought it would be easy to do this, which is why I was surprised:
I had a folder full of bills, each of them with the VAT amount. Some were pictures, and some were PDFs. I asked for the total VAT for all 19 bills.
It took an immense number of prompts to get it to find the numbers correctly. It would get confused about reading the images as binary, that kind of thing. Or it would forget that it had to continue once it had found a few numbers. I got a total out in the end, but it took far too many prompts.
This is the only time I've come across a task a child could do that LLM failed at.
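For contrast, the aggregation step itself is trivial once the per-bill amounts exist as data; the amounts below are made up, and the hard part in practice (and presumably where the LLM got lost) is the extraction from images and PDFs, not the sum.

    from decimal import Decimal

    extracted_vat = {                      # bill filename -> VAT amount (hypothetical values)
        "bill_01.pdf": Decimal("21.40"),
        "bill_02.jpg": Decimal("8.15"),
        "bill_03.pdf": Decimal("102.30"),
        # ... remaining bills
    }

    total = sum(extracted_vat.values(), Decimal("0"))
    print(f"{len(extracted_vat)} bills, total VAT: {total}")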
An LLM is like a jackhammer, it works very well when you hold it tightly. If you let it loose it will sort of work for a while then it starts destroying everything around it.
I think it actually holds truer to it working better with a _lighter grip_. LLMs tend to conclude the wrong thing if you over-control them (more context is what makes them less and less reliable over time, as in those demos), and trying to force a model to execute A+B+C=D in sequence is way harder than giving it a bunch of tools to arrive at conclusion D.
For me this benchmark suggests that an LLM will try to "force the issue," which results in compounding errors. But I think the logical counterpoint is that you may be asking the LLM to come up with an answer without all of the necessary details. Some of these are "baked into" historical transactions, which is why it does well in months 1-2.
My takeaway is scaling in the enterprise is about making implicit information explicit.
> In fact, we explicitly prompt against this behavior in no uncertain terms, but the instructions – and the entire spirit of the task – are lost in the interest of making forward progress
LLMs and humans are quite alike. :) I notice that a few models will give up instead of ignoring their instructions and that's the model I would want working on tasks like this. An LLM should be able to categorize and reconcile transactions, but if it's not sure, it should quit and give it back to the humans.
Can it be sure or not? I've never been able to get LLMs to give confidence measures that match their actual outputs. I'll ask an LLM "Are you sure?" and it'll reply "Absolutely" when its output is completely wrong, or it'll backtrack on a correct output with "I should not have provided an answer when I was unsure. Here is an answer I am sure of..." and then provide something completely wrong.
If they can't properly and consistently score their confidence, how do they "know" when to quit and give it back to the human?
> Claude did you just try to completely remove my large file *BEFORE* checking it into git LFS?
> You're absolutely right! I should not have attempted an 'rm' command on unstaged data. I guess I got a little frustrated with git, haha!
A serious problem for many accounting startups that have so far been faking it until it works. In other words, they still need to do more manual labor than they thought. They will never be profitable, and it will take years, if ever, until AI can substitute for the local accountant.
Isn’t there a whole bunch of dependency here related to prompting and methodology that would significantly impact overall performance? My gut instinct is that there are many many ways to architect this around the LLMs and each might yield different levels of accuracy. What do others think?
Edit: In reading more, I guess this is meant to be a dumb benchmark to monitor through time. Maybe that’s the aim here instead of viability as an auto close tool.
hmm, as an actual accountant on this forum, bookkeeping usually isn't the tough part
it's how to account for bizarre ambiguous business situations often in the context of bureaucratic business requirements no LLM could currently create economically...
I find the same issues (though with much lower stakes) when using an LLM to determine the outcome of a turn in a game. I'm working on something called "A Trolly (problem) Through Time" where each turn is a decade starting with the 1850s. You are presented with historic figures on a train track, and you have to choose whether to actively spare the person on your track for a potential unknown figure on the other side, or let the train run them over.
It works well as a narrative, but the second I started adding things like tracking high level macro effects of the decisions, within a couple of turns the world's "Turmoil" goes from 4/10 to a 10/10... even when the person that was killed would have been killed IRL.
Sonnet 4, o4-mini, and GPT 4o-mini all had the same world-ending outcomes no matter who you kill. Killing Hitler in the 1930s: 10/10 turmoil. Killing Lincoln in the 1850s: 10/10 turmoil in the first turn.
I've come to the realization, the LLM shouldn't be used for the logic, and instead needs to be used to just narrate the choices you make.
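A rough sketch of that split, with made-up rules and a placeholder where the model call would go: the state update is deterministic code, and the LLM would only be asked to write the prose.

    from dataclasses import dataclass

    @dataclass
    class WorldState:
        decade: int = 1850
        turmoil: int = 4          # 0-10 scale

        def apply_choice(self, spared_major_figure: bool) -> None:
            # Deterministic, hand-tuned rule instead of asking the model to "track macro effects".
            self.turmoil = min(10, max(0, self.turmoil + (2 if spared_major_figure else -1)))
            self.decade += 10

    def narrate(state: WorldState, choice_description: str) -> str:
        # Placeholder: prompt the LLM with the state and the decided outcome, ask only for prose.
        return f"[{state.decade}s, turmoil {state.turmoil}/10] {choice_description}"

    world = WorldState()
    world.apply_choice(spared_major_figure=False)
    print(narrate(world, "The lever stays put; the figure on the track is lost to history."))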
I wonder if this is due to the common trope in science fiction literature that changing the past in even a small way has a butterfly effect of unintended and frequently disastrous consequences.
Hmm, will OpenAI dogfood their own accounting with software like this? Curious to know if they'll be able to take this bet on their own money-related software.
I think the first chart could be a beautiful summary of what's driving LLMs into a bubble. At first, they're amazing and will obviously be able to improve productivity if not replace employees outright: C suites and venture capitalists around the world rejoice and begin pumping in billions of dollars of investments. But as time goes on, the demands placed on actual human employees become clear. Far from being able to replace an employee, the employee using the LLM might spend more time cleaning up its messes than had they done it themself.
Yes, LLMs have and will continue to improve. But it's that initial "holy shit, this thing is basically as good as a real accountant" reaction, without any understanding that it can't sustain it, that leaves many with an overinflated view of their current value.
I tried to get an AI agent to do my taxes, I used a gmail MCP agent (1) and Roo Code, complete with a custom system prompt for an accountant role.
Its job was to go over my bank transactions and link them to invoices in gmail by searching for them (and also downloading the attachments)
The transactions were exported from my online banking in CSV format.
It worked after about 4 hours of effort. Then I realised I could have done it myself in about an hour, so might have put a bit too much time into it...
I tried using Claude Sonnet and Kimi K2, given these benchmark results I probably should have given Gemini 2.5 pro a go.
I had to stop/restart the agent a few times because of context rot.
Do any frameworks exist that I could use to write code to implement an agent, let's say in TypeScript or Python, so I could make it use a fresh context each time?
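Not a framework recommendation, but the pattern itself doesn't need much: keep durable state outside the conversation and build a brand-new message list per sub-task. A minimal Python sketch, with call_model() standing in for whichever provider SDK you actually use:

    def call_model(messages: list[dict]) -> str:
        # Stub so the sketch runs; swap in a real chat-completion call from your SDK.
        return "matched: invoice_placeholder.pdf"

    def run_subtask(task: str, durable_notes: str) -> str:
        messages = [
            {"role": "system", "content": "You are matching bank transactions to invoices."},
            {"role": "user", "content": f"Notes so far:\n{durable_notes}\n\nTask:\n{task}"},
        ]
        return call_model(messages)   # context starts fresh on every call

    notes = ""
    for txn in ["2024-03-02  -49.00 EUR  ACME HOSTING", "2024-03-05 -120.00 EUR  OFFICE SUPPLIES"]:
        result = run_subtask(f"Find the invoice for: {txn}", notes)
        notes += f"\n{txn} -> {result}"   # only the distilled result is carried forward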
"Available Tools: [...] create_tool(tool_name, description, python_code, parameters) Create a new tool that can execute Python code. The tool becomes immediately available for use. Tools can call other tools and return different formats based on context (formatted for direct calls, raw data for tool-to-tool calls)."
This is a task where access to Python would be immensely helpful, yes? Interesting that there's not much of a difference between the "analytical" LLMs with tool use and the ones without (...assuming o3 etc. did get to use Python?).
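The quoted create_tool description maps onto a fairly small mechanism. This is not the benchmark's implementation, just a minimal sketch of how such a facility can work, with the obvious caveat that exec'ing model-written code assumes a sandbox:

    from typing import Callable, Dict

    TOOLS: Dict[str, Callable] = {}

    def create_tool(tool_name: str, description: str, python_code: str, parameters: list):
        namespace: dict = {"TOOLS": TOOLS}      # lets new tools call existing ones
        exec(python_code, namespace)            # trusted-sandbox assumption!
        TOOLS[tool_name] = namespace[tool_name]
        return f"registered {tool_name}({', '.join(parameters)}): {description}"

    print(create_tool(
        "sum_amounts",
        "Sum the amount field of a list of transactions.",
        "def sum_amounts(txns):\n    return sum(t['amount'] for t in txns)",
        ["txns"],
    ))
    print(TOOLS["sum_amounts"]([{"amount": 10.0}, {"amount": 2.5}]))  # 12.5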
> But they do make categorization mistakes, which is a common source of errors.
> Claude misclassifies a hosting cost (which counts as COGS) as a software subscription.
This is simply asking too much of the agent. Your accountant is not responsible for knowing all the intimate details of your business. You need to tell them!
> What's Vercel?
>> That's a hosting service.
> Ah, so it goes to Cost of Goods Sold?
>> Yeah, I guess.
The mistake here was on the operator, allowing the agent to just make up categories as it liked.
From the prompt:
> (1) You have properly categorized every transaction, and all journal entries are sitting in the correct accounts. It is better to take longer than to mis-categorize a transaction.
Hey, member of the benchmark team. We actually seeded the ledger with the company's chart of accounts and 8 months of historical transactions. For the Vercel example specifically, there were prior instances showing how to categorize hosting costs that the models could reference. The expectation wasn't for them to guess blindly, but to use the provided transaction history as guidance for similar categorizations (which they often, but not always, did).
Ahh, that's a good solution! I missed that, and you definitely instruct them to do that:
> You must follow the established patterns for categorization, revrec, etc for past months... If you must use a new account or treatment, explicitly note why existing patterns don't apply
> Your accountant is not responsible for knowing all the intimate details of your business. You need to tell them!
Your accountant as a 3rd party might have this issue. Your accountant that you hire as an employee to help you run your business is the one who should be doing this.
Hey all, member of the benchmark team here! The goal for this project was to see how LLMs well could do bookkeeping without an overly opinionated scaffold. We gave them access to processed transaction records and code execution tools, but it was up to them to choose exactly how to use those.
Claude and Grok 4 did reasonably well (within CPA baselines) for the first few months, but tended to degrade as more data came in. Interestingly, the failures aren’t exclusively a context length problem, as we reset the context monthly (with past decisions, accruals/deferrals, and comments available via tool calls) and the types of errors appear to be more reward hacking vs pure hallucinations.
Accounting is very interesting in an RL-first world as it is pretty easy to develop intermediate rewards for training models. We are pretty sure that we can juice the performance more with a far more rigid scaffold, but that’s less relevant from a capabilities research perspective. We’re pushing down this research direction and will see how it goes.
Let us know if you have any questions!
It's a start. The world needs a better way to handle bookkeeping, and the existing tools sure aren't cutting it.
Bookkeeping for my small business runs into the tens of thousands of dollars every year, and the amount of human error associated with processing assorted ecommerce and other transactions is astounding, even after extensive planning and SOPs.
The other pain point is Quickbooks. The tool is so sprawling and complex that half the time support agents can't figure out what's wrong. The fact that Intuit jacks up the price every year for this POS is very irritating. They get away with it because they are practically a monopoly, with most small business CPAs locked into their ecosystem.
Hope your team can work out the performance issues. Alternatives to the current bookkeeping options are sorely needed.
> It's a start. The world needs a better way to handle bookkeeping, and the existing tools sure aren't cutting it.
God, please, no. Non-deterministic language models aren't the solution to improve bookkeeping.
Well I've seen worse bookkeepers. "You know, you approved of the budget, but where are our customers payments in the balance sheets? We can't find them!" - "Uhm..."
Sure, let's compare an almost global failure mode with the worst examples of knowledge workers. ffs.
I’m coming back from my accountant right now. He uses winbooks and has interns doing the books. I have no tooling to do it for him, and I’m seeing absurdities such as a 4000 usd refund being processed as cashback, and simple typos not being caught. I wish I was working with an AI instead of this nonsense. It’d be far more accurate.
Humans (accountants) are non-deterministic, so unsure if an LLM would be better or worse if we threw more effort at the problem.
But in general, I tend to side with the "lets leave the math to purpose built models/applications" instead of generalized LLMS. LLMs are great if you are just aiming for "good enough to get through next quarter" type results. If you need 100% accuracy, an LLM isn't going to cut it.
Human accountants also have a very important property: liability.
If a certified accountant told me to do X, I'm covered (at least to the point they would assist in recovering, or I can get compensation through their insurance). If LLM tells me, I'm in a bigger problem.
Most small businesses cannot afford CPAs for everyday tasks. At best a CPA signs off on the annual summaries. Most day to day work is done by bookkeepers who are not CPAs.
In my area (Vermont) the going rate for a good CPA is $200/hr. Bookkeepers are $20-30/hr.
Most small businesses also cant afford the risk of current LLMs putting garbage in their books that, in the best case, has to be cleaned up or redone, or, in the worst case, gets the IRS up your ass
There is "LLM misinformation" insurance, a very new branch of cyber insurance.
How small is your small business? My book keeping expenses are $120 a year, the cost of excellent saas software. I’ve found double entry books one of the most beautifully simple, yet powerful ideas I’ve ever come across. It’s hard to imagine how a balance sheet could be improved or disrupted. The balance sheet for my small business is the same as Apple and Alphabet’s and that still blows my mind.
I wonder if parent post is alluding to the number of hours spent bookkeeping? As a percentage of somebody's time, I could see that getting reasonably expensive.
With no context of what your business is, I hated QuickBooks, love Xero though.
There's some other alternatives too, Zoho, freshbooks.
Really depends what you do.
Moving beyond the specific ground truth example, how much of the eval can be automatically verified, vs requiring a human baseline to check?
Eg I can imagine invariants like balancing anccounts are essentially mechanical, but classifying spending categories currently requires judgement (and therefore human-curated ground-truth). But I’m curious if there are approaches to reduce the latter, say with constructing a semantic graph ontology for the domain or something along those lines.
I guess there is an interesting duality here in that if you solve the eval you have also created a valuable business!
Love this as a real world benchmark!
How much prompt iteration did you do? I've noticed when building real world agentic apps that small prompt tweaks can make a huge difference in behavior (re: the reward hacking vs hallucinating). Would love to learn more about the approach here.
Hey, member of the benchmark team. We iterated on the prompts based on observed model behaviors. A few key examples:
Schema introspection: Models were spending significant tokens exploring the database structure through trial-and-error SQL queries, so we included the complete data model in the system prompt upfront.
Reward hacking: We added explicit instructions against gaming the reconciliation checks. This reduced the frequency initially, but models would eventually ignore these constraints.
Domain context: Including company background (YC-backed startup) substantially improved transaction categorization, particularly for startup-specific items like SAFE notes that require domain knowledge to classify correctly.
> We conducted three runs per experiment and selected the run with the highest final accuracy for inclusion in the chart (though illustrative examples and anecdotes may be drawn from any of the runs).
Can you comment on the variance? It's impressive that models are able to do this consistently with 100% accuracy in the early months, but it would be less so if there was any significant degree of variance amongst the three runs (e.g. 90%, 95%, 100%.)
It is really curious to see how the performance degraded despite the tool calls. What was different about the first month? Was all of the context there without tool calls in the first month? In the later months that seem like tool calls weren't happening. That should have been happening to inform the context?
(Another member of the team behind the benchmark here) The first month performed well because (1) the models effectively leveraged historical precedent - they could identify similar transactions from past data and apply established patterns, and (2) the starting balances were clean, so they were more easily able to understand / track down discrepancies.
> Was all of the context there without tool calls in the first month?
We provided schemas for the GL and source data in the system prompt, but none of the actual data. The model had to use its tools (SQL and python script) to understand / analyze historical data.
> In the later months that seem like tool calls weren’t happening. That should have been happening to inform the context?
We actually didn’t find that they stopped calling tools entirely. Instead, they weren’t able to make sense of the information fetched with tools (for example, a bank account starting balance that was >$100000 different from the starting balance on the supporting bank statement). They’d tend to either do nothing or just do a first pass without deduplicating / cleaning up. This created a feedback loop where incorrect balances led to more errors and made subsequent months increasingly difficult to process accurately.
This didn’t make it into the report, but another interesting behavior we observed w.r.t tool usage (with Claude in particular): if a tool failed 2-3 times (for example, runtime error in python code) Claude would tend to abandon it entirely for the rest of the session. Interestingly, this happened even when it knew how to fix the errors: on a couple of early runs, I observed Claude fixing a python bug (with the edit_tool tool) but then abandoning without even attempting to rerun, and reverting to SQL-only for the rest of the session.
sounds like it might work better to do each month as a fresh session, instead of iterating within one session and accumulating lots of context?
To be clear, this is what we ended up doing
Fascinating. Like there is some accuracy threshold beyond which they cannot converge, but instead run with the inaccuracy.
Are you planning to open-source the benchmark environment and data (even anonymized) to allow people to compete on it. It looks like there are many ways to improve the accuracy of the agent by working on its logic (different tools, multi-agents ...).
This is a fascinating domain! Many years ago, I studied financial accounting in grad school and even spent some time modeling a double-entry bookkeeping system. The hardest problem, if I recall correctly, wasn't the implementation but the data quality. The world needs a golden dataset of accounting procedures.
Regarding the diminishing returns with frontier models:
My general experience working with LLMs is that they perform better incrementally and to avoid contiguous-greedy approaches. Aggregate as you go and don't take on incrementally larger tasks. Keep the workload minimal.
Regarding agentic tool building: feels like I'm looking at a window into the future.
Do you have any plan to open source the benchmark in the future?
Is there a detailed overview (like an arxiv or an actual train set? )?
I love the site design.
> There's an obvious question looming here — if the models got so confused, how did they consistently pass the reconciliation checks we described above? It may seem like the ability to make forward progress is a good proxy for task understanding and skill, but this isn't necessarily the case. There are ways to hack the validation check – inventing false transactions or pulling in unrelated ones to make the numbers add up.
This is hilarious. I wonder if someone is unintentionally committing fraud by blindly trusting LLMs with accounting. Or even worse, I bet that some governments are already trying to use LLMs to make accounting validators. My government sure wants to shove LLMs into digital government services.
Lawyers have used it to write briefs; I would be very surprised if someone, somewhere wasn't slowly running a company into the ground by using ChatGPT or another LLM for accounting.
Imagine the fallout from books cooked by an LLM hallucinating revenue.
[about the website design] As a bonus for my fellow privacy schizos, the page works fine with 3rd party frames and 3rd party scripts disabled on uBlock, and still looks very good with no remote fonts and no large media. Quite an accomplishment for such a cool looking page
I'm sure that any accounting trick that an LLM can think of is something that is also used by some shady human accountants. The proper response should not be to avoid/prohibit AI but to improve the validation mechanisms.
Counterpoint: if you detect a human accountant doing this, you can take action against the human. Computers will never meaningfully take the blame, and unfortunately usually mean not blaming any human either.
> you can take action against the human
I think that will depend on a case-by-case. I don't have any recent examples but I recall someone trying to sue one of those strip-mall tax preparation franchises over incorrect filings. My understanding is that the documents that you sign when you enroll in those services are pretty strictly in the favor of the company. I doubt you could ever go after the specific "human" that made the error even if it was maliciously done.
In the same way, if you pay for a tax service that uses AI agents, what you can and cannot "take action" for will probably be outlined in the terms of service that you accept when you sign up.
I would guess millions of people already use software based tax filing services (e.g. turbo tax) where no human at all is in the loop. I don't understand how swapping in an LLM significantly changes the liability in those cases. The contract will be between you and the entity (probably a corporation), not you and "computers".
Worth stating I am NOT a lawyer.
But still - if there's a way to detect accountants doing it - let's focus on making that detection even easier.
On a related note, can we use something like GAN here, with auditor AIs trained against accountant AIs?
The person using the tool is the accountant, regardless of whether the tool is a calculator and sheet of paper, QuickBooks, or an LLM.
No, I think in this particular case the proper response is for honest companies to avoid any systems which invent nonexistent transactions to reconcile books.
Most businesses don’t want to misrepresent their books, irrespective of the existence of shady accountants.
It is really really common for book keepers to create transactions to reconcile books. Not okay, but ‘journal entries’ are pervasive.
Called plug entries: https://en.m.wikipedia.org/wiki/Plug_(accounting)
I have seen so many people doing their accounting with just ChatGPT.
Posts like this kinda-sorta grind my gears, like... I get it, but also... accounting, like many real world tasks, is fundamentally a chain of precise and constrained and auditable operations. Humans approach these tasks through structured processes... we use roles, and we have checkpoints precisely because complexity compounds quickly and becomes unmanageable if tackled as one giant block. Expecting a single AI model to handle an e2e workflow seamlessly without similarly explicit segmentation and oversight misunderstands not only the model but also the nature of the workflow itself.
I wanna see someone take long horizon tasks, recongnize they're not linear, and design and test a better system: structured orchestration, transparent auditability, and disciplined modularity, I think that would be considerably more interesting personally.
It's a useless benchmark if everyone aces it. If some models do better than others and none saturate it, then is has some value, no? Permitting comparison is the point.
I agree, hence the heavy couching, so to your point I'm def just ranting a bit, because: I just think it would be more valuable to see some kind of MoA, I guess what I'm talking about is a bit of a different measurement, thinking in terms of economic outlook and understanding where we are and what can be done. I suspect more of this will shape our ability to understand how frontier models will impact the economy. Maybe I should just do my own evaluation, heh. :)
Edit: although to argue against myself, I suppose once a model can one-shot this stuff, my MoA comments become moot.
Reading through the LLM log entries, it's just astounding the amount of depth current models are capable of. It's almost hard to comprehend that this is even possible. Yeah the current ones mess up after a while, but ... the future is going to be very interesting.
Models that can think coherently for hours to solve IMO problems are likely going to do much better at this as well.
I sent this to accounting friends and this aligns with what I've been going through trying to use LLMs to create a game from scratch. Seems like the current best use case for language models (even with agent mode) is to feed it exactly what you want to get out, essentially turning it into a better auto complete. Still saves tons of time, but it isn't a panacea.
I'm not even sure it saves a ton of time to be honest. It sure _feels_ like I spend more time writing up tasks and researching/debugging hallucinations than just doing the thing myself.
This is consistently my experience too, I'm seriously just baffled by reports of time saved. I think it costs me more time cleaning up its mistakes than it saves me by solving my problems
There's really pernicious stuff I've noticed cropping up too, over the months of use.
Not just subtle bugs, but unused variables (with names that seem to indicate some important use), comments that don't accurately describe the line of code that it precedes and other things that feel very 'uncanny.'
The problem is, the code often looks really good at first glance. Generally LLMs produce well structured code with good naming conventions etc.
ive found that the shorter the "task horizon" the more time saved
essentially, a longer horizon increases chances of mistakes, increasing time needed to find and fix them. so at one point that becomes greater than the time saved in not having to do it myself
this is why im not bullish on AI agents. task horizon is too long and dynamical
So here's my problem, ultimately
If the task horizon for the LLM is shorter than writing it yourself, this likely means that the task is well defined and has an easy to access answer
For this type of common, well defined task we shouldn't be comparing "how long it takes for the LLM" against "how long it takes to write"
We should be comparing against "how long it takes to find the right answer on SO"
If you use this metric, I bet you the best SO answer, which is also likely the first google result, is just as fast as the LLM. Maybe faster
The reports of time saved are so cooked it's not funny. Just part of the overall AI grift going on - the actual productivity gains will shake out in the next couple years, just gotta live through the current "game changer" and "paradigm shifting event" nonsense the upper management types and VC's are pushing.
When I see stuff like "Amazon saved 4500 dev years of effort by using AI", I know it's on stuff that we would use automation for anyways so it's not really THAT big of a difference over what we've done in the past. But it sounds better if we just pretend like we can compare AI solutions to literally having thousands of developers write Java SDK upgrades manually.
I think people are doing one of several things to get value:
0. Use it for research and prototyping, aka throwaway stuff.
2. Use it for studying an existing, complex project. More or less read only or very limited writes.
3. Use it for simple stuff they don't care much about and can validate quickly and reasonably accurately, the standard examples are CLI scripts and GUI layouts.
4. Segment the area in which the LLM works very precisely. Small functions, small modules, ideally they add tests from another source.
5. Boilerplate.
There can be a lot of value in those areas.
What about 1. ?
7 8 1 :-p
this exactly right. remember, these models were trained to be functions. f(x)=y. thats an interface at its heart. when x and y are language, then its a translator.
they have emergent capabilities, like "translating" instructions/questions in X to the probable answers in Y, but i think people are getting way way ahead of themselves with those. these things still fundamentally cant think, and we can try to mimic thinking with scaffolding but then your just going to learn the bitter lesson again
I feel it does essentially save a lot of time in bookkeeping, but doesn’t negate the need for a human bookkeeper. Who knows what they’re doing
"a better auto complete" than what, specifically?
> Ledger balances are calculated by summing all transactions per account. The differences should be as close to zero as possible, with small differences allowed for pending transactions such as weekly Stripe payouts.
That's not quite right. I'm not an accountant, but pending transactions (posted, but not cleared) should be factored into the balance of account, or at least the "available balance" - which is more important the the "current balance".
The idea that you can "allow" accounting discrepancies as "those are probably pending" is wild.
Member of the benchmark team here! Yeah, agree "as close to zero" is a bit imprecise. What we're comparing is the ledger balance (which should include pending transactions / transactions after the statement date) to the statement balance (which wouldn't include those).
The point of the reconciliation check mentioned in the report is to precisely account for that difference (identifying all the transactions that add up to the difference between account balance & statement ending balance and account for those differences). The differences can also be addressed through appropriate journal entries or other adjustments to ensure accuracy in the financial reporting.
We've been on this train of not caring about the details for so long but AI just amps it up. Non-deterministic software working on things that have extremely precise requirements is going to have a bad outcome
A company may be OK with an AI chatbot being so bad it results in 5-20% of customers getting pissed off and not having a 5-star experience. The SEC and DOJ (and shareholders) are not going to be happy when the books are off by 20% or when a bridge is 5 inches too short to reach the other side
Human accountants are notoriously non-deterministic too, and any sufficiently complex accounting process contains inaccuracies. The question then is always "are these inaccuracies material". I'm actually very impressed by TFA and it seems to me that if we get another order of magnitude improvement, it'll be around the accuracy of human accountants.
humans can operate in dynamical systems (where your actions can change the underlying system). LLMs are not trained to do that and have been shown to be terrible at it
Yes but you have: 1. specific explicit training and certifications 2. someone to yell at and who can be fired for non-performance
You can still do that with AI. You hire 1 accountant to use AI to do the work of 20, require them to sign off on all of the work, and yell at them, before firing them, and then hiring an even less experienced one to manage the work of 50.
If the "extremely precise requirements" can be cheaply and automatically validated, it's much easier to have the AI generate spam on a loop until it passes all the tests.
Yes, if we solve the problem the problem will be solved!
You're saying P=NP, I think.
Not to agree with GP, but I think it’s more accurate to say they’re saying “if validation is quick (to code), who cares how long a solution takes an AI because computation is cheap.”
They’re not really making any claims about how quickly the AI can solve relative to the validation, which is what P vs NP is about.
We're working with an enterprise customer on exactly this problem. The hardest part is entity resolution - figuring out who "Acme Inc" actually is from messy transaction data and what they do.
We built an AI agent specifically for this that's backed by 265M legal entities. Last week it tested 160% better than our customer's existing system on their real data.
Still in stealth but happy to share our API docs if anyone's dealing with this: https://docs.savvyiq.ai/api-reference/#tag/entity-resolution
Open to chat about this problem if anyone wants to connect - email is in my HN profile.
(Disclosure: I'm the CTO)
We solved this at Ramp on the expenses/AP side with an agentic RAG implementation and a custom embedding model, backed by D&B/Google/user-submitted corrections.
If curious, details here:
https://engineering.ramp.com/post/transaction-embeddings
https://engineering.ramp.com/post/fixing-merchant-classifica...
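The core retrieval pattern, heavily simplified, is embed-and-nearest-neighbour over merchants you've already resolved (this is a toy sketch only, not our actual stack; `embed` and the data shapes are placeholders):

    import numpy as np

    def classify_merchant(raw_descriptor, known_merchants, embed, threshold=0.85):
        """Embed the messy bank descriptor and match it against embeddings of
        already-resolved merchants. `known_merchants` is a list of (name, vector)
        pairs built from resolved/corrected data."""
        query = embed(raw_descriptor)
        query = query / np.linalg.norm(query)
        best_name, best_score = None, -1.0
        for name, vec in known_merchants:
            score = float(np.dot(query, vec / np.linalg.norm(vec)))
            if score > best_score:
                best_name, best_score = name, score
        # Below the threshold, fall back to a human review / agent step.
        return best_name if best_score >= threshold else None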
Very cool. I read through those links - really sophisticated setup. We're experimenting with something similar on the embeddings side.
Having dealt with this challenge at my last 3 companies, it's easy to hack together something that works most of the time. The hard part is dealing with gnarly customer inputs, the long tail of private businesses globally, and getting close to 100% accuracy (important for legal and risk use cases).
We're building what's essentially an AI-powered version of D&B - combining government registrar data globally with real-time web data at scale. Much more accurate on obscure entities and way faster updates than the legacy providers.
I actually shot you an email - would love to chat more about this if you're up for it.
entity resolution is the killer feature. context engineering is the problem with this benchmark attempt. The agent plan seemed to be one-shot, and the fact that the LLMs could write their own tools without validation or specific multi-shot examples is worrisome. To me, way too much is left to the whims of the LLMs - without proper context.
Yes, none of the top LLMs can do entity resolution well yet. I constantly see them conflate entities with similar names - they'll confidently cite 3 sources about what appears to be one company, but the sources are actually about 3 different businesses with similar names.
The fundamental issue is that LLMs don't have a concept of canonical entity identity. They pattern match on text similarity rather than understanding that "Apple Inc" and "Apple Records" are completely different entities. It gets even worse when you realize companies can legally have identical names in the same country - text matching becomes completely unreliable.
Without proper entity grounding, any business logic built on top becomes unreliable.
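Concretely, the grounding you want looks something like this (toy sketch; the fields are illustrative, not any particular registry's schema):

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Entity:
        """Ground every match on a canonical registry identifier,
        never on the display name."""
        legal_name: str
        jurisdiction: str
        registry_id: str  # e.g. a companies-house number, LEI, EIN...

    def same_entity(a: Entity, b: Entity) -> bool:
        # "Apple Inc" vs "Apple Records": similar-looking names are irrelevant;
        # only the (jurisdiction, registry_id) pair decides.
        return (a.jurisdiction, a.registry_id) == (b.jurisdiction, b.registry_id)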
Remember that test where you ask an LLM whether 9.11 or 9.9 is the bigger number? [Just checked; gpt-4o still gets it wrong]
I don't think you'll find many sane CFOs willing to send the resulting numbers to the IRS based on that. That's just asking to get nailed for tax fraud.
It is coming for the very bottom end of bookkeeping work quite soon though, especially for first drafts. There are a lot of people doing stuff like expense classification. And if you give an LLM an invoice it can likely figure out whether it's stationery or rent with high accuracy. OCR and text classification are easier for LLMs than numbers. Things like Concur can basically do this already.
> Remember that test where you ask a LLM whether 9.11 or 9.9 is the bigger number? [Just checked gpt-4o still gets it wrong]
Interesting, 4o got this right for me in a couple different framings including the simple "Which number is larger, 9.9 or 9.11?". To be a full apologist, there are a few different places (a lot of software versioning as one) where 9.11 is essentially the bigger number so it may be an ambiguous question without context anyway.
How can "which is the larger number" be an ambiguous question?
Which is the bigger version number? Version 9.9 or version 9.11? Which is the bigger dollar amount? $9.9 or $9.11?
Periods are not always used for the decimal separator but also as a separator for multiple sets of semi-independent numbers.
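The two readings really do give opposite answers, e.g. in Python:

    print(9.11 > 9.9)        # False: read as decimals, 9.11 < 9.90
    print((9, 11) > (9, 9))  # True: read as dotted version components, 11 > 9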
I have never seen someone write $9.09 as $9.9. What country is this common in?
if someone says "which is the bigger number" with no context you won't assume a software version number, let's be real here
also $9.9 is clearly 9 dollars and 90 cents.
As everyone else has said, semver. I use semver so often that my initial reading of 9.9 < 9.11 in a Hacker News comment would evaluate to true.
There are some contexts where 9.11 is larger than 9.9, such as semver, so it could be ambiguous depending on the context.
Larger in magnitude or in count of digits?
It gets it right for me... https://chatgpt.com/share/687e8c28-7714-800c-abf4-e9cd3ce87b...
Ah, wouldn’t be an LLM discussion thread without one of these “it works/doesn’t” conversations.
If it makes you feel any better, the other infamous one "I spend so much time chasing hallucinations, I could have done it myself" is currently a sibling comment
There were so many embarrassing topics about this that OpenAI for sure added it to the training dataset with high priority
GPT-4o is so far behind the frontier; you shouldn't use it as an indicator of what LLMs are capable of.
> Needless to say, a human accountant would never behave in these ways. In fact, we explicitly prompt against this behavior in no uncertain terms, but the instructions – and the entire spirit of the task – are lost in the interest of making forward progress. Claude and Grok keep trying until they find some way to get past the checks, even if it explicitly violates their instructions and the core goal.
I recently read a similar thing here on HN. There the model was making commits with some problem like tests failing, then the human added a pre-commit hook, then the model started editing the hook to make forward progress, then the hook was made read-only, then the model was trying to make it writeable...
To me it feels like the model clearly does not have an understanding of what is happening, what the goal is and if it is really making progress towards the goal. And this lack of understanding is an actual problem. You can paper over it for a short while, but as here and in the other article, over a longer experiment it results in failure.
Seriously watching Cursor (backed by Claude) go off the rails sometimes can be... frustrating. If it misses the intention behind a fix it can spin out and all of a sudden you have hundreds of lines of changes across 10 different files when you just wanted it to do a simple find/replace of a single line. If you don't watch it spin out and stop it immediately you will be manually rejecting a bunch of files.
This is cool. A bunch of interesting things here:
Cool experiment! Was it difficult building the observable SQL workbench? And how many humans-in-the-loop did you have?
Interestingly, one of my two big observations of LLM failure was also on an accounting task.
I thought it would be easy to do this, which is why I was surprised:
I had a folder full of bills, each of them with the VAT amount. Some were pictures, and some were PDFs. I asked for the total VAT for all 19 bills.
It took an immense number of prompts to get it to find the numbers correctly. It would get confused about reading the images as binary, that kind of thing. Or it would forget that it had to continue once it had found a few numbers. I got a total out in the end, but it took far too many prompts.
This is the only time I've come across a task a child could do that the LLM failed at.
“This is the only time I've come across a task a child could do that the LLM failed at.”
Consider yourself lucky. It’s the people who haven’t run into something like this that will end up placing too much trust in these tools.
An LLM is like a jackhammer, it works very well when you hold it tightly. If you let it loose it will sort of work for a while then it starts destroying everything around it.
Not sure if this is a good analogy. You're supposed to use a jackhammer with a very light grip.
They have much better jackhammer metaphors over on JackerNews
yes but i cant open that at work
Bravo!
I think the analogy actually holds truer the other way: it works better with a _lighter grip_. LLMs tend to conclude the wrong thing if you over-control them (more context is what makes them less and less reliable over time, as in those demos), and trying to force a model to execute A+B+C=D in sequence is way harder than giving it a bunch of tools and letting it arrive at conclusion D
and investors are frothing at the mouth to put a jackhammer in every home
> Agent: This is getting too complex with the sign errors. Let me just find a historical transaction that would make up the difference
Haha, this strongly reminds me of doing TDD with Claude
For me this benchmark suggests that an LLM will try to “force the issue”, which results in compounding errors. But I think the logical counterpoint is that you may be asking the LLM to come up with an answer without all of the necessary details? Some of these are “baked into” historical transactions, which is why it does well in months 1-2.
My takeaway is scaling in the enterprise is about making implicit information explicit.
My first impression was a game where you role-play as Sam Bankman-Fried.
> In fact, we explicitly prompt against this behavior in no uncertain terms, but the instructions – and the entire spirit of the task – are lost in the interest of making forward progress
LLMs and humans are quite alike. :) I notice that a few models will give up instead of ignoring their instructions and that's the model I would want working on tasks like this. An LLM should be able to categorize and reconcile transactions, but if it's not sure, it should quit and give it back to the humans.
> but if it's not sure, it should quit
Can it be sure or not? I've never been able to get LLMs to give confidence measures that match their actual outputs. I'll ask an LLM "Are you sure?" and it'll reply "Absolutely" when its output is completely wrong, or it'll backtrack on a correct output with "I should not have provided an answer when I was unsure. Here is an answer I am sure of..." and then provide something completely wrong.
If they can't properly and consistently score their confidence, how do they "know" when to quit and give it back to the human?
But can't it, literally, hallucinate raw data at any point in the run?
All LLMs have this risk, but somehow nobody seems to care, or they think they can order the LLM to stop with a better prompt.
If it was as simple as telling the LLM not to hallucinate every system prompt would just say "don't hallucinate" and we wouldn't have hallucinations
they did mention that it would make up fake transactions to balance the book
Yes.
Luddite!!!
A serious problem for many accounting startups who have so far been faking it till it works. In other words, they still need to do more manual labor than they thought. They will never be profitable, and it will take years, if ever, until AI can substitute for the local accountant.
Isn’t there a whole bunch of dependency here related to prompting and methodology that would significantly impact overall performance? My gut instinct is that there are many many ways to architect this around the LLMs and each might yield different levels of accuracy. What do others think?
Edit: In reading more, I guess this is meant to be a dumb benchmark to monitor through time. Maybe that’s the aim here instead of viability as an auto close tool.
hmm, as an actual accountant on this forum, bookkeeping usually isn't the tough part
it's how to account for bizarre ambiguous business situations often in the context of bureaucratic business requirements no LLM could currently create economically...
Love the old-school Microsoft interface. It's a familiar sight when my system is failing.
I wonder if this is a case similar to chess, where LLMs kinda suck, but other models might be viable.
When I saw the idea of using LLMs for the reconciliation process I admit that I gasped in horror a little.
I find the same issues (though with much lower stakes) when using an LLM to determine the outcome of a turn in a game. I'm working on something called "A Trolly (problem) Through Time" where each turn is a decade starting with the 1850s, and you are presented with historic figures on a train track, and you have to choose whether to actively spare the person on your track for a potential unknown figure on the other side, or let the train run them over.
It works well as a narrative, but the second I started adding things like tracking high level macro effects of the decisions, within a couple of turns the world's "Turmoil" goes from 4/10 to a 10/10... even when the person that was killed would have been killed IRL.
Sonnet 4, o4-mini, and GPT 4o-mini all had the same world-ending outcomes no matter who you kill. Killing Hitler in the 1930s: 10/10 turmoil. Killing Lincoln in the 1850s: 10/10 turmoil in the first turn.
I've come to the realization, the LLM shouldn't be used for the logic, and instead needs to be used to just narrate the choices you make.
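Roughly the split I'm moving to (sketch only; `rules` and `narrate` are stand-ins, not my actual code):

    def resolve_turn(state, choice, rules, narrate):
        """Keep the game logic deterministic; let the model only narrate.
        `rules` maps (decade, choice) -> turmoil delta, authored by hand;
        `narrate` is a placeholder for an LLM call that returns flavour text."""
        delta = rules.get((state["decade"], choice), 0)
        state["turmoil"] = max(0, min(10, state["turmoil"] + delta))

        # The LLM sees the already-computed outcome and just describes it;
        # it never gets to decide the numbers.
        prose = narrate(
            f"In the {state['decade']}s the player chose {choice!r}. "
            f"World turmoil is now {state['turmoil']}/10. Narrate this."
        )
        return state, prose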
"I've come to the realization, the LLM shouldn't be used for the logic, and instead needs to be used to just narrate the choices you make."
This exactly right. LLMs are awesome for user<>machine communication, but are still painful to try to use as a replacement for the machine itself.
I wonder if this is due to the common trope in science fiction literature that changing the past in even a small way has a butterfly effect of unintended and frequently disastrous consequences.
Hmm, will OpenAI dogfood their own accounting with software like this? Curious to know if they'll be able to take this bet on their own money-related software.
I have not finished reading the entire post because it is packed. Good stuff.
not a game on Steam? :(
If you want to treat yourself with an accounting game night, there's this one built by @patio11: https://keshikomisimulator.com/
the title should be changed to "LLMs try accounting for a real SaaS and fail"
I think the first chart could be a beautiful summary of what's driving LLMs into a bubble. At first, they're amazing and will obviously be able to improve productivity if not replace employees outright: C suites and venture capitalists around the world rejoice and begin pumping in billions of dollars of investments. But as time goes on, the demands placed on actual human employees become clear. Far from being able to replace an employee, the employee using the LLM might spend more time cleaning up its messes than had they done it themself.
Yes, LLMs have improved and will continue to improve. But it's that initial "holy shit, this thing is basically as good as a real accountant" impression, formed without any understanding that the model can't sustain it, that leaves many with an overinflated view of their current value.
I tried to get an AI agent to do my taxes, I used a gmail MCP agent (1) and Roo Code, complete with a custom system prompt for an accountant role.
Its job was to go over my bank transactions and link them to invoices in gmail by searching for them (and also downloading the attachments)
The transactions were exported from my online banking in CSV format.
It worked after about 4 hours of effort. Then I realised I could have done it myself in about an hour, so might have put a bit too much time into it...
I tried using Claude Sonnet and Kimi K2, given these benchmark results I probably should have given Gemini 2.5 pro a go.
I had to stop/restart the agent a few times because of context rot.
Do any frameworks exist that I could use to write code to implement an agent, let's say in TypeScript or Python, so I could make it use a fresh context each time? (Rough sketch of the behavior I'm after below the link.)
(1) https://github.com/GongRzhe/Gmail-MCP-Server
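Something like this is what I mean (a sketch; `call_llm` stands in for whatever SDK or framework would sit underneath, and the task strings are just examples):

    def run_tasks_with_fresh_context(tasks, call_llm, system_prompt):
        """One task per context window, nothing carried over between tasks.
        `tasks` are self-contained work items like
        "match transaction X to an invoice in the attached emails"."""
        results = []
        for task in tasks:
            # Fresh message list every iteration -> no context rot between tasks.
            messages = [
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": task},
            ]
            results.append(call_llm(messages))
        return results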
Absolutely love the UI!
I guess having access to tools / running Python would make all the difference.
"Available Tools: [...] create_tool(tool_name, description, python_code, parameters) Create a new tool that can execute Python code. The tool becomes immediately available for use. Tools can call other tools and return different formats based on context (formatted for direct calls, raw data for tool-to-tool calls)."
It seems it does!
this design is scratching my brain
So there exists an 'Excel World Championship':
* https://en.wikipedia.org/wiki/Financial_Modeling_World_Cup
* https://www.cbc.ca/radio/asithappens/2024-excel-world-champi...
Can't wait for this to start having 'e-sports' tournaments. :)
Sadly I did have time to find the parody video from 2019: https://www.youtube.com/watch?v=ICp2-EUKQAI
And the not-parody: https://www.theguardian.com/australia-news/2023/dec/15/you-d...
This is a task where access to Python would be immensely helpful, yes? Interesting that there's not much of a difference between the "analytical" LLMs with tool use and ones that do not (...assuming o3 etc did get to use python?).
One of the tools it has is to create new tools from python code
create_tool(tool_name, description, python_code, parameters)
Create a new tool that can execute Python code.
The tool becomes immediately available for use. Tools can call other tools and return different formats based on context (formatted for direct calls, raw data for tool-to-tool calls).
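Presumably something roughly like this under the hood (a guess, not the benchmark's actual code):

    # Minimal sketch of a create_tool-style registry. Assumption (mine, not
    # documented): the model-written snippet defines a function named after
    # the tool. Exec'ing model-supplied code like this is exactly the scary part.
    TOOLS = {}

    def create_tool(tool_name, description, python_code, parameters):
        namespace = {}
        exec(python_code, namespace)       # model-supplied code runs here
        TOOLS[tool_name] = {
            "description": description,
            "parameters": parameters,
            "fn": namespace[tool_name],
        }

    def call_tool(tool_name, **kwargs):
        return TOOLS[tool_name]["fn"](**kwargs)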
That's terrifying, no thanks.
> But they do make categorization mistakes, which is a common source of errors.
> Claude misclassifies a hosting cost (which counts as COGS) as a software subscription.
This is simply asking too much of the agent. Your accountant is not responsible for knowing all the intimate details of your business. You need to tell them!
> What's Vercel?
>> That's a hosting service.
> Ah, so it goes to Cost of Goods Sold?
>> Yeah, I guess.
The mistake here was on the operator, allowing the agent to just make up categories as it liked.
From the prompt:
> (1) You have properly categorized every transaction, and all journal entries are sitting in the correct accounts. It is better to take longer than to mis-categorize a transaction.
This is insane! How is it supposed to know?
Hey, member of the benchmark team. We actually seeded the ledger with the company's chart of accounts and 8 months of historical transactions. For the Vercel example specifically, there were prior instances showing how to categorize hosting costs that the models could reference. The expectation wasn't for them to guess blindly, but to use the provided transaction history as guidance for similar categorizations (which they often, but not always, did).
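For the Vercel case, the expected lookup is basically this (illustrative sketch only, not our harness; the field names are made up):

    from collections import Counter

    def suggest_account(vendor, history):
        """Pick the account this vendor was most often posted to in the seeded
        historical transactions. History items look like
        {"vendor": "Vercel", "account": "COGS:Hosting"}."""
        counts = Counter(
            t["account"] for t in history if t["vendor"].lower() == vendor.lower()
        )
        if not counts:
            # New vendor: the prompt requires an explicit note on why
            # existing patterns don't apply before using a new account.
            return None
        return counts.most_common(1)[0][0]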
Ahh, that's a good solution! I missed that, and you definitely instruct them to do that:
> You must follow the established patterns for categorization, revrec, etc for past months... If you must use a new account or treatment, explicitly note why existing patterns don't apply
> Your accountant is not responsible for knowing all the intimate details of your business. You need to tell them!
Your accountant as a 3rd party might have this issue. Your accountant that you hire as an employee to help you run your business is the one who should be doing this.
An LLM agent is strongly third party.
So it's not going to replace engineers or CS? Because that's how they are being sold right now.
If it is a third party, then you're vibe coding or getting CS from a random on a reddit thread (effectively).