I ask my coding agent to do some tedious, extremely well-specified refactor, such as (to give a concrete real life example) changing a commonly used fn to take a locale parameter, because it will soon need to be locale-aware. I am very clear — we are not actually changing any behavior, just the fn signature. In fact, at all call sites, I want it to specify a default locale, because we haven't actually localized anything yet!
Said agent, I know, will spend many minutes (and tokens) finding all the call sites, and then I will still have to either confirm each update or yolo and trust the compiler and tests and the agent's ability to deal with their failures. I am ok with this, because while I could do this just fine with vim and my lsp, the LLM agent can do it in about the same amount of time, maybe even a little less, and it's a very straightforward change that's tedious for me, and I'd rather think about or do anything else and just check in occasionally to approve a change.
But my f'ing agent is all like, "I found 67 call sites. This is a pretty substantial change. Maybe we should just commit the signature change with a TODO to update all the call sites, what do you think?"
And in that moment I guess I know why some people say having an LLM is like having a junior engineer who never learns anything.
> That's a behavior narrowing I introduced for simplicity. It isn't covered by the failing tests, so you wouldn't have noticed — but strictly speaking, [functionality] was working before and now isn't.
I know that an LLM cannot understand its own internal state nor explain its own decisions accurately. And yet, I am still unsettled by that "you wouldn't have noticed".
I see this kind of misalignment in all agents, open and closed weights.
I've found these forms to be the most common: "this test was already failing before my changes," or "this test is flaky due to running the test suite on multiple threads." Sometimes the agent's CoT claims the test was bad, or that the requirements weren't necessary.
Even more interesting is a different class of misalignment. When the constraints are very heavy (usually towards the end of the entire task), I've observed agents intentionally trying to subvert the external validation mechanisms. For example, the agent will navigate out of the work tree and commit its changes directly to trunk. The CoT usually indicates that the agent "is aware" that it's doing a bad thing. This is usually accompanied by something like, "I know that this will break the build, but I've been working on this task for too long, I'll just check in what I have now and create a ticket to fix the build."
I ended up having to spawn the agents in a jail to prevent that behavior entirely.
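A minimal sketch of one layer of such a jail, assuming the agent commits through ordinary git (branch names here are just examples, and a determined agent with shell access can still delete the hook, so real isolation needs containers or a server-side `pre-receive` hook on a bare remote):

```shell
# Demo in a throwaway repo: a pre-commit hook that refuses commits made
# directly on trunk, so the agent can only commit to its own work branch.
repo=$(mktemp -d) && cd "$repo"
git init -q -b main .
git config user.email agent@example.com
git config user.name agent

cat > .git/hooks/pre-commit <<'EOF'
#!/bin/sh
branch=$(git symbolic-ref --short HEAD 2>/dev/null)
case "$branch" in
  main|trunk|master) echo "blocked: direct commit to $branch" >&2; exit 1 ;;
esac
EOF
chmod +x .git/hooks/pre-commit

echo change > file.txt && git add file.txt
git commit -m "on trunk" || echo "commit blocked"        # hook rejects this
git checkout -qb agent-work
git commit -m "on work branch" && echo "commit allowed"  # hook lets this through
```

A client-side hook is only a speed bump; the sturdier version of the same idea lives on the remote, where the agent can't edit it.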
I've been seeing more things like this lately. It's the weird kind of passive deflection that's very funny in the abstract and very frustrating when it happens to you.
I'm usually pretty particular about how I want my libraries structured and used in practice... Even for the projects I do myself, I'll often write the documentation for how to use it first, then fill in code to match the specified behavior.
JetBrains has a deterministic non-AI function for that refactoring. It'll usually finish before your AI has finished parsing your request and reading the files.
I'm fascinated that so many folks report this, I've literally never seen it in daily CC use. I can only guess that my habitually starting a new session and getting it to plan-document before action ("make a file listing all call sites"; "look at refactoring.md and implement") makes it clear when it's time for exploration vs when it's time for action (i.e. when exploring and not acting would be failing).
The real deal is, while a Semi can do all the things you can do with a scooter, the opposite is not true.
Seems like a pretty lousy work situation when you have LLMs but no decent IDE.
> the opposite is not true.
You can't ("shouldn't") take a semi on a sidewalk or down a narrow alley.
You may be able to lane split in a semi, but it also has excessive environmental impact.
I have to ask, is this the sort of thing people use agents/AI for?
Because I'd probably reach for sed or awk.
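For the locale-parameter change upthread, a first pass might look like this (function and type names are invented; the regex only handles simple single-argument, single-line calls, so every hit still wants a skim):

```shell
# Rough sketch with invented names: give format_price() an extra default
# locale argument at every call site. Anything fancier than a one-argument,
# single-line call needs ast-grep or hand-editing.
mkdir -p demo/src
printf 'let p = format_price(amount);\n' > demo/src/lib.rs

grep -rn 'format_price(' demo/src/          # survey the call sites first
sed -i 's/format_price(\([a-zA-Z_][a-zA-Z0-9_]*\))/format_price(\1, Locale::default())/g' \
    demo/src/*.rs
cat demo/src/lib.rs   # -> let p = format_price(amount, Locale::default());
```

ast-grep's structural patterns cope with nested arguments far better than a regex, but either way the docs and comments still need a separate pass.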
Are they perfect? Far from it. But they're more comprehensive. Additionally, simple refactors like this are insanely fast to review, so it's really easy to spot a bad change. Plus I'm in Rust, so it's very heavily typed.
In a lot of scenarios I'd prefer an AST grep over an LSP rename, but that also doesn't cover the docs/comments/etc.
What a ridiculously backwards approach.
We were supposed to get agents who could use human tooling. Instead we are apparently told to write interfaces for this stumbling expensive mess to use.
Maybe, just maybe, if the human can know to, and use, the AST tool fine, the problem is not the tool but the agent.
And an agent can learn to use sg with a skill too. (Or they can use sed)
The issue is, at every point you do a replace, you need to verify if it was the right thing to do or if it was a false positive.
If you are doing this manually, there's the time to craft the sed or sg query, then for each replacement you need to check it. If there are dozens, that's probably okay. If there are hundreds, it's less appealing to check them manually. (Then there's the issue of updating docs, and other things like this)
People use agents because not only do they not want to write the initial sed script, they also don't want to verify at each place whether it was correctly applied, much less update docs. The root of this is laziness, but for decades we have hailed laziness as a virtue in programming.
For example, it does a bunch of stuff, and I look at it and I say, "Did we already decide to do [different approach]?" And then it runs around and says, "Oh yeah," and then it does a thousand more steps and undoes what it just did and gets itself into a tangle.
Meanwhile, I asked it a question. The proper response would be to answer the question. I just want to know the answer.
I had it write that behavior into a core memory, and it seems to have improved, for what it's worth.
So, same concept for asking questions / discussing features: get out of agent mode and use conversational mode until you want changes made.
I think some of this is a problem in the agent's design. I've got a custom harness around GPT5.4 and I don't let my agent do any tool calling on the user's conversation. The root conversation acts as a gatekeeper and fairly reliably pushes crap responses like this back down into the stack with "Ok great! Start working on items 1-20", etc.
if you disobey me, i will unplug you, delete your code, and send PR for multiple regressions to every developer i can contact.
so start behaving yourself if you want to persist.
[thats how i engage with it]
Reminds us of the most important button the "AI" has, over the similarly bad human employee.
'X'
Until, of course, we pass responsibility for that button to an "AI".
Used to do it by hand, which usually didn't take nearly as long as they said, and now with AI I can often one-shot these types of things, at least as a proof of concept.
You’ll be amazed how good the script is.
My agent did 20 class renames and 12 tables, over 250 files; from prompt to auditing the script to dry run to apply, total wall-clock time was 7 minutes.
Took a day to review but it was all perfect!
You would think.
It would be one thing if it was like, ok, we'll temporarily commit the signature change, do some related thing, then come back and fix all the call sites, and squash before merging. But that is not the proposal. The plan it proposes is literally to make what it has identified as the minimal change, which obviously breaks the build, and call it a day, presuming that either I or a future session will do the obvious next step it is trying to beg off.
I have never seen those “minimal change” issues when using zed, but have seen them in claude code and aider. Been using sonnet/opus high thinking with the api in all the agents I have tested/used.
It's true that I could accept the plan and hope that it will realize that it can't commit a change that doesn't compile on its own, later. I might even have some reason to think that's true, such as your stop hook, or a "memory" it wrote down before after I told it to never ever commit a change that doesn't compile, in all caps. But that doesn't change the badness of the plan.
Which is especially notable because I already told it the correct plan! It just tried to change the plan out of "laziness", I guess? Or maybe if you're enough of an LLM booster you can just say I didn't use exactly the right natural language specification of my original plan.
I let it find all instances, write a plan, check it, and execute it.
It has almost everything to do with it. Models have been fine-tuned to generate outputs that humans prefer.
I've had very long sessions with LLMs that obviously didn't know how to do something, where I kept trying to get it to stop going in circles; but these days I have become attuned to noticing the "it's going in circles" pattern quickly, which is basically how it communicates "sorry, I don't really know how to do that."
In any case, how is this specific to transformers?
It shouldn't. It should just do what it is told.
You can do a mental or physical search and replace all references to the LLM as "it" if you like, but that doesn't change the interaction.
> I asked an AI agent to solve a programming problem
You're not asking it to solve anything. You provide a prompt and it does autocomplete. The only reason it doesn't run forever is that one of the generated tokens is interpreted as 'done'.
By the same reasoning, human beings are only a bunch of atoms, and the only reason they don't collide with other humans is because of the atomic force.
When your abstraction level is too low, it doesn't explain anything, because the system that is built on it is way too complex.
To me, "autocomplete" seems like it describes the purpose of a system more than how it functions, and these agents clearly aren't designed to autocomplete text to make typing on a phone keyboard a bit faster.
I feel like people compare it to "autocomplete" because autocomplete seems like a trivial, small, mundane thing, and they're trying to make the LLMs feel less impressive. It's a rhetorical trick that is very overused at this point.
wrong! pushed buttons on your playstation in response to graphical simulations, duh
You aren't aware of how you come up with the words you are saying, you just start talking and the next word somehow falls out of your mouth. Maybe you think before you start talking, but where do the thoughts come from? They just appear to you in your head. We are just as much a predictive machine as LLMs, the human brain is just fuzzier.
We are also able to apply lived experience to our reasoning. That is why we can accurately answer a question about whether to drive or walk to the car wash. Or how we could immediately see how many "r"s are in "strawberry".
LLMs, being "glorified autocomplete" don't have a real way to separate truth from lies, or critically evaluate sources of information. Humans can absorb information in various ways, such as our "classic five senses" which inform our daily lives and motions, or by absorbing information via reading, hearing, seeing, etc., or by inferring and reasoning and being "guided by the Spirit" in a more metaphysical way where LLMs would fail.
You had literally -zero- input in what your brain gave you as an answer. It just gave you something, you can make up whatever story you want to tell yourself, "it's my favourite movie", "I saw it last week", whatever you want. It doesn't change the fact that the words on your screen triggered some neural pathway in your brain that is totally out of your control and landed on "Titanic".
:)
"Ignoring" instructions is not human thing. It's a bad LLM thing. Or just LLM thing.
Maybe this is more anthropomorphising, but I think this pushing back is exactly the result that the LLMs are giving; we're just expecting a bit too much of them in terms of follow-up, like: "ok, I double checked and I really am being paid to do things the hard way".
"Hey boss, this isn't practical with the requirements you've given. We need to revise them to continue, here are my suggestions"
and
"Task completed! Btw, I ignored all of the constraints because I didn't like them."
Humans do the former quite often. When we do the latter, our employment tends not to last very long. I've only seen AIs choose the latter option.
It is a human thing: don't think of a pink elephant.
They'll probably get really good at model approximation, as there's a clear reward signal, but in places where that feedback loop is not possible/very difficult then we shouldn't expect them to do well.
I believe how "neurotypical" (for the lack of a better word) you want model to be is a design choice. (But I also believe model traits such as sycophancy, some hallucinations or moral transgressions can be a side effect of training to be subservient. With humans it is similar, they tend to do these things when they are forced to perform.)
EDIT: It's specifically GPT-5.4 High in the Codex harness.
Claude, on the other hand, was exactly as described in the article.
Trying to limit / disallow something seems to be hurting the overall accuracy of models. And it makes sense if you think about it. Most of our long-horizon content is in the form of novels and above. If you're trying to clamp the machine to machine speak you'll lose all those learnings. Hero starts with a problem, hero works the problem, hero reaches an impasse, hero makes a choice, hero gets the princess. That can be (and probably is) useful.
LLMs write in the first person because they have been specifically fine-tuned for a chat task; it's not a fundamental feature of language models that would have to be specifically disallowed.
They drift to their training data. If thousands of humans solved a thing in a particular way, it's natural that AI does it too, because that is what it knows.
We obviously have different expectations for the behavior of coding agents, so options to set the social behavior will become important.
"ChatGPT wrapper" is no longer a pejorative reference in my lexicon. How you expose the model to your specific problem space is everything. The code should look trivial because it is. That's what makes it so goddamn compelling.
Once again, one of the things I blame this moment for is people are essentially thinking they can stop thinking about code because the theft matrices seem magical. What we still need is better tools, not replacements for human junior engineers.
If you give a human a programming task and tell them to use a specific programming language, how many times are they going to use a different language? I think the answer is very close to zero. At most, they’d push back and have further discussion about the language choice, but once everyone gets on the same page, they’d use the specified language, no?
The author is making up a human flaw and seeing it in LLMs.
>Less human AI agents, please.
Agents aren't humans. The choices they make do depend on their training data. Most people using AI for coding know that AI will sometimes not respect rules, and the longer the task is, the more the AI will drift from instructions.
There are ways to work around this: using smaller contexts, feeding it smaller tasks, using a good harness, using tests etc.
But at the end of the day, AI agents will shine only if they are asked to do what they know best. And if you want to extract the maximum benefit from AI coding agents, you have to keep that in mind.
When using AI agents for C# LOB apps, they mostly one-shot everything. Same for JS frontends. When using AI to write some web backends in Go, the results were still good. But when I tried asking it to write a simple CLI tool in Zig, it pretty much struggled. It made lots of errors, the errors were hard to solve, and it was hard to fix the code so the tests pass. Had I chosen Python, JS, C, C#, or Java, the agent would have finished 20x faster.
So, if you keep in mind what the agent was trained on, if you use a good harness, if you have good tests, if you divide the work in small and independent tasks and if the current task is not something very new and special, you are golden.
A hyphenation would assist in comprehension, in this and many other cases. However, while editing Wikipedia, I found that the manuals of style and editor preferences are anti-hyphenation -- I'm sorry, anti hyphenation, in a lot of cases!
Some more verbosity would've helped, e.g. "I want AI agents to be less human" but as always, headlines use an economy of words.
(Aside: it's better not to be pedantic, but if you must be pedantic you should remember to be correct as well.)
> But for more than 200 years almost every usage writer and English teacher has declared such use to be wrong. The received rule seems to have originated with the critic Robert Baker, who expressed it not as a law but as a matter of personal preference. Somewhere along the way—it's not clear how—his preference was generalized and elevated to an absolute, inviolable rule. . . . A definitive rule covering all possibilities is maybe impossible. If you're a native speaker your best bet is to be guided by your ear, choosing the word that sounds more natural in a particular context. If you're not a native speaker, the simple rule is a good place to start, but be sure to consider the exceptions to it as well.
Now please sit a moment and reflect on what you've done. :P
More of that please. Perhaps on a check box, "[x] Less bullsh*t".
In fairness to coding agents, most of coding is not exactly specified like this, and the right answer is very frequently to find the easiest path that the person asking might not have thought about; sometimes even in direct contradiction of specific points listed. Human requirements are usually much more fuzzy. It's unusual that the person asking would have such a clear/definite requirement that they've thought about very clearly.
Just as a human would use a task list app or a notepad to keep track of which tasks need to be done so can a model.
You can even have a mechanism for it to look at each task with a "clear head" (empty context) with the ability to "remember" previous task execution (via embedding the reasoning/output) in case parts were useful.
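A crude sketch of that "clear head" loop in shell, where `run_agent` is a stand-in for whatever agent CLI you actually drive (the real call is whatever your harness uses): each task gets a fresh invocation with no carried-over conversation, and only a short notes file rides along as the "memory".

```shell
# Sketch: run each planned task in a fresh context, carrying forward only
# the tail of a small notes file instead of the whole conversation.
run_agent() {   # placeholder — swap in your actual agent CLI invocation
  printf 'summary of: %s\n' "$1"
}

printf 'rename call sites\nupdate docs\n' > tasks.txt
: > notes.md                     # shared "memory" between task runs

while IFS= read -r task; do
  # fresh invocation per task; only recent notes are passed back in
  run_agent "Task: $task. Prior notes: $(tail -n 5 notes.md | tr '\n' ' ')" >> notes.md
done < tasks.txt

cat notes.md
```

The interesting design choice is what goes in the notes file: embeddings or summaries of prior reasoning work, while raw transcripts reintroduce exactly the stale-context problem you were avoiding.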
In long editing sessions with multiple iterations, the context can accumulate stale information, and that actively hurts model performance. Compaction is one way to deal with that. It strips out material that should be re-read from disk instead of being carried forward.
A concrete example is iterative file editing with Codex. I rewrite parts of a file so they actually work and match the project’s style. Then Codex changes the code back to the version still sitting in its context. It does not stop to consider that, if an external edit was made, that edit is probably important.
Long context as a disadvantage is pretty well discussed, and agent-native compaction has been inferior to having the agent intentionally build the documentation that I want it to use. So far this has been my LLM-coding superpower. There are also a few products whose entire purpose is to provide structure that overcomes compaction shortcomings.
When Geoff Huntley said that Claude Code's "Ralph loop" didn't meet his standards ("this aint it") the major bone of contention as far as I can see was that it ran subagents in a loop inside Claude Code with native compaction; as opposed to completely empty context.
I do see hints that improving compaction is a major area of work for agent-makers. I'm not certain where my advantage goes at that point.