Building reliable agentic AI systems

195 points by sarangk90 4 days ago|50 comments

•

BodyCulture 3 days ago

Most important piece of information is in the linked Frontiers article:

   However, the overall capability of the chatbot to fully meet user needs received a lower average score (3.1/5.0), highlighting the need for further improvements.

Also there is still the problem of hallucinations, as we see in the „Evaluation“ paragraph:

   Live traffic evaluations are essential for monitoring system behavior, identifying potential issues like hallucinations in production, and understanding performance on diverse live queries.

These are quite devastating results. This is a system for scientific research on medicines, and mediocrity and hallucinations will kill people.

Would be interesting to know how much money was flushed down the toilet with these experts.

•

sarangk90 3 days ago

Author here. A couple of things worth clarifying.

The 3.1/5.0 score in the Frontiers paper is a user satisfaction rating on feature completeness. Researchers were asked how well the system met all of their needs, including features that simply didn't exist yet at that point. It's a product maturity signal, not an accuracy or reliability number. The paper is also about a year old and the system has moved on significantly since.

On hallucinations, I'd push back on the framing a bit. The fact that we monitor for hallucinations isn't an admission that the system is hallucinating undetected. It's the opposite. Every sentence in the response is linked back to the exact page and verbatim quote from the source document, so a researcher can verify any claim in one click. We also run faithfulness scoring on live traffic every single day using RAGAS, so if the system starts drifting we catch it fast, not at some quarterly review.

And for the regulatory document drafting use case, every output is explicitly reviewed and approved by a qualified scientist before it goes anywhere. The system drafts, the human decides. That's not incidental; it's a design constraint baked into the architecture.

No LLM eliminates hallucination entirely. That's just the reality of the technology right now. So the engineering question becomes: how do you make it as unlikely as possible, and when it does happen, how fast do you catch it? That's what the retrieval pipeline, the reflection agent, the citations, and the daily evals are all doing. It's not a perfect answer, but it's a serious one.
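
As a rough illustration of the citation check described above (each sentence tied to a page and a verbatim quote), here is a minimal Python sketch. The `Citation` type, `verify_citations`, and the `pages` dictionary are hypothetical names invented for this example, and the page text is made up; this is not the system's actual implementation, and a real pipeline would need to handle more extraction quirks than whitespace and case.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Citation:
    """Hypothetical record tying one answer sentence to its source."""
    sentence: str
    page: int
    quote: str


def _normalize(text: str) -> str:
    # Collapse whitespace and case so line breaks in the source don't cause false misses.
    return " ".join(text.split()).lower()


def verify_citations(citations, pages):
    """Return the citations whose quote is NOT found on the cited page.

    `pages` maps page number -> extracted page text (an assumption of this sketch).
    """
    failures = []
    for c in citations:
        page_text = pages.get(c.page, "")
        if not c.quote.strip() or _normalize(c.quote) not in _normalize(page_text):
            failures.append(c)
    return failures


# Made-up example data, for illustration only.
pages = {
    3: "The compound was well tolerated\nin the phase I cohort.",
    7: "No dose-limiting toxicity was observed.",
}
citations = [
    Citation("The drug was well tolerated.", 3, "well tolerated in the phase I cohort"),
    Citation("Efficacy was 90%.", 7, "efficacy reached 90%"),
]
unsupported = verify_citations(citations, pages)
```

A check like this only confirms that the quote exists on the page; whether the sentence is actually supported by the quote is a separate question that needs faithfulness scoring or human review.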

•

cocoa19 3 days ago

> On hallucinations, I'd push back on the framing a bit.

> It's not a perfect answer, but it's a serious one.

Thanks Claude!

•

OkWing99 3 days ago

At this point, the papers, reviews and defense are all done by bots.

From the paper, "we collected feedback from 15 to 20 frequent users". Is it 15 or 20?

A lot of interpretive claims; it reads more like a marketing case study.

•

cadamsdotcom 3 days ago

> That's not (thing); it's (other thing).

Thank you for answering the commenter’s question, your answer was informative and helpful.

However it currently reads like you generated it. Which is a surprise, when you’re trying to foster trust in the quality of your process.

•

andrew_lettuce 3 days ago

It seems weird to ask users about feature completeness, especially regarding a new system. By definition, if you are hitting a valuable use case you WON'T be feature complete, and this all assumes users can even determine functional boundaries or useful features. I would have expected better from an organization that positions itself as an expert at guiding software development, but I guess they're a consultancy first and foremost.

•

bob1029 4 days ago

The most important part is the database that the agent can see and how clean the data is. I pitched a custom enterprise agent to a client thinking it would be maybe 50/50 time on data vs agent tuning, but it's more like 99/1.

The alignment process goes very quickly once you have all the fish in exactly one barrel. I think pulling data dynamically from the source systems is where this turns into a game of whack-a-mole.

The problem with dynamic fetch is that you don't get any kind of persistent or compounding gains. There are queries that you simply cannot run because you'd chew through your API quotas for GitHub et al. It takes over 48h to fully hydrate the database for GitHub items on my current project. But once that process is complete, I can query across things like issue comments and do crosscutting joins with the state of other vendor systems in milliseconds.

I am finding the MSSQL dialect to be quite agreeable to the OAI models. With absolutely no prompting they will bootstrap off information schema and extended description properties every single time. If you design the schema for your audience, the amount of "Jesus prompting" you will require is much better controlled.

•

dbuxton 4 days ago

The problem with this is that you have a consistency problem if you want to take action. The only way of making your agents read-write rather than read-only in practice is to use the underlying systems rather than try and pool information in a data lake.

But that does make it more complex to build simple information retrieval use cases.

•

bob1029 4 days ago

You definitely need multiple paths. It's not one vs the other. The data warehouse is immutable to the agent. There are separate tools that handle any updates to business state.
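
A minimal sketch of that split, using SQLite as a stand-in for the warehouse (the thread is about MSSQL, so this is only an analogy). The table, the `query_tool` and `close_issue_tool` names, and the sample row are all hypothetical; the write tool is a placeholder for whatever API the source system exposes.

```python
import os
import sqlite3
import tempfile

# Hypothetical stand-in for the warehouse: a throwaway SQLite file.
db_path = os.path.join(tempfile.mkdtemp(), "warehouse.db")

# The ingestion process owns the only writable connection.
loader = sqlite3.connect(db_path)
loader.execute("CREATE TABLE issues (id INTEGER PRIMARY KEY, title TEXT, state TEXT)")
loader.execute("INSERT INTO issues (title, state) VALUES ('Login fails', 'open')")
loader.commit()
loader.close()

# The agent's query tool opens the same file read-only via a SQLite URI.
agent_conn = sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)


def query_tool(sql: str):
    """Read path exposed to the agent: runs SQL on the read-only connection."""
    return agent_conn.execute(sql).fetchall()


def close_issue_tool(issue_id: int) -> str:
    """Write path: would call the source system's API, never the warehouse.

    Placeholder only; the real call depends on the vendor system.
    """
    return f"requested close of issue {issue_id} via source system"


rows = query_tool("SELECT title, state FROM issues")

try:
    query_tool("UPDATE issues SET state = 'closed'")
    write_blocked = False
except sqlite3.OperationalError:
    # SQLite rejects writes on a mode=ro connection.
    write_blocked = True
```

Enforcing read-only at the connection level, rather than in the prompt, means a bad generated query fails loudly instead of silently changing state. On a server database the equivalent would typically be a login with read permissions only.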

•

cpinto 4 days ago

I’m trying to solve this problem and would love to get some feedback: https://entidade.wls-labs.com

•

bingemaker 4 days ago

I have a question: how does connecting an agent directly to the DB work in a multi-tenant system? There is a high chance that the agent can snoop into multiple tenants and mess up the responses.

•

bob1029 3 days ago

I think this mostly depends on your business model.

In my client's business, the idea of having all their customer knowledge contained in one global scope is a fantasy, not a fear.

I suppose if you were granting access to users outside the business that this could become a concern, but I haven't encountered anyone who is interested in that yet.

•

shepherdjerred 3 days ago

Row level security sounds reasonable. Otherwise I don’t see how full DB access can be safe.

•

CuriouslyC 4 days ago

With postgres you can use schemas to keep tenants separate and use RLS on shared data.

•

hilariously 4 days ago

That's funny, because I have worked with hundreds and hundreds of T-SQL products, and only those built mostly in the last century have extended attributes (documenting each object in the database itself). It's nice but clunky, and few things support it.

•

smallnix 4 days ago

What was the main driver for a dynamic workflow with loops vs a rigid, forward-only workflow? The non-deterministic nature of these loops with LLM decision points doesn't mesh well with the transparency requirement imho.

•

stevex 4 days ago

You can almost tell the "era" that a solution was built in these days since things are changing so fast.

Mid-2026, we have very large context windows, and much smarter models than we did in 2024 when this was built. If I were to tackle this today I'd ask a current frontier model to work through the source data and design a hierarchy that would give it the ability to sift through the content itself by drilling down as it sees fit, and I expect it would nail that.

•

dominotw 3 days ago

I can almost tell, then, that you have not done anything like this at production scale. Context window size is irrelevant.

•

amw-zero 3 days ago

It would not, and you would know that if you actually evaluated the results.

•

agentdev001 3 days ago

I have gone through this process and evaluated the results. Maybe you're referring to their comment as written, but going through what OC described + handholding leads to very good results in my experience.

•

sarangk90 3 days ago

I agree with you agentdev! Here, if you want accurate results, you need to have a harness in place to control the quality of the output.

•

dominotw 3 days ago

"very good" 99 percent of time and hallucinating 1 percent makes the "very good" part untrustworthy.

•

agentdev001 3 days ago

The "Very good" I'm referring to is far better than only 99%. I can't offer solid stats off the top sadly, so you'll have to just take my word for it ;)

I'll take the opportunity to note that if you're running solid evals, you'll have data to back the efficacy of your system. If you are seeing a hallucination rate of 1%, then you certainly should be working on your harness/toolset/context/prompting etc.

Saying "1% hallucination rate..." is akin to saying "30,000mi lifespan for [modern japanese make engine]". Something is wrong.
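
One way to put numbers on "data to back the efficacy of your system" is a confidence interval on the observed hallucination rate. The sketch below uses the standard Wilson score interval; the sample sizes are hypothetical, and it assumes each eval sample is an independent pass/fail judgment, which real eval sets only approximate.

```python
import math


def wilson_interval(failures: int, n: int, z: float = 1.96):
    """Wilson score interval for a failure proportion (z=1.96 is roughly 95% confidence)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = failures / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# Hypothetical eval runs: zero flagged answers out of 100, then out of 1000.
low_100, high_100 = wilson_interval(0, 100)
low_1000, high_1000 = wilson_interval(0, 1000)
```

With zero hallucinations in 100 samples the upper bound is still about 3.7%, so that run cannot distinguish "far better than 99%" from a 1% failure rate. With zero in 1000 samples the upper bound drops to about 0.38%.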

•

AJRF 4 days ago

Two paragraph section on Evaluation after 30 paragraphs explaining the most bog standard rag system you've ever heard of.

Hmm...

•

manipalite 4 days ago

Yeah that's what we're realising in building pharma specific solutions for clients, high quality eval dataset and automated evals integrated in the CI/CD process seem to be the differentiators

•

altmanaltman 4 days ago

> The author used AI assistance during the writing of this article. AI tools were used for brainstorming ideas, creating outlines, and reviewing drafts to polish language and improve clarity.

The first sentence makes it seem like they just used it to improve sentence structure etc., but the second line makes it seem like they used it for 90% of the work. Which one is true?

•

ares623 4 days ago

Your question answered it I think. The first sentence aims to mislead. The second sentence covers their ass.

I'd love to see the number of man hours that led to that sentence, and how proud they were to have come up with it.

•

Xx_crazy420_xX 2 days ago

4 different databases when you could just use Postgres. It also seems that the 'Think and Plan' and 'Reflect' phases are redundant, as stated: 'Think & Plan: Process Reflection'. A more personal opinion is that LangGraph is an unnecessary framework that only slows you down by spiking up complexity.

Not sure how you manage to measure Faithfulness and Answer Relevancy on the live system, without the ground truth.

Good that you have evals in place, but the user satisfaction score might suggest running ablations on the system would be beneficial. I would start by reducing the iterations and unnecessary steps from the agent.

•

shasyn 9 hours ago

LLMs are great planners. They are terrible executors

WE'RE WORKING ON AN ARCHITECTURE, YOU CAN JOIN OUR EARLY TESTERS IF YOU WANT

•

Littice 4 days ago

The part about context discipline feels underrated. Larger context windows don’t remove the need to decide what the model shouldn’t see.

•

sarangk90 4 days ago

Totally!

•

AJRF 3 days ago

Seeing this article and seeing the replies. Oof. Maybe Thoughtworks did some good work in traditional software engineering (not sure), but why would you trust them to touch anything related to LLMs? They don't seem to know what they are doing.

•

andrew_lettuce 3 days ago

They trade on the brand name and Martin Fowler's reputation, but even in their heyday were considered pedantic architecture astronauts by many of us trying to get shit done.

•

dominotw 2 days ago

I worked with them at Sears. They were always such smug, holier-than-thou douchebags. There were some good, talented people, but as a group they were complete douchebags. They were not even smart; not sure why they were so smug.

•

mhitza 3 days ago

The infra/agent architecture to run all this is becoming a bit much, if we want to have (some level of) assurable output in any area of their application. It's not only medical research that needs citation-backed output.

This has been an ongoing project since 2024. I would like to see some KPI metrics to back up any productivity or job satisfaction improvement in the research department, or what have you, at Bayer.

Monthly average token usage would be another interesting piece of information to read about, paired with any latency numbers (time to first token, for example).

•

ThePhysicist 4 days ago

I think for a mostly search-focused use case like the one presented here, AI is great, as you don't ask it to build stuff or invent new drugs; you just want to retrieve relevant documents with laser precision, and agents can do that.

I think right now I'm mostly disappointed with agents writing code, as they always degrade the quality of the codebase after a while. The same goes for writing in general, which just requires a ton of editing and mostly just sounds good but doesn't have a lot of substance in the end. I think you can really tell that these systems are trained to just produce plausible streams of text. Especially in longer artefacts, you notice that locally the inner consistency of what they produce is great, but globally it really falls apart; it's like seeing the limits of their "intelligence".

For search, however, I really like AI. It has improved information retrieval so much for me: where before I had to think about which keywords to use and combine and which filters to apply, describing what I'm looking for in plain text and then having the AI find it for me feels magical. Recently I wanted to find an artist that I heard in some old episode of the KEXP Runcast (a running podcast), and I didn't remember anything except that it was rap with a kind of monotone voice, a fast beat and a strong accent. Google's agent asked a few clarifying questions and after a few rounds it found the artist for me, Genesis Owusu. That's why I think Google will win in the AI-assisted search market: they just have the best integration between fast and reasonably "smart" agents and high quality search data. Claude or ChatGPT are too slow and don't have fast enough data retrieval, it seems; using them for search feels quite sluggish in comparison.

•

maccard 4 days ago

> you just want to retrieve relevant documents with laser precision

My experience with using LLMs for search is that they do _not_ have laser precision. Far from it in fact. If you want to retrieve documents with laser precision AI is the wrong tool. If you want a fuzzy, lossy synthesised query response based on those documents, LLMs are great.

•

hirako2000 3 days ago

I happen to have made another attempt at trusting an agentic tool, the latest "Continue" with a frontier model: 8M tokens burned in 30 minutes. App does not work.

•

ai_slop_hater 4 days ago

> Sarang Kulkarni is a Principal Consultant at Thoughtworks

> teaches an O’Reilly course on building production-ready RAG applications

isn't this basically saying that you are a scammer? or am I paranoid?

•

orochimaaru 3 days ago

Is it really required to get personal here?

•

ai_slop_hater 3 days ago

Nothing personal, just pointing out scam and botting on HN

•

agentdev001 3 days ago

I find papers/articles which discuss solutions that rely heavily on a model in the middle unreadable, if the models used are not discussed.

The data you need to get into context for a small model, vs a big boy frontier model, vs a fine tuned open weight big boy- are all very different. I can understand what they're doing here, and most of the 'why', but- not all of the why.

•

yieldcrv 4 days ago

The funniest part of these systems is that I build these massive prompt concatenating controllers with a schema to constrain what the LLM sees and parses, usually a frontier model like Gemini

The model gets it wrong on occasion and I check the input file with Claude/Opus and it just laughs at how simple it is to get the document right

And in the back of my mind I’m thinking why am I not just sending the file through Opus

•

padolsey 4 days ago

These vast multi-agentic systems with roles like 'Researcher', 'Writer' (with a review loop), 'Reflection agent', seem to ~feel~ mostly right but lack evals as to the merit of agent decomposition. So it forms a satisfying enough flowchart but I see no evidence these authors actually tried other approaches or agent roles. And let's be honest: an agent is just a system prompt and output contracts, and these rich architectures seem to be pontificating beyond their worth. It all feels a bit vibe-y.

•

flir 4 days ago

I've been wondering lately if old-school cybernetics might help there.

But my off-the-cuff, uninformed opinion is that the precise structure doesn't matter too much, and the impact these structures really have is that they allow More Tokens for the Token Furnace.

•

niyikiza 4 days ago

What would the benefit be? A mega agent that does everything?

There are some well-documented advantages of decomposition... that's why the industry favours microservices over monoliths.

•

mattmanser 4 days ago

Bad architects favour microservices over monoliths.

YAGNI almost always applies to microservices, and the coordination overhead and boilerplate they add introduces immense costs, especially for smaller companies.

This homogenisation of architecture around Netflix size engineering has really cost our industry a lot.

•

marsven_422 4 days ago

You cannot

•

oytis 4 days ago

Absolutely. Either you use LLMs and tolerate unreliability or you are writing proper reliable software yourself

•

ai_slop_hater 4 days ago

Why is the comment from padolsey dead? Seriously, something fishy is going on on this website.