AI agents are quickly moving from experimentation to production inside financial institutions. Banks and fintechs are testing them for onboarding, fraud triage, transaction monitoring, customer communication and even full investigative work. 
At the same time, model risk teams are already stretched. They are being asked to validate more models, more frequently, as expectations rise. In that environment, agentic AI will only scale safely if governance and evaluation are built into the system from the start.


The conversation has largely focused on what these systems can do. Can they reason across complex data? Can they orchestrate workflows? Can they draft narratives or summarise investigations? 
Those are important questions. But they are not the ones that determine whether agentic AI belongs in regulated financial environments. The real question is simpler: What happens when the agent hallucinates? 

The autonomy-accountability gap 

AI agents don’t behave like the deterministic software financial infrastructure was built on. They’re probabilistic systems operating in interactive loops, meaning the same objective can produce different paths, and failures often appear only after multiple steps. That’s why the National Institute of Standards and Technology, in its AI Risk Management Framework, treats generative systems as lifecycle risks that require ongoing measurement and oversight rather than one-time testing.  

Core banking systems, payment rails and compliance workflows are built on predictable logic. Given the same inputs, they are expected to produce the same outputs. They can be unit tested, regression tested and certified. 

Agentic systems do not behave that way. The same prompt may yield slightly different results. Edge cases may surface in unexpected ways. Performance may drift over time as data patterns change. 

In a consumer app, “mostly correct” may be acceptable. In financial compliance, “mostly correct” can still fail the standard. If an AI agent drafts an inaccurate Suspicious Activity Report (SAR) narrative, skips required investigative steps, or drives an inconsistent disposition, the issue is not cosmetic. It becomes a control failure that the institution must be able to defend under model risk management expectations set by supervisors such as the Federal Reserve.

This creates what I would call an autonomy-accountability gap. Institutions are adopting systems that act with a degree of autonomy, but the accountability framework around those systems has not always kept pace. 

Guardrails are not optional 

In many organisations, governance is treated as a layer added after capability is proven. Teams focus on getting the agent to perform. Monitoring and oversight are addressed later. 

In low-risk software, you can bolt controls on later. With agentic systems, risk emerges over time. It shows up in how the system uses tools, retries actions, escalates decisions, and interacts across workflows. That’s why modern guidance increasingly treats guardrails, evaluation, and ongoing monitoring as core lifecycle requirements, not post-launch instrumentation.  

If you cannot guard the system, you should not deploy it. 

Guarding an agent is not about limiting innovation. It is about recognising that probabilistic systems operating inside deterministic regulatory regimes require technical controls. Policies and documentation alone are not enough. The guardrails must be engineered into the product from the start. 

In practice, that means building a structured evaluation and supervision framework before the agent goes live. 

The evaluation framework is the real product 

Many teams treat evaluation as a QA phase. For agents, evaluation becomes part of the core system. It is how you measure behaviour across multi-step workflows, detect drift, and demonstrate policy adherence over time. Research benchmarks for LLM agents have emerged precisely because single-turn testing misses the failure modes that matter in interactive systems. 

A production-ready agent requires three distinct layers: deterministic control, observability and continuous optimisation. 

Deterministic Control (Safety Rails) establishes hard constraints the agent cannot bypass. Agents are probabilistic systems. Regulatory obligations are not. That tension must be resolved through hard constraints embedded in the workflow. Deterministic controls act as safety rails, enforcing policy rules, required investigative steps, data access boundaries, and escalation triggers that the agent cannot skip.  

Even if the underlying model drifts or produces unexpected outputs, these controls ensure that results remain within defined regulatory and operational limits. In compliance environments, this layer is non-negotiable.  
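
To make the idea concrete, here is a minimal sketch, in Python, of how such safety rails might sit outside the model. The case structure, required steps and threshold are all hypothetical; a real deployment would encode the institution’s own policy rules and actions.

```python
from dataclasses import dataclass, field

# Illustrative only: policy constraints enforced in deterministic code,
# outside the model, so the agent cannot bypass them.

@dataclass
class CaseState:
    """Minimal record of the investigation the agent is working on."""
    case_id: str
    completed_steps: set = field(default_factory=set)
    risk_score: float = 0.0

# Assumed policy values, for illustration only.
REQUIRED_STEPS = {"sanctions_screening", "transaction_review", "customer_profile_check"}
ESCALATION_THRESHOLD = 0.8

def enforce_guardrails(state: CaseState, proposed_action: str) -> str:
    """Deterministic gate applied to every action the agent proposes."""
    if proposed_action == "close_case":
        # Required investigative steps cannot be skipped before disposition.
        missing = REQUIRED_STEPS - state.completed_steps
        if missing:
            return f"blocked: missing required steps {sorted(missing)}"
        # High-risk cases must be escalated to a human reviewer, never auto-closed.
        if state.risk_score >= ESCALATION_THRESHOLD:
            return "escalated: risk score above threshold, human review required"
    return "allowed"

# Example: the agent tries to close a case before finishing mandatory checks.
state = CaseState("C-1042", completed_steps={"sanctions_screening"}, risk_score=0.3)
print(enforce_guardrails(state, "close_case"))
```

Because the gate runs in ordinary code rather than in a prompt, it behaves identically on every run and can be unit tested, which is precisely the property the underlying model cannot guarantee.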

Observability (The Traceability Matrix) provides the defined measures and records needed to track the system. You cannot manage what you cannot see. Institutions must be able to reconstruct how an output was generated, including the data inputs, intermediate reasoning steps, and any tools the agent used along the way.  

This level of traceability transforms an AI system from a black box into an auditable process. It enables internal validation, supports model risk governance, and allows institutions to respond confidently to supervisory inquiries. Without structured observability, accountability is theoretical.  
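
As a sketch of what structured observability can look like, the snippet below records every data input, tool call and model output against a single trace identifier. It is a simplified, hypothetical recorder; a production system would write to the institution’s audit and logging infrastructure rather than to an in-memory list.

```python
import json
import time
import uuid

# Hypothetical trace recorder: every data input, tool call and model output
# is logged so an output can be reconstructed step by step during audit.

class AgentTrace:
    def __init__(self, case_id: str):
        self.trace_id = str(uuid.uuid4())
        self.case_id = case_id
        self.events = []

    def record(self, event_type: str, detail: dict) -> None:
        """Append a timestamped event to the trace."""
        self.events.append({
            "trace_id": self.trace_id,
            "case_id": self.case_id,
            "timestamp": time.time(),
            "event_type": event_type,   # e.g. "data_input", "tool_call", "model_output"
            "detail": detail,
        })

    def export(self) -> str:
        """Serialise the full trace for audit storage or supervisory review."""
        return json.dumps(self.events, indent=2)

# Example: reconstructing how a draft SAR narrative was produced.
trace = AgentTrace(case_id="C-1042")
trace.record("data_input", {"source": "transaction_store", "records": 37})
trace.record("tool_call", {"tool": "sanctions_screening", "result": "no_match"})
trace.record("model_output", {"artifact": "draft_sar_narrative", "model": "agent-v1"})
print(trace.export())
```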

Continuous Optimisation (The LLM as a Judge Loop) is the last key layer. Agent performance cannot be assumed to remain stable. It must be evaluated continuously.  

Leading institutions are beginning to implement structured evaluation loops that compare agent outputs against curated benchmark datasets (Golden Datasets) and real-world cases. In some implementations, a secondary governed model is used to assess the primary agent’s outputs for accuracy, policy adherence, and completeness.  

This “model reviewing model” approach, when tightly controlled, can identify hallucinations, tone drift, and compliance gaps before outputs reach customers or regulators. Continuous optimisation closes the loop between deployment and accountability.  
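
A simplified sketch of such a loop is shown below. The golden dataset entries, score names and threshold are assumptions, and judge_model is a stand-in for whichever separately governed reviewer model the institution approves; a token-overlap score is used here only to keep the example self-contained.

```python
# Hypothetical evaluation loop: a secondary "judge" model scores the primary
# agent's outputs against a curated golden dataset.

GOLDEN_DATASET = [
    {"case_id": "G-001", "input": "...", "expected": "SAR filed, narrative cites structuring pattern"},
    {"case_id": "G-002", "input": "...", "expected": "No SAR, activity consistent with customer profile"},
]

def judge_model(agent_output: str, expected: str) -> dict:
    """Stand-in for a governed reviewer model. A real implementation would call
    a separately validated model; token overlap is used here only so the sketch runs."""
    expected_tokens = set(expected.lower().split())
    output_tokens = set(agent_output.lower().split())
    overlap = len(expected_tokens & output_tokens) / max(len(expected_tokens), 1)
    return {"accuracy": overlap, "policy_adherence": overlap, "completeness": overlap}

def evaluate_agent(run_agent, threshold: float = 0.9) -> list:
    """Run the agent over the golden dataset and flag cases scoring below threshold."""
    flagged = []
    for case in GOLDEN_DATASET:
        output = run_agent(case["input"])
        scores = judge_model(output, case["expected"])
        if min(scores.values()) < threshold:
            flagged.append({"case_id": case["case_id"], "scores": scores, "output": output})
    return flagged

# Example: a trivial agent stub that always produces the same disposition.
flagged = evaluate_agent(lambda prompt: "No SAR required for this customer")
print(f"{len(flagged)} golden cases fell below threshold and need human review")
```

In practice, flagged cases would route to human reviewers before any output reaches a customer or regulator.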

From innovation to accountability

Regulators are increasingly focused on how AI-driven decisions are governed. Existing supervisory guidance already requires institutions to validate, monitor, and control models that influence risk decisions. Those expectations are becoming central to discussions around agentic AI as well.  

Globally, supervisors such as the Basel Committee on Banking Supervision are examining how digitalisation and machine learning reshape banking risk profiles, reinforcing that governance must evolve alongside capability.  

Institutions that deploy agentic systems without a defensible evaluation framework may find themselves explaining not only what the system was designed to do, but why it was allowed to act without sufficient oversight. 

The institutions that succeed with agentic AI will not be those that move fastest to deploy. They will be those that move deliberately, embedding control, monitoring and optimisation into the architecture from day one. 

The industry’s focus should shift. The question is no longer whether an agent can solve a problem. It is whether the institution can control how it behaves and defend the decisions it produces. Trust in agentic systems does not come from capability alone. It comes from the ability to monitor, evaluate, and constrain them. In regulated finance, deployment should follow that standard.   

Lina Fabri, Senior Director of Product, ThetaRay