Artificial intelligence is enjoying an extended period of high visibility, of the kind usually associated with a brand-new technology. New models are being announced at pace by the major providers and are still framed as breakthroughs – expanded parameters, revamped compute, fresh capabilities, and so on.
However, away from the public perception of AI as a novel technology to be experimented with, financial institutions have already placed big bets and are now focussing on scale – establishing architectures that allow AI to be used widely across the organisation without sacrificing reliability.

This is not a philosophical shift but a commercial one, shaping how capital is allocated, how risk is managed, and how competitive advantage is built. And our latest research into how banks are approaching AI shows that the shift is already visible, as they leave experimentation behind and focus fully on execution.

In financial services, where AI systems must operate under regulatory scrutiny, deliver consistent performance, and run within finite and predictable costs, banks have focused from the outset on building effective guardrails to ensure their AI deployments are suited to real-world conditions.

But whereas in the early stages of the generative AI cycle banks were still prioritising frontier innovation – i.e. where can we use the latest off-the-shelf models, and fast – we’re now seeing the focus shift decisively to how systems can be scaled compliantly where it matters most.

The era of stress-testing 

One of the clearest signals in bank-led AI work is the elevation of evaluation and safety from secondary considerations to core design requirements.

Across the industry, there is growing recognition that impressive demonstrations in controlled settings are insufficient. These systems need to perform under real-world operational pressure, where errors carry financial, regulatory, and reputational consequences.

For example, Goldman Sachs’ researchers have shown that general-purpose language models struggle to detect hallucinations in long, complex financial documents. In other words, even highly capable systems can produce confident-sounding errors in precisely the contexts where accuracy matters most. They found that domain-specific adaptation and benchmarking significantly improve outcomes, underscoring the limits of off-the-shelf deployment.

Other institutions are focusing on monitoring rather than architecture alone. Research from BNY explores methods for identifying when an AI interaction becomes high-stakes – situations where a wrong answer should trigger additional oversight or escalation. This reflects a broader shift towards systems that can signal uncertainty, rather than operating uniformly across all use cases.

Meanwhile, JPMorgan Chase and Wells Fargo have highlighted that ‘machine unlearning’ – the idea that a model can reliably “forget” specific information such as personal data – is harder to verify than expected, raising important questions for data governance and compliance in deployed systems.

Taken together, these efforts point to a consistent conclusion: in banking, AI capability without rigorous evaluation and monitoring is risk, not progress.

Tailored systems are easier to deploy

Alongside evaluation, efficiency has emerged as a defining concern. Instead of relying on single, monolithic models, banks are increasingly exploring systems built from specialised components, in which complex tasks are broken down into multiple stages so that different models can handle different parts of a workflow.

For banks, this approach offers clear advantages. Systems built from smaller components are easier to govern, simpler to audit, and more adaptable as business or regulatory requirements change. Crucially, they also allow institutions to align compute spend with business value, rather than applying maximum resources indiscriminately.

Banks are increasingly using AI systems that activate only the parts of a model needed for a specific task, rather than running everything at full power all the time. They are also adopting techniques that make models smaller and more efficient without meaningfully affecting performance. The result: comparable or better results for less computing power.

Taken together, these approaches reflect a shift in how progress is measured: not by how large a model is, but by how much useful capability it delivers at scale for the resources it consumes. For banks operating under tight cost controls, this is not a technical preference – it is a practical requirement.

Why consistency beats cleverness

The past year has also seen significant interest in systems designed to reason step-by-step or take actions across tools and workflows – otherwise known as reasoning and agentic AI.

Here again, banking takes a notably pragmatic view. Recent findings suggest that many reasoning gains come from making existing use cases more consistent and reliable, rather than expanding what models can reason about in the first place. 

Evident’s analysis of Lloyds Banking Group illustrates this approach, with the bank focusing AI on tasks where there is a clear right and wrong answer, rather than open-ended questions. By testing AI on problems that can be checked automatically at scale, Lloyds is placing greater emphasis on consistency and reliability than on more subjective ideas of what “good” reasoning looks like.

And while banks’ interest in agentic systems is growing, their experiments to date reveal that failures in these systems often resemble organisational breakdowns rather than technical ones: unclear roles, weak coordination, and insufficient oversight. The priority is clear: solve the governance challenge first, before considering how and where banks’ AI systems can be granted greater autonomy.

From experimentation to execution

Taken together, these trends point to an industry entering a more mature phase of AI adoption, as the emphasis on rapid experimentation across myriad use cases narrows towards systems deliberately designed for efficiency, control, and predictable behaviour.

For banks, this represents a necessary transition. The characteristics that have made generative AI difficult to deploy at scale – unpredictable costs, hallucination risk, and unstable performance – are being actively addressed through more disciplined design choices.

This shift away from “shiny AI” reflects the real progress the leading banks are making towards embedding practical, functional AI into core everyday processes. We’re likely to see the centrepiece CEO announcements about major new AI initiatives replaced by equally headline-grabbing statements about the tangible returns banks are seeing from the investments they’ve already made.

Earnings calls over press releases, shareholder value over frontier R&D bets. 

The era of frontier AI innovation isn’t over – far from it. But across financial services, the institutions pulling ahead are the ones doubling down on smaller AI systems that perform reliably under real-world constraints. AI that works predictably will compound value. AI that dazzles but wobbles will not.

Alexandra Mousavizadeh, CEO and Co-founder, Evident