Last year, “hallucinations” produced by generative artificial intelligence (GenAI) were in the spotlight in court, in court again, and, certainly, all over the news. More recently, Bloomberg News reported that in their 2024 annual reports, “Goldman Sachs Group Inc., Citigroup Inc., JPMorgan Chase & Co. and other Wall Street firms are warning investors about new risks from the increasing use of artificial intelligence, including software hallucinations, employee-morale issues, use by cybercriminals and the impact of changing laws globally.”
Meanwhile, Michael Barr, who recently departed as the US Federal Reserve’s vice chair for supervision, foreshadowed these concerns in extemporaneous remarks he made in February at the Council on Foreign Relations. There he said that competitive pressure around incorporating generative artificial intelligence could heighten risks in financial services. That pressure “may push all institutions, including regulated institutions, to take a more aggressive approach to genAI adoption,” increasing governance, alignment, and financial risks around AI, Barr said.
I couldn’t agree more. That’s why we at FICO have always advocated for operationalising GenAI responsibly, using solutions like focused language models (FLMs) and focused task models to thwart hallucinations before they occur. In this article I’ll provide more background on GenAI hallucinations and discuss focused language models, FICO’s GenAI solution for helping ensure that the “golden age of AI” remains bright.
Hallucinations are no illusion
GenAI hallucinations are indeed problematic. For example, researchers at Stanford University last year found that general-purpose GenAI tools like ChatGPT have an error rate as high as 82% when used for legal purposes. GenAI tools purpose-built for legal applications fare better, producing hallucinations 17% of the time according to a different Stanford study, but they still shouldn’t be used without close (and time-consuming) scrutiny.
Regardless of the hallucination rate, the problem is exacerbated, in any industry, by the human consuming the GenAI output: they may act on it directly without noticing the hallucination or validating the output.
The fuel that stokes the fire
Factors that can lead to GenAI hallucinations include:

- The type, quality, quantity, and breadth of data used for pre-training.
- Low pre-training data coverage for key tokens and topics in the prompt. This relates to associating words, or groups of words, with the statistics of how they appear in a prompt or are used in an answer. If there is insufficient coverage, the LLM may make inferences based on noise rather than on clear signals supported by strong coverage.
- Lack of self-restraint in LLM inference, which does not prohibit the use of low-coverage pre-training examples in responses. The issue stems from most LLMs not considering whether there is sufficient coverage to form their responses, instead assuming the response is statistically sound; they do not check when coverage is too low to adequately support an answer. Ideally, when this situation occurs, the LLM should indicate that it doesn’t have enough information to provide a reliable response (see the sketch after this list).
- Lack of understanding that retrieval-augmented generation (RAG) can increase the rate of hallucination by desensitising or destabilising relationships learned by the foundational model during its original pre-training. RAG can over-emphasise and change statistics locally in the prompt in unnatural ways.
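To make the self-restraint point concrete, here is a minimal sketch of the kind of abstention check an LLM runtime could apply: if too many tokens in a candidate response were generated with low model confidence (low log-probability), decline to answer. The thresholds and the example token log-probabilities are hypothetical placeholders, not values from any particular model or from FICO.

```python
# Illustrative sketch only: abstain when a response rests on too many
# weakly supported tokens. Token log-probabilities are assumed to come
# from whatever LLM runtime you use; thresholds are hypothetical.

from typing import List, Tuple

LOW_CONFIDENCE_LOGPROB = -4.0   # hypothetical per-token confidence floor
MAX_LOW_CONFIDENCE_SHARE = 0.2  # hypothetical share of weak tokens allowed

def should_abstain(token_logprobs: List[Tuple[str, float]]) -> bool:
    """Return True if too many tokens were generated with low confidence."""
    if not token_logprobs:
        return True  # nothing to assess; do not answer
    low = sum(1 for _, lp in token_logprobs if lp < LOW_CONFIDENCE_LOGPROB)
    return low / len(token_logprobs) > MAX_LOW_CONFIDENCE_SHARE

# Example: several tokens in this (made-up) answer were weakly supported.
sample = [("The", -0.1), ("ruling", -0.3), ("cited", -4.8),
          ("Smith", -6.2), ("v.", -0.2), ("Jones", -5.1)]
if should_abstain(sample):
    print("Insufficient coverage to provide a reliable response.")
else:
    print("Proceed with the response.")
```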
Hallucinations are hard to see
Detecting hallucinations is difficult because LLM algorithms are usually not interpretable and provide no visibility to justify their responses. Even if a RAG context was referenced in a response, human inspection may reveal that it was not actually used to produce it.
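One crude way to triage this, sketched below under the assumption that you have both the response and the retrieved context as plain text, is to measure how much of the response’s vocabulary actually appears in that context. Low overlap does not prove a hallucination, but it flags responses for the close human scrutiny described above. The stop-word list and example strings are illustrative only.

```python
# Illustrative grounding check (an assumption, not a FICO tool): estimate
# whether a response actually draws on the retrieved RAG context by
# measuring lexical overlap between the two texts.

import re

def _content_words(text: str) -> set:
    stop = {"the", "a", "an", "of", "to", "and", "in", "is", "are", "that", "for"}
    return {w for w in re.findall(r"[a-z']+", text.lower()) if w not in stop}

def context_overlap(response: str, retrieved_context: str) -> float:
    """Share of the response's content words that also appear in the context."""
    resp, ctx = _content_words(response), _content_words(retrieved_context)
    return len(resp & ctx) / len(resp) if resp else 0.0

context = "The 2023 policy caps variable APR at 29.9% for new card accounts."
response = "New card accounts are capped at a 29.9% variable APR under the 2023 policy."
print(f"Context overlap: {context_overlap(response, context):.2f}")
# Low values warrant closer human review before the response is acted upon.
```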
The best way to minimise hallucinations is to build your own pre-trained, foundational generative AI model. Many organisations are already using, or planning to use, this approach via focused-domain and task-based models. By doing so, one gains critical control of the data used in pre-training, where most hallucinations arise, and can constrain the use of context augmentation so that it reinforces relationships already in the pre-training rather than increasing hallucinations.
Beyond building your own focused generative models, one needs to minimise the harm created by hallucinations. Enterprise policy should prioritise the process for how the output of these tools will be used in a business context, and then validate everything.
FLMs are focused on delivering accurate answers
FICO’s approach to using GenAI responsibly starts with the concept of small language models (SLMs), which, as the name suggests, are smaller and less complex than LLMs. SLMs are designed to perform specific language tasks efficiently and are built with fewer parameters, often on smaller training data sets. Like LLMs, SLMs are available from multiple providers and come with many of the same challenges, although often at reduced risk.
My approach to achieving responsible GenAI concentrates SLM applications further into a “focused language model” (FLM), a new concept in SLM development built around a smaller but very deliberate data store specific to a very narrow domain or task. That fine level of specificity ensures the appropriate high-quality, high-relevance data is chosen; later, you can painstakingly tune the model (“task tuning”) to further ensure it is correctly focused on the task at hand.
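As an illustration of that deliberateness, the sketch below filters a candidate corpus down to documents that are dense in a hypothetical domain vocabulary and long enough to be useful. The vocabulary, thresholds, and quality checks are assumptions for the example, not FICO’s curation pipeline.

```python
# Illustrative sketch only: curate a narrow, task-specific data store by
# keeping documents dense in a deliberately chosen domain vocabulary and
# meeting a basic quality bar. All values here are hypothetical placeholders.

import re
from typing import Dict, List

DOMAIN_VOCAB = {"chargeback", "issuer", "acquirer", "interchange", "dispute"}  # hypothetical
MIN_DOMAIN_DENSITY = 0.05   # hypothetical: at least 5% of tokens from the domain vocabulary
MIN_TOKENS = 50             # hypothetical: discard fragments too short to be useful

def keep_for_pretraining(doc: str) -> bool:
    """Keep a document only if it is long enough and domain-dense enough."""
    tokens = re.findall(r"[a-zA-Z]+", doc.lower())
    if len(tokens) < MIN_TOKENS:
        return False
    density = sum(t in DOMAIN_VOCAB for t in tokens) / len(tokens)
    return density >= MIN_DOMAIN_DENSITY

def curate(corpus: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Return only the documents deliberate enough for the focused data store."""
    return [d for d in corpus if keep_for_pretraining(d["text"])]
```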
The FLM approach is distinctly different from commercially available LLMs and SLMs, which offer no control over the data used to build the model; that control is crucial for preventing hallucinations and harm. A focused language model enables GenAI to be used responsibly because:
- It affords transparency and control over the appropriate, high-quality data on which a core domain-focused language model is built.
- On top of industry domain-focused language models, users can create task-specific focused language models with tight vocabulary and training contexts for the task at hand.
- Further, due to the transparency and control of the data, the resulting FLM can be accompanied by a trust score with every response, allowing risk-based operationalisation of GenAI; trust scores measure how responses align with the FLM’s domain and/or task knowledge anchors (truths). A sketch of such risk-based routing follows below.
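To illustrate what risk-based operationalisation could look like in practice, the sketch below routes each response by its accompanying trust score: use it, send it for human review, or reject it. The thresholds, the FLMResponse structure, and the routing actions are hypothetical assumptions for this example, not FICO’s published implementation.

```python
# Hypothetical illustration of risk-based operationalisation: gate how a
# response may be used based on its trust score. Thresholds are assumptions.

from dataclasses import dataclass

@dataclass
class FLMResponse:
    text: str
    trust_score: float  # assumed 0-1 alignment with the FLM's knowledge anchors

AUTO_USE_THRESHOLD = 0.90      # hypothetical
HUMAN_REVIEW_THRESHOLD = 0.70  # hypothetical

def route(response: FLMResponse) -> str:
    """Decide how a response may be used, based on its trust score."""
    if response.trust_score >= AUTO_USE_THRESHOLD:
        return "use"           # high alignment with domain/task anchors
    if response.trust_score >= HUMAN_REVIEW_THRESHOLD:
        return "human_review"  # plausible, but must be validated before acting
    return "reject"            # insufficient grounding; do not act on it

print(route(FLMResponse("Decline reason: insufficient collateral history.", 0.82)))
# -> "human_review"
```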
Scott Zoldi is Chief Analytics Officer at FICO