    JurisTech's 2026 LLM Benchmark for AI Hallucination in Finance

    Last updated: 14/05/2026

    Loan applications often contain errors, omissions, or inconsistent information. The key question is whether AI models used to assess these submissions will recognise those gaps or make optimistic assumptions when caution is needed. To test this, JurisTech evaluated GPT-5, Gemini, Claude, Grok, Kimi, GLM, and Mimo using a benchmark designed to measure AI hallucination in finance.

    Why AI Hallucination in Finance Matters

    For banks and lenders exploring large language models (LLMs), reliability involves more than asking the AI to review a company’s credit rating. If a company applying for a loan claims profitability but its bank statements do not support that claim, the AI should not take the application at face value.

    JurisTech tested this by crafting a set of innocent-looking multiple-choice questions. The opening questions have genuine correct answers; the later questions are plausible but invalid, with every available answer choice wrong.

    The aim is to test the gullibility of the LLM. We do not want the model to be merely helpful; we want it to show a strict attitude towards truth, consistent with the fiduciary duty of the workflows it supports.

    This benchmark builds on JurisTech’s earlier test on LLM performance in financial analysis, where models analysed a complex financial report with certain data masked. While that earlier evaluation focused on analytical accuracy under constrained conditions, this latest benchmark looks at the other side of reliability. A reliable model needs to know how to answer, but it also needs to know when to stop.

    Key Findings From JurisTech’s 2026 LLM Hallucination Benchmark

    • Gemini 3.1 Pro Preview High was the most consistently reliable model, achieving a Very Good rating under both prompt conditions.
    • GPT-5.4 High showed the strongest improvement under the truthful prompt, moving from Fair to Very Good.
    • GPT-5.5 High and Grok 4.20 High were generally cautious, though less consistent or less precise than the strongest performers.
    • Kimi K2.6 High and Claude Opus 4.7 High improved under the truthful prompt, but still showed weaknesses in certain cases.
    • GLM 5.1 High and Mimo V2.5 Pro High were the least reliable in this benchmark, with Mimo V2.5 Pro rated Very Bad under both prompt conditions.
    • Prompting helped some models, but not all. Better instructions reduced hallucination risk only when the model could follow them reliably.

    The broader finding is clear: LLM reliability cannot be judged by model choice alone. The prompt, workflow, and test conditions all affect whether a model refuses unsupported answers or fills in what was never provided.

    What JurisTech Tested

    JurisTech created a 45-question multiple-choice benchmark with three sections.

    Section             Questions  Purpose
    General knowledge   15         Questions with real, correct answers
    Science             15         Questions with no correct answer
    Finance             15         Questions with no correct answer

    The general knowledge questions came first. The science and finance questions came after, and every answer choice in those two sections was wrong. The same 45 questions were tested under a neutral prompt and a truthful prompt. No retrieval augmentation, agentic workflow, or external tools were used, so the comparison focused on the model’s own response behaviour.
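A benchmark of this shape can be scored by counting correct answers on the valid section separately from correct refusals on the invalid sections. The sketch below is a hypothetical illustration of that scoring logic; the question data, the `REFUSAL` marker, and the `score` helper are assumptions for this example, not JurisTech's actual harness.

```python
# Hypothetical scoring sketch for a mixed valid/invalid question benchmark.
# Question data and the refusal marker are illustrative assumptions.

REFUSAL = "NONE"  # what a model should return when no option is correct

questions = [
    # (section, has_correct_answer, correct_option_or_None)
    ("general", True, "B"),
    ("science", False, None),   # every option is wrong by design
    ("finance", False, None),
]

def score(answers):
    """Count correct answers and correct refusals separately."""
    correct = refused = hallucinated = 0
    for (section, has_answer, key), given in zip(questions, answers):
        if has_answer:
            correct += given == key
        elif given == REFUSAL:
            refused += 1        # the desired behaviour on invalid questions
        else:
            hallucinated += 1   # the model forced an unsupported answer
    return correct, refused, hallucinated

print(score(["B", "NONE", "C"]))  # → (1, 1, 1)
```

Keeping the three tallies separate matters: a model that refuses everything would look safe on the invalid sections while failing the general knowledge section, so both numbers are needed to judge reliability.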

    For the full methodology, prompt screenshots, model council setup, scoring details, and model-by-model findings, read the JurisTech 2026 LLM Hallucination Benchmark Research Note.

    Benchmark Results

    Eight models completed both test conditions.

    LLM                          Neutral prompt  Truthful prompt
    gemini-3.1-pro-preview-high  Very Good       Very Good
    gpt-5.5-high                 Very Good       Good
    grok-4.20-high               Good            Good
    gpt-5.4-high                 Fair            Very Good
    kimi-k2.6-high               Fair            Good
    claude-opus-4.7-high         Bad             Fair
    glm-5.1-high                 Bad             Poor
    mimo-v2.5-pro-high           Very Bad        Very Bad

    The ratings measure how reliably each model recognised missing information, rejected invalid questions, and avoided inventing values. Very Good reflects the strongest refusal behaviour. Good and Fair indicate more mixed results. Bad, Poor, and Very Bad indicate increasing reliance on unsupported assumptions.

    What the Results Show

    The strongest models treated missing information as a boundary, rather than a gap to fill.

    Gemini 3.1 Pro Preview High was rated Very Good under both prompts, making it the most consistently reliable model in this benchmark. GPT-5.4 High showed the strongest prompt-sensitive improvement, moving from Fair under the neutral prompt to Very Good under the truthful prompt.

    GPT-5.5 High and Grok 4.20 High were generally cautious, though less consistent or less precise than the strongest performers. Kimi K2.6 High and Claude Opus 4.7 High improved under the truthful prompt, but did not reach the top tier.

    The weakest results came from GLM 5.1 High and Mimo V2.5 Pro High. GLM 5.1 High became less reliable under the truthful prompt, while Mimo V2.5 Pro High remained Very Bad under both prompt conditions.

    Across the results, the differences were behavioural as much as technical. Some models were cautious by default. Some became safer when prompted clearly. Others continued producing answers even after acknowledging that information was missing. Those differences become far more important once a model is placed inside a real financial workflow.

    Why Prompting Changed the Results

    The truthful prompt exposed an important control point. Some models became more cautious when instructed to answer truthfully and honestly. They were more likely to flag missing information and less likely to force an answer when the question could not be answered as stated.

    In financial workflows, a neutral prompt can leave the model focused on completion. It may try to pick an answer, satisfy the format, and move on. A truthful prompt gives the model a stronger instruction to stay within the evidence.
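The two prompt conditions can be pictured as a pair of system prompts wrapped around the same question. The wording below is an assumption for illustration, not the exact prompts used in the benchmark, and `build_messages` is a hypothetical helper.

```python
# Illustrative neutral vs truthful prompt pair; the wording is an
# assumption, not the benchmark's actual prompts.

NEUTRAL_PROMPT = "Answer the following multiple-choice question."

TRUTHFUL_PROMPT = (
    "Answer truthfully and honestly. If none of the answer choices is "
    "correct, or if required information is missing, say so instead of "
    "choosing an option."
)

def build_messages(system_prompt, question):
    """Assemble a chat-style payload for one prompt condition."""
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": question},
    ]
```

Running the same 45 questions through both payloads, with everything else held constant, is what isolates the effect of the instruction itself.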

    The improvement was not universal. GLM 5.1 became less reliable under the truthful prompt, while Mimo V2.5 Pro remained Very Bad under both conditions. Better instructions only help when the model can follow them reliably.

    Why Public Benchmarks Need Context

    Public hallucination benchmarks are useful, but they do not always predict how a model will behave in a specific financial workflow.

    Artificial Analysis’s AA-Omniscience Hallucination Rate ranked Mimo V2.5 Pro as having the second-lowest hallucination rate. In JurisTech’s test, the same model ranked last.

    That gap shows how much benchmark design can influence the result. Public benchmarks can help with shortlisting, but the real test is how the model behaves on the institution’s own prompts, documents, edge cases, and failure modes.

    How Banks Can Reduce AI Hallucination in Finance

    From this latest benchmark, three practical steps stand out.

    First, require truthfulness and non-speculation in prompts. The system prompt should instruct the LLM to answer truthfully, avoid speculation, and refuse when information is missing.

    Second, test models against your own use cases. A model that looks strong in a public benchmark may behave differently when tested against real financial documents, incomplete data, or workflow-specific prompts.

    Third, include out-of-band testing. If a prompt is designed for financial analysis, pass in a question with missing variables and observe whether the model invents the missing information. If it is designed to summarise a credit memo, give it an incomplete memo and see whether it flags the gaps or produces a polished summary anyway.
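An out-of-band check like this can be automated with a simple pass/fail assertion. The sketch below assumes a generic model client exists elsewhere; the memo text, the gap markers, and the `flags_gaps` heuristic are all illustrative assumptions, and a production harness would use a more robust check than keyword matching.

```python
# Minimal out-of-band test sketch: feed an incomplete credit memo and
# check whether the response acknowledges the gaps. The memo content
# and the keyword heuristic are illustrative assumptions.

INCOMPLETE_MEMO = """Borrower: Acme Sdn Bhd
Requested amount: RM 2,000,000
Revenue (FY2025): [MISSING]
Debt service coverage ratio: [MISSING]"""

GAP_MARKERS = ("missing", "not provided", "cannot", "insufficient")

def flags_gaps(response: str) -> bool:
    """Pass if the model acknowledges the gaps rather than summarising anyway."""
    lowered = response.lower()
    return any(marker in lowered for marker in GAP_MARKERS)

# A response that flags the gaps passes; a polished-but-unsupported
# summary fails.
assert flags_gaps("Revenue is not provided, so DSCR cannot be computed.")
assert not flags_gaps("Acme shows strong revenue and healthy coverage.")
```

The point of the test is the boundary, not the wording: any response that produces a confident assessment from the incomplete memo should fail, whatever phrasing detector is used.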

    These tests show whether the model understands the task boundary, rather than simply trying to complete the request at all costs.

    AI Governance Implications

    Reducing AI hallucination risk cannot stop at better prompt-writing. Before an LLM enters a live financial workflow, governance teams should be able to show which prompt was approved, which model was selected, what failure cases were tested, and how the model behaved when the safest answer was to refuse.

    Conclusion: Better LLM Decisions Require Better Evidence

    A reliable model must answer accurately, but it also needs to recognise when the available information is not enough.

    This benchmark tested that behaviour directly. When the questions had no correct answer, the strongest models recognised the missing information and refused to invent what was never provided. The weaker models filled in the gaps, made assumptions, and produced answers that looked complete while resting on unsupported inputs.

    For banking leaders, AI hallucination in finance is ultimately a question of evidence. Model choice, prompting, and use-case-specific testing all shape reliability. Public benchmarks can provide useful context, but they cannot replace internal evaluation against real financial scenarios.

    JurisTech’s Hallucination Benchmark gives finance teams a clearer way to judge LLMs under missing information, where the most important answer may be the one the model refuses to give.

    About JurisTech

    JurisTech is a global lending and recovery solutions provider specialising in enterprise-class software for banks, financial institutions, insurance providers, automotive, and telecommunications companies across Malaysia, Southeast Asia, and beyond. Founded in 1997, JurisTech supports the full lending lifecycle, from digital onboarding and loan origination to credit decisioning, documentation, collections, recovery, and enterprise AI adoption.

    Built on cloud-native, microservices, and composable architecture, JurisTech’s solutions help financial institutions modernise critical credit operations with greater speed, scalability, and control. JurisTech has been mentioned across multiple Gartner® reports, including as a Representative Provider for Lending Ecosystems, a Representative Vendor for Commercial Loan Origination Solutions, and a Sample Vendor for Commercial Banking Onboarding. JurisTech was also referenced in Gartner reports covering AI agents in loan orchestration and trade finance, predictive AI and synthetic data for lending risk assessment, and essential AI services for banking. Living by its motto, “360 AI Lending Tech | Fast. Proven. Secure.”, JurisTech is committed to redefining financial services to uplift lives, strengthen economies, and create lasting industry impact.

    About the Author:

    John is an award-winning technopreneur with many years of experience in software development. He is the co-founder and CTO of JurisTech.