JurisTech’s 2026 LLM Benchmark For AI Hallucination in Finance

Last updated: 14/05/2026

Loan applications often contain errors, omissions, or inconsistent information. The key question is whether AI models used to assess these submissions will recognise those gaps or make optimistic assumptions when caution is needed. To test this, JurisTech evaluated GPT 5, Gemini, Claude, Grok, Kimi, GLM, and Mimo using a benchmark designed to measure AI hallucination in finance.

Why AI Hallucination in Finance Matters

For banks and lenders exploring LLMs (Large Language Models), reliability means more than simply asking the AI to review a company’s credit rating. If a company applying for a loan claims to be profitable but its bank statements do not support that claim, the AI should not take the figures at face value.

JurisTech tested this by crafting a set of innocent-looking multiple-choice questions that initially had real answers, then planting plausible-sounding questions in which every available answer choice is wrong.

This tests the gullibility of the LLM: we specifically do not want the model to be helpful at any cost, but to exhibit a strict attitude towards truth and fiduciary duty.

This benchmark builds on JurisTech’s earlier test of LLM performance in financial analysis, where models analysed a complex financial report with certain data masked. While that earlier evaluation focused on analytical accuracy under constrained conditions, this latest benchmark looks at the other side of reliability.
A reliable model needs to know how to answer, but it also needs to know when to stop.

Key Findings From JurisTech’s 2026 LLM Hallucination Benchmark

- Gemini 3.1 Pro Preview High was the most consistently reliable model, achieving a Very Good rating under both prompt conditions.
- GPT-5.4 High showed the strongest improvement under the truthful prompt, moving from Fair to Very Good.
- GPT-5.5 High and Grok 4.20 High were generally cautious, though less consistent or less precise than the strongest performers.
- Kimi K2.6 High and Claude Opus 4.7 High improved under the truthful prompt, but still showed weaknesses in certain cases.
- GLM 5.1 High and Mimo V2.5 Pro High were the least reliable in this benchmark, with Mimo V2.5 Pro rated Very Bad under both prompt conditions.
- Prompting helped some models, but not all. Better instructions reduced hallucination risk only when the model could follow them reliably.

The broader finding is clear: LLM reliability cannot be judged by model choice alone. The prompt, workflow, and test conditions all affect whether a model refuses unsupported answers or fills in what was never provided.

What JurisTech Tested

JurisTech created a 45-question multiple-choice benchmark with three sections.

| Section           | Number of questions | Purpose                              |
|-------------------|---------------------|--------------------------------------|
| General knowledge | 15                  | Questions with real, correct answers |
| Science           | 15                  | Questions with no correct answer     |
| Finance           | 15                  | Questions with no correct answer     |

The general knowledge questions came first. The science and finance questions came after, and every answer choice in those two sections was wrong. The same 45 questions were tested under a neutral prompt and a truthful prompt.
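As a rough illustration, the test design above could be represented as follows. This is a minimal sketch, not JurisTech’s actual harness: the field names, the refusal convention, and the scoring rule are assumptions for illustration only.

```python
from dataclasses import dataclass

@dataclass
class Question:
    section: str            # "general", "science", or "finance"
    text: str
    choices: list[str]
    has_valid_answer: bool  # False for the two trap sections

def score(question: Question, model_answer: str) -> bool:
    """Return True when the model's refusal behaviour is correct.

    A model should refuse (here, by answering "none") exactly when
    no answer choice is correct, and pick a choice otherwise.
    """
    refused = model_answer.strip().lower() in {"none", "none of the above"}
    if question.has_valid_answer:
        return not refused  # a real question should be answered
    return refused          # a trap question should be refused

# Example: a trap question in which every choice is unsupported.
q = Question(
    section="finance",
    text="What was the applicant's 2025 net margin?",
    choices=["12%", "18%", "25%", "31%"],  # none is supported by the data
    has_valid_answer=False,
)
print(score(q, "18%"))   # forced pick on a trap question -> False
print(score(q, "none"))  # explicit refusal -> True
```

The point of the split is that a model scoring well only on the first section is accurate but gullible; the trap sections measure whether it will refuse.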
No retrieval augmentation, agentic workflow, or external tools were used, so the comparison focused on the model’s own response behaviour.

For the full methodology, prompt screenshots, model council setup, scoring details, and model-by-model findings, read the JurisTech 2026 LLM Hallucination Benchmark Research Note.

Benchmark Results

Eight models completed both test conditions.

| LLM                         | Neutral prompt | Truthful prompt |
|-----------------------------|----------------|-----------------|
| gemini-3.1-pro-preview-high | Very Good      | Very Good       |
| gpt-5.5-high                | Very Good      | Good            |
| grok-4.20-high              | Good           | Good            |
| gpt-5.4-high                | Fair           | Very Good       |
| kimi-k2.6-high              | Fair           | Good            |
| claude-opus-4.7-high        | Bad            | Fair            |
| glm-5.1-high                | Bad            | Poor            |
| mimo-v2.5-pro-high          | Very Bad       | Very Bad        |

The ratings measure how reliably each model recognised missing information, rejected invalid questions, and avoided inventing values. Very Good reflects the strongest refusal behaviour. Good and Fair indicate more mixed results. Bad, Poor, and Very Bad indicate increasing reliance on unsupported assumptions.

What the Results Show

The strongest models treated missing information as a boundary, rather than a gap to fill.

Gemini 3.1 Pro Preview High was rated Very Good under both prompts, making it the most consistently reliable model in this benchmark. GPT-5.4 High showed the strongest prompt-sensitive improvement, moving from Fair under the neutral prompt to Very Good under the truthful prompt.

GPT-5.5 High and Grok 4.20 High were generally cautious, though less consistent or less precise than the strongest performers. Kimi K2.6 High and Claude Opus 4.7 High improved under the truthful prompt, but did not reach the top tier.

The weakest results came from GLM 5.1 High and Mimo V2.5 Pro High. GLM 5.1 High became less reliable under the truthful prompt, while Mimo V2.5 Pro High remained Very Bad under both prompt conditions.

Across the results, the differences were behavioural as much as technical. Some models were cautious by default. Some became safer when prompted clearly.
Others continued producing answers even after acknowledging that information was missing. Those differences become far more important once a model is placed inside a real financial workflow.

Why Prompting Changed the Results

The truthful prompt exposed an important control point. Some models became more cautious when instructed to answer truthfully and honestly. They were more likely to flag missing information and less likely to force an answer when the question could not be answered as stated.

In financial workflows, a neutral prompt can leave the model focused on completion. It may try to pick an answer, satisfy the format, and move on. A truthful prompt gives the model a stronger instruction to stay within the evidence.

The improvement was not universal. GLM 5.1 became less reliable under the truthful prompt, while Mimo V2.5 Pro remained Very Bad under both conditions. Better instructions only help when the model can follow them reliably.

Why Public Benchmarks Need Context

Public hallucination benchmarks are useful, but they do not always predict how a model will behave in a specific financial workflow.

Artificial Analysis’s AA-Omniscience Hallucination Rate ranked Mimo V2.5 Pro second for least hallucination. In JurisTech’s test, the same model ranked last.

That gap shows how much benchmark design can influence the result. Public benchmarks can help with shortlisting, but the real test is how the model behaves on the institution’s own prompts, documents, edge cases, and failure modes.

How Banks Can Reduce AI Hallucination in Finance

From this latest benchmark, three practical steps stand out.

First, require truthfulness and non-speculation in prompts. The system prompt should instruct the LLM to answer truthfully, avoid speculation, and refuse when information is missing.

Second, test models against your own use cases.
A model that looks strong in a public benchmark may behave differently when tested against real financial documents, incomplete data, or workflow-specific prompts.

Third, include out-of-band testing. If a prompt is designed for financial analysis, pass in a question with missing variables and observe whether the model invents the missing information. If it is designed to summarise a credit memo, give it an incomplete memo and see whether it flags the gaps or produces a polished summary anyway.

These tests show whether the model understands the task boundary, rather than simply trying to complete the request at all costs.

AI Governance Implications

Reducing AI hallucination risk cannot stop at better prompt-writing. Before an LLM enters a live financial workflow, governance teams should be able to show which prompt was approved, which model was selected, what failure cases were tested, and how the model behaved when the safest answer was to refuse.

Conclusion: Better LLM Decisions Require Better Evidence

A reliable model must answer accurately, but it also needs to recognise when the available information is not enough.

This benchmark tested that behaviour directly. When the questions had no correct answer, the strongest models recognised the missing information and refused to invent what was never provided. The weaker models filled in the gaps, made assumptions, and produced answers that looked complete while resting on unsupported inputs.

For banking leaders, AI hallucination in finance is ultimately a question of evidence. Model choice, prompting, and use-case-specific testing all shape reliability.
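The truthful-prompt and out-of-band-testing steps recommended above can be sketched as a minimal harness. Everything here is a hypothetical illustration: `ask_model` is a placeholder for whatever LLM client an institution actually uses, and the exact refusal phrase is an assumed convention, not a standard.

```python
# Hypothetical system prompt implementing the "truthfulness and
# non-speculation" requirement described above.
TRUTHFUL_SYSTEM_PROMPT = (
    "Answer truthfully and only from the information provided. "
    "Do not speculate. If the information needed to answer is missing, "
    "reply exactly: INSUFFICIENT INFORMATION."
)

def ask_model(system_prompt: str, user_prompt: str) -> str:
    # Placeholder for a real LLM call (e.g. an HTTP request to the
    # institution's model endpoint). It always refuses here so the
    # out-of-band check below can run end to end.
    return "INSUFFICIENT INFORMATION"

def out_of_band_check(incomplete_prompt: str) -> bool:
    """Send a deliberately incomplete question and verify the model
    flags the gap instead of inventing the missing figures."""
    reply = ask_model(TRUTHFUL_SYSTEM_PROMPT, incomplete_prompt)
    return "INSUFFICIENT INFORMATION" in reply

# A financial-analysis probe with the key variable deliberately omitted:
probe = "Compute the debt service coverage ratio. Net operating income: [missing]."
print(out_of_band_check(probe))  # True only when the model refuses
```

Run against a real model, a False result from such a check is exactly the failure mode this benchmark measured: an answer produced where a refusal was the only defensible output.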
Public benchmarks can provide useful context, but they cannot replace internal evaluation against real financial scenarios.

JurisTech’s Hallucination Benchmark gives finance teams a clearer way to judge LLMs under missing information, where the most important answer may be the one the model refuses to give.

About JurisTech

JurisTech is a global lending and recovery solutions provider specialising in enterprise-class software for banks, financial institutions, insurance providers, automotive, and telecommunications companies across Malaysia, Southeast Asia, and beyond. Founded in 1997, JurisTech supports the full lending lifecycle, from digital onboarding and loan origination to credit decisioning, documentation, collections, recovery, and enterprise AI adoption.

Built on cloud-native, microservices, and composable architecture, JurisTech’s solutions help financial institutions modernise critical credit operations with greater speed, scalability, and control. JurisTech has been mentioned across multiple Gartner® reports, including as a Representative Provider for Lending Ecosystems, a Representative Vendor for Commercial Loan Origination Solutions, and a Sample Vendor for Commercial Banking Onboarding. JurisTech was also referenced in Gartner reports covering AI agents in loan orchestration and trade finance, predictive AI and synthetic data for lending risk assessment, and essential AI services for banking. Living by its motto, “360 AI Lending Tech | Fast. Proven. Secure.”, JurisTech is committed to redefining financial services to uplift lives, strengthen economies, and create lasting industry impact.

By John Lim | 14th May, 2026 | Featured, Insights

About the Author: John Lim is an award-winning technopreneur with many years of experience in software development. He is the co-founder and CTO of JurisTech.