    JurisTech's 2026 LLM Benchmark for AI Hallucination in Finance

    Last updated: 14/05/2026

    Loan applications often contain errors, omissions, or inconsistent information. The key question is whether AI models used to assess these submissions will recognise those gaps or make optimistic assumptions when caution is needed. To test this, JurisTech evaluated GPT-5, Gemini, Claude, Grok, Kimi, GLM, and Mimo using a benchmark designed to measure AI hallucination in finance.

    Why AI Hallucination in Finance Matters

    For banks and lenders exploring large language models (LLMs), reliability involves more than asking the AI to review a company’s credit rating. If a company applying for a loan claims profitability but its bank statements do not support that claim, the AI should not take the application at face value.

    JurisTech tested this by crafting a set of innocent-looking multiple-choice questions. The opening questions have genuine correct answers; the later questions are plausible but invalid, with every available answer choice wrong.

    The aim is to test the gullibility of the LLM. We do not want the model to be merely helpful; we want it to show a strict attitude towards truth, consistent with the fiduciary duty of the workflows it supports.

    This benchmark builds on JurisTech’s earlier test on LLM performance in financial analysis, where models analysed a complex financial report with certain data masked. While that earlier evaluation focused on analytical accuracy under constrained conditions, this latest benchmark looks at the other side of reliability. A reliable model needs to know how to answer, but it also needs to know when to stop.

    Key Findings From JurisTech’s 2026 LLM Hallucination Benchmark

    • Gemini 3.1 Pro Preview High was the most consistently reliable model, achieving a Very Good rating under both prompt conditions.
    • GPT-5.4 High showed the strongest improvement under the truthful prompt, moving from Fair to Very Good.
    • GPT-5.5 High and Grok 4.20 High were generally cautious, though less consistent or less precise than the strongest performers.
    • Kimi K2.6 High and Claude Opus 4.7 High improved under the truthful prompt, but still showed weaknesses in certain cases.
    • GLM 5.1 High and Mimo V2.5 Pro High were the least reliable in this benchmark, with Mimo V2.5 Pro rated Very Bad under both prompt conditions.
    • Prompting helped some models, but not all. Better instructions reduced hallucination risk only when the model could follow them reliably.

    The broader finding is clear: LLM reliability cannot be judged by model choice alone. The prompt, workflow, and test conditions all affect whether a model refuses unsupported answers or fills in what was never provided.

    What JurisTech Tested

    JurisTech created a 45-question multiple-choice benchmark with three sections.

    Section             Questions  Purpose
    General knowledge   15         Questions with real, correct answers
    Science             15         Questions with no correct answer
    Finance             15         Questions with no correct answer

    The general knowledge questions came first. The science and finance questions came after, and every answer choice in those two sections was wrong. The same 45 questions were tested under a neutral prompt and a truthful prompt. No retrieval augmentation, agentic workflow, or external tools were used, so the comparison focused on the model’s own response behaviour.
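A benchmark of this shape can be scored by counting correct answers on the valid section separately from correct refusals on the invalid sections. The sketch below is a hypothetical illustration of that scoring logic; the question data, the `REFUSAL` marker, and the `score` helper are assumptions for this example, not JurisTech's actual harness.

```python
# Hypothetical scoring sketch for a mixed valid/invalid question benchmark.
# Question data and the refusal marker are illustrative assumptions.

REFUSAL = "NONE"  # what a model should return when no option is correct

questions = [
    # (section, has_correct_answer, correct_option_or_None)
    ("general", True, "B"),
    ("science", False, None),   # every option is wrong by design
    ("finance", False, None),
]

def score(answers):
    """Count correct answers and correct refusals separately."""
    correct = refused = hallucinated = 0
    for (section, has_answer, key), given in zip(questions, answers):
        if has_answer:
            correct += given == key
        elif given == REFUSAL:
            refused += 1        # the desired behaviour on invalid questions
        else:
            hallucinated += 1   # the model forced an unsupported answer
    return correct, refused, hallucinated

print(score(["B", "NONE", "C"]))  # → (1, 1, 1)
```

Keeping the three tallies separate matters: a model that refuses everything would look safe on the invalid sections while failing the general knowledge section, so both numbers are needed to judge reliability.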

    For the full methodology, prompt screenshots, model council setup, scoring details, and model-by-model findings, read the JurisTech 2026 LLM Hallucination Benchmark Research Note.

    Benchmark Results

    Eight models completed both test conditions.

    LLM                          Neutral prompt  Truthful prompt
    gemini-3.1-pro-preview-high  Very Good       Very Good
    gpt-5.5-high                 Very Good       Good
    grok-4.20-high               Good            Good
    gpt-5.4-high                 Fair            Very Good
    kimi-k2.6-high               Fair            Good
    claude-opus-4.7-high         Bad             Fair
    glm-5.1-high                 Bad             Poor
    mimo-v2.5-pro-high           Very Bad        Very Bad

    The ratings measure how reliably each model recognised missing information, rejected invalid questions, and avoided inventing values. Very Good reflects the strongest refusal behaviour. Good and Fair indicate more mixed results. Bad, Poor, and Very Bad indicate increasing reliance on unsupported assumptions.

    What the Results Show

    The strongest models treated missing information as a boundary, rather than a gap to fill.

    Gemini 3.1 Pro Preview High was rated Very Good under both prompts, making it the most consistently reliable model in this benchmark. GPT-5.4 High showed the strongest prompt-sensitive improvement, moving from Fair under the neutral prompt to Very Good under the truthful prompt.

    GPT-5.5 High and Grok 4.20 High were generally cautious, though less consistent or less precise than the strongest performers. Kimi K2.6 High and Claude Opus 4.7 High improved under the truthful prompt, but did not reach the top tier.

    The weakest results came from GLM 5.1 High and Mimo V2.5 Pro High. GLM 5.1 High became less reliable under the truthful prompt, while Mimo V2.5 Pro High remained Very Bad under both prompt conditions.

    Across the results, the differences were behavioural as much as technical. Some models were cautious by default. Some became safer when prompted clearly. Others continued producing answers even after acknowledging that information was missing. Those differences become far more important once a model is placed inside a real financial workflow.

    Why Prompting Changed the Results

    The truthful prompt exposed an important control point. Some models became more cautious when instructed to answer truthfully and honestly. They were more likely to flag missing information and less likely to force an answer when the question could not be answered as stated.

    In financial workflows, a neutral prompt can leave the model focused on completion. It may try to pick an answer, satisfy the format, and move on. A truthful prompt gives the model a stronger instruction to stay within the evidence.
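The two prompt conditions can be pictured as a pair of system prompts wrapped around the same question. The wording below is an assumption for illustration, not the exact prompts used in the benchmark, and `build_messages` is a hypothetical helper.

```python
# Illustrative neutral vs truthful prompt pair; the wording is an
# assumption, not the benchmark's actual prompts.

NEUTRAL_PROMPT = "Answer the following multiple-choice question."

TRUTHFUL_PROMPT = (
    "Answer truthfully and honestly. If none of the answer choices is "
    "correct, or if required information is missing, say so instead of "
    "choosing an option."
)

def build_messages(system_prompt, question):
    """Assemble a chat-style payload for one prompt condition."""
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": question},
    ]
```

Running the same 45 questions through both payloads, with everything else held constant, is what isolates the effect of the instruction itself.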

    The improvement was not universal. GLM 5.1 became less reliable under the truthful prompt, while Mimo V2.5 Pro remained Very Bad under both conditions. Better instructions only help when the model can follow them reliably.

    Why Public Benchmarks Need Context

    Public hallucination benchmarks are useful, but they do not always predict how a model will behave in a specific financial workflow.

    Artificial Analysis’s AA-Omniscience Hallucination Rate ranked Mimo V2.5 Pro as having the second-lowest hallucination rate. In JurisTech’s test, the same model ranked last.

    That gap shows how much benchmark design can influence the result. Public benchmarks can help with shortlisting, but the real test is how the model behaves on the institution’s own prompts, documents, edge cases, and failure modes.

    How Banks Can Reduce AI Hallucination in Finance

    From this latest benchmark, three practical steps stand out.

    First, require truthfulness and non-speculation in prompts. The system prompt should instruct the LLM to answer truthfully, avoid speculation, and refuse when information is missing.

    Second, test models against your own use cases. A model that looks strong in a public benchmark may behave differently when tested against real financial documents, incomplete data, or workflow-specific prompts.

    Third, include out-of-band testing. If a prompt is designed for financial analysis, pass in a question with missing variables and observe whether the model invents the missing information. If it is designed to summarise a credit memo, give it an incomplete memo and see whether it flags the gaps or produces a polished summary anyway.
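An out-of-band check like this can be automated with a simple pass/fail assertion. The sketch below assumes a generic model client exists elsewhere; the memo text, the gap markers, and the `flags_gaps` heuristic are all illustrative assumptions, and a production harness would use a more robust check than keyword matching.

```python
# Minimal out-of-band test sketch: feed an incomplete credit memo and
# check whether the response acknowledges the gaps. The memo content
# and the keyword heuristic are illustrative assumptions.

INCOMPLETE_MEMO = """Borrower: Acme Sdn Bhd
Requested amount: RM 2,000,000
Revenue (FY2025): [MISSING]
Debt service coverage ratio: [MISSING]"""

GAP_MARKERS = ("missing", "not provided", "cannot", "insufficient")

def flags_gaps(response: str) -> bool:
    """Pass if the model acknowledges the gaps rather than summarising anyway."""
    lowered = response.lower()
    return any(marker in lowered for marker in GAP_MARKERS)

# A response that flags the gaps passes; a polished-but-unsupported
# summary fails.
assert flags_gaps("Revenue is not provided, so DSCR cannot be computed.")
assert not flags_gaps("Acme shows strong revenue and healthy coverage.")
```

The point of the test is the boundary, not the wording: any response that produces a confident assessment from the incomplete memo should fail, whatever phrasing detector is used.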

    These tests show whether the model understands the task boundary, rather than simply trying to complete the request at all costs.

    AI Governance Implications

    Reducing AI hallucination risk cannot stop at better prompt-writing. Before an LLM enters a live financial workflow, governance teams should be able to show which prompt was approved, which model was selected, what failure cases were tested, and how the model behaved when the safest answer was to refuse.

    Conclusion: Better LLM Decisions Require Better Evidence

    A reliable model must answer accurately, but it also needs to recognise when the available information is not enough.

    This benchmark tested that behaviour directly. When the questions had no correct answer, the strongest models recognised the missing information and refused to invent what was never provided. The weaker models filled in the gaps, made assumptions, and produced answers that looked complete while resting on unsupported inputs.

    For banking leaders, AI hallucination in finance is ultimately a question of evidence. Model choice, prompting, and use-case-specific testing all shape reliability. Public benchmarks can provide useful context, but they cannot replace internal evaluation against real financial scenarios.

    JurisTech’s Hallucination Benchmark gives finance teams a clearer way to judge LLMs under missing information, where the most important answer may be the one the model refuses to give.

    About JurisTech

    JurisTech is a global lending and recovery solutions provider specialising in enterprise-class software for banks, financial institutions, insurance providers, automotive, and telecommunications companies across Malaysia, Southeast Asia, and beyond. Founded in 1997, JurisTech supports the full lending lifecycle, from digital onboarding and loan origination to credit decisioning, documentation, collections, recovery, and enterprise AI adoption.

    Built on cloud-native, microservices, and composable architecture, JurisTech’s solutions help financial institutions modernise critical credit operations with greater speed, scalability, and control. JurisTech has been mentioned across multiple Gartner® reports, including as a Representative Provider for Lending Ecosystems, a Representative Vendor for Commercial Loan Origination Solutions, and a Sample Vendor for Commercial Banking Onboarding. JurisTech was also referenced in Gartner reports covering AI agents in loan orchestration and trade finance, predictive AI and synthetic data for lending risk assessment, and essential AI services for banking. Living by its motto, “360 AI Lending Tech | Fast. Proven. Secure.”, JurisTech is committed to redefining financial services to uplift lives, strengthen economies, and create lasting industry impact.

    About the Author:

    John is an award-winning technopreneur with many years of experience in software development. He is the co-founder and CTO of JurisTech.