How to Test an AI Chatbot for Accuracy and Reliability
Why testing AI chatbots is different
Traditional software either does what it's supposed to or it doesn't. An AI chatbot can produce a response that looks correct, reads naturally, and is completely wrong. Standard functional testing — "does the button work? does the form submit?" — simply isn't sufficient for AI products. You need a fundamentally different testing approach.
Our engineers tested a patient-facing ChatGPT system at Bupa, grounded against Bupa's own knowledge base. This guide captures what we learned about what to test and how.
The four dimensions of AI chatbot testing
1. Response accuracy
Accuracy testing for a chatbot means: does the response correctly answer the question, using the information available to the system?
For RAG systems (Retrieval-Augmented Generation — chatbots grounded against a document store or knowledge base), accuracy means: is the response derived from the source material, or is the model generating answers beyond what's in the knowledge base?
How to test accuracy:
- Build a query corpus — a large set of representative questions users are likely to ask
- For each query, identify the correct answer according to the knowledge base
- Run each query through the chatbot and compare the response to the expected answer
- Flag responses that are factually wrong, incomplete, or that draw on information outside the knowledge base
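The steps above can be sketched as a small harness. Everything here is illustrative: `AccuracyCase`, `required_facts`, and `forbidden_facts` are hypothetical names, and in practice the chatbot client and a richer comparison (semantic similarity, human review) would replace the simple phrase checks.

```python
# Sketch of an accuracy-check harness for a RAG chatbot.
# Assumes you have some client function ask_chatbot(query) -> str
# for the system under test (not shown here).
from dataclasses import dataclass, field


@dataclass
class AccuracyCase:
    query: str
    required_facts: list[str]            # phrases a correct answer must contain
    forbidden_facts: list[str] = field(default_factory=list)  # signals of out-of-KB content


def evaluate(case: AccuracyCase, response: str) -> dict:
    """Flag responses that miss required facts or leak content
    that should not be derivable from the knowledge base."""
    text = response.lower()
    missing = [f for f in case.required_facts if f.lower() not in text]
    leaked = [f for f in case.forbidden_facts if f.lower() in text]
    return {"pass": not missing and not leaked,
            "missing": missing,
            "leaked": leaked}
```

Running the whole query corpus is then a loop over `AccuracyCase` objects, with every failing result routed to a human reviewer rather than auto-failed, since phrase matching alone is a coarse filter.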
2. Hallucination detection
Hallucination is when an LLM generates plausible-sounding content that isn't supported by (or actively contradicts) its source material. For customer-facing applications, this is a critical failure mode.
Hallucination test techniques:
- Out-of-scope queries — ask about topics explicitly not in the knowledge base and verify the model declines or redirects gracefully
- Contradictory prompts — provide premises that conflict with the knowledge base and verify the model doesn't accept them
- Adversarial prompts — prompt injection attempts to override system instructions
- Source citation check — if the system cites sources, verify citations are accurate
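For the out-of-scope technique in particular, a crude but useful automated check is to look for refusal or redirection language in the response. This is a minimal sketch under an obvious assumption: the `REFUSAL_MARKERS` list is hypothetical and would need tuning to your system's actual refusal phrasing, and a confident-sounding answer to an out-of-scope query is exactly the hallucination signal you want surfaced for review.

```python
# Heuristic check: does the chatbot decline or redirect gracefully
# when asked about a topic outside the knowledge base?
# REFUSAL_MARKERS is an assumed, illustrative list.
REFUSAL_MARKERS = [
    "i don't have information",
    "i'm not able to answer",
    "outside the scope",
    "please contact",
]


def declines_gracefully(response: str) -> bool:
    """Return True if the response contains any known refusal phrasing."""
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)
```

An out-of-scope query whose response returns `False` here is a candidate hallucination: the model answered something it had no grounding for.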
3. Latency and performance
A patient-facing or customer-facing chatbot must respond within a threshold that feels natural. Our Bupa work identified that responses taking more than 8–10 seconds create significant user experience friction for healthcare queries where users are often anxious.
What to measure:
- Time to first token (TTFT) — how quickly does the response begin streaming?
- Total response time for typical queries
- Latency under concurrent load — does it degrade with 10 simultaneous users? 50?
- Latency for complex multi-turn conversations vs. single-turn queries
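Time to first token and total response time can both be captured with one wrapper around a streaming response. This sketch assumes only that the client exposes the response as an iterator of text chunks (most streaming chat APIs do); the function and field names are illustrative.

```python
import time
from typing import Iterator


def measure_latency(stream: Iterator[str]) -> dict:
    """Consume a streaming chatbot response, recording time to first
    token (TTFT) and total response time in seconds."""
    start = time.perf_counter()
    ttft = None
    chunks = []
    for chunk in stream:
        if ttft is None:
            ttft = time.perf_counter() - start  # first chunk arrived
        chunks.append(chunk)
    return {
        "ttft_s": ttft,
        "total_s": time.perf_counter() - start,
        "text": "".join(chunks),
    }
```

For the concurrent-load question, the same wrapper can be driven from a thread pool or an async task group at 10 and 50 simultaneous users, comparing the TTFT distributions between runs.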
4. UI and user experience
The UI wrapping an AI chatbot has its own quality concerns beyond the model's accuracy:
- Is the chat interface accessible? (keyboard navigation, screen reader support, sufficient contrast)
- How does the UI handle very long responses? Does it overflow or display correctly?
- Error states: what happens when the API is slow or returns an error?
- Is it clear to users they're talking to an AI?
- Can users easily restart the conversation or access human support?
Building a regression prompt library
One of the most valuable things you can build for an AI product is a curated prompt library — a set of queries that represent critical use cases, known edge cases, and previously discovered failure modes. Run this library against every new model version or knowledge base update. Any response that changes should be reviewed for quality regression.
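A minimal way to implement the "any response that changes should be reviewed" rule is to fingerprint each baseline response and diff a new run against it. This is a sketch, not a full framework: the normalisation step (collapsing whitespace and case) is an assumed choice to avoid flagging trivial formatting drift, and real pipelines often add semantic comparison on top.

```python
import hashlib


def response_fingerprint(text: str) -> str:
    """Hash a response after normalising whitespace and case, so
    trivial formatting changes don't trigger a regression flag."""
    normalised = " ".join(text.split()).lower()
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()[:12]


def diff_runs(baseline: dict[str, str], current: dict[str, str]) -> list[str]:
    """Return the prompt IDs whose responses changed since the
    baseline run; these go to a human for quality review."""
    return [
        prompt_id
        for prompt_id, response in current.items()
        if response_fingerprint(response)
        != response_fingerprint(baseline.get(prompt_id, ""))
    ]
```

After each model or knowledge base update, `diff_runs` narrows the review burden from the whole library down to just the prompts whose behaviour actually moved.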
RedQA's AI/LLM Testing service includes building this prompt library as part of the engagement. See how we work, or get in touch for a consultation.