Testing LLM-based systems goes well beyond traditional software QA. Unlike typical rule-based systems, LLMs are probabilistic, generative, and non-deterministic. Therefore we apply structured, purpose-driven QA strategies to ensure that AI systems are not only functional, but reliable, safe, and aligned with user expectations.
For automating regression tests, we tend to integrate other LLMs (e.g. models from OpenAI) and use them as a benchmark during the verification process. To automatically verify that an LLM is still functioning as expected, we use techniques such as relevancy checks and similarity checks.
In a relevancy check, we take an input prompt, run it through the LLM we're testing, and also send the same prompt to another trusted and consistent model (OpenAI, Mistral, etc.) via an API call. We then pass both outputs to the benchmark LLM and, using a special prompt, ask it to measure a "temperature", i.e. how significant the difference between the two outputs is. This is an effective and reliable strategy for automatically catching regressions where the LLM's output misses the point or hallucinates unexpectedly.
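The sketch below illustrates one way a relevancy check could be wired up, assuming an OpenAI-compatible benchmark model. The model name, judge prompt wording, score threshold, and the `query_system_under_test` helper are all hypothetical placeholders, not a definitive implementation.

```python
# Minimal sketch of a relevancy check, assuming an OpenAI-compatible API for
# the benchmark model. `query_system_under_test` (hypothetical) would call
# the LLM we are actually testing.

from openai import OpenAI

client = OpenAI()  # benchmark model client; reads OPENAI_API_KEY from the environment

JUDGE_PROMPT = """You compare two answers to the same question.
Rate how significantly they differ, from 0.0 (same meaning) to 1.0
(completely different or contradictory). Reply with the number only."""


def benchmark_answer(prompt: str, model: str = "gpt-4o") -> str:
    """Generate a reference answer from the trusted benchmark model."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content


def relevancy_score(prompt: str, actual_output: str, model: str = "gpt-4o") -> float:
    """Ask the benchmark model for a "temperature": how far the tested LLM's
    output drifts from the benchmark model's own answer to the same prompt."""
    reference = benchmark_answer(prompt, model)
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": JUDGE_PROMPT},
            {
                "role": "user",
                "content": f"Question: {prompt}\n\nAnswer A: {actual_output}\n\nAnswer B: {reference}",
            },
        ],
    )
    return float(response.choices[0].message.content.strip())


# Example regression assertion (threshold is an assumption to tune per project):
# actual = query_system_under_test("Summarize our refund policy.")
# assert relevancy_score("Summarize our refund policy.", actual) < 0.3
```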
In a similarity check, we use almost the same strategy, but the key difference is that we do not generate an output from the benchmark LLM. Instead, we pass it a pre-defined expected output as a simple string and ask it to measure the temperature between that expected string and the actual output of the LLM we are testing.
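A similarity check could then look like the sketch below, reusing the `client` and `JUDGE_PROMPT` from the previous example. Since the expected output is a pre-defined string, no reference answer is generated; the benchmark model only scores the difference. Again, the threshold and helper names are assumptions.

```python
def similarity_score(expected_output: str, actual_output: str, model: str = "gpt-4o") -> float:
    """Score the "temperature" between a pre-defined expected output and the
    actual output of the LLM under test, using the benchmark model as judge."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": JUDGE_PROMPT},
            {
                "role": "user",
                "content": f"Answer A: {expected_output}\n\nAnswer B: {actual_output}",
            },
        ],
    )
    return float(response.choices[0].message.content.strip())


# Example regression assertion (threshold is an assumption to tune per project):
# actual = query_system_under_test("Which currencies do you support?")
# assert similarity_score("We support EUR, USD and GBP.", actual) < 0.3
```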
These strategies allow us to write reliable and effective automated regression tests in spite of LLMs' non-deterministic and probabilistic nature.