Automate Search Relevance Testing

When it comes to search, traditional testing strategies should be adjusted to fit the unique conditions of relevance testing. If your dev team has invested in automated testing, it may seem fairly straightforward to use the same techniques to test search relevance. Sadly, this isn’t a successful strategy. I’ll explain why, but first let’s review traditional approaches to testing.

What’s the difference between a unit test and a regression test? Both are types of automated tests, as opposed to manual tests. A manual test requires a human to do the testing, either in an ad-hoc, exploratory, or “smoke” test (where there’s smoke, there’s probably fire), or by running through a set of predefined steps and checking off the pass/fail result of each step. Manual testing is very common, and exploratory testing is a necessary component of software testing. But if you’ve ever done any manual testing yourself, you know it is tedious and prone to error. Why? Because humans are good at creative tasks, but we lose focus and make mistakes when the work is boring and repetitive. Computers, by contrast, excel at boring, repetitive tasks. So why not harness them to do that work for us? That’s the essence of automated testing. If you can define the tests in the form of a test script (a set of steps to follow), then a computer is better suited to run through that script over and over again, checking for errors whenever the system changes.

Unit Tests

Unit tests and regression tests are common in software development. Unit tests focus on individual pieces (units) of code rather than on what the end-user might experience. A unit test is a quick, automated test that a developer runs frequently as the code changes to ensure that every unit is behaving as expected. Unit tests can also be used as a sort of scaffold for building the code itself. This is known as test-driven development: the tests are created first, almost like code-level acceptance criteria. In any case, unit tests are code.
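To make “unit tests are code” concrete, here is a minimal sketch using Python’s standard unittest module. The function normalize_query is a hypothetical unit invented for this illustration; it is not taken from any particular search library.

```python
import unittest


def normalize_query(text: str) -> str:
    """Hypothetical unit under test: lowercase a query and collapse whitespace."""
    return " ".join(text.lower().split())


class NormalizeQueryTest(unittest.TestCase):
    """Each method checks one small, deterministic behavior of the unit."""

    def test_lowercases_and_collapses_whitespace(self):
        self.assertEqual(normalize_query("  Red   SHOES "), "red shoes")

    def test_empty_query_stays_empty(self):
        self.assertEqual(normalize_query(""), "")
```

Saved in a file, these tests can be run with python -m unittest. The same input always produces the same output, which is what makes this kind of test easy to automate.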

Regression Tests

Regression tests aren’t a particular type of test so much as a suite of tests used to verify that changes to the code do not have unforeseen side-effects. It’s common when changing complex systems that a part of the system that was working is broken by a seemingly unrelated change. Regression testing is the process of revisiting every test that was working to make sure it is still working. Unit tests are regression tests insofar as they are used to double-check that everything in the code still works.

Other Tests

Other tests get closer to the end-user’s point of view. Integration tests verify the points of interaction between parts of a larger system. These integration points could be between different modules of code, or across a network boundary such as when a front-end communicates with a back-end. End-to-end tests mimic an actual user interacting with the system. They typically rely on automation tools that can record paths through the system that a user might take (e.g., a checkout process or sign-up). Load tests measure the performance of the system under different loads to ensure that its capacity meets expected demand.

Search Relevance Testing

All these automated tests are useful when building software systems, but with search, what to test and how to test it gets less clear. Unlike most software, a search system does not behave in a clearly deterministic way. The goal of a search system is to find and order a set of results based on some criteria. The first requirement of an automated test is repeatability: the test must produce the same output (given the same input) every time. If you don’t have repeatable test outcomes, then you can’t automate. Search results, on the other hand, change based on what is in the index. As the index changes, the scores change too, because scoring is based on statistics and probability.

So what can you do with a probabilistic outcome such as you’d get from search? First, you have to have a stable index. A stable index means that you are not adding new documents, that the documents are indexed in the same order, and that your server configuration doesn’t change, because the way documents are scored is affected by how many servers you are using. Second, you will need a set of test queries and a rating system for the results of each query.
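One way to represent a set of test queries and their ratings is sketched below. The class name RatedQuery, the 0 to 3 grading scale, and the document ids are all assumptions made for illustration; any consistent rating scheme would serve.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass
class RatedQuery:
    """A test query plus human ratings for documents it might return."""

    query: str
    # Document id -> grade on an assumed scale: 0 (irrelevant) to 3 (ideal).
    ratings: Dict[str, int]

    def relevant_ids(self, min_grade: int = 2) -> Set[str]:
        """Ids whose grade is at or above the chosen relevance cutoff."""
        return {doc for doc, grade in self.ratings.items() if grade >= min_grade}


# Hypothetical test collection; real ratings would come from human judges.
TEST_COLLECTION: List[RatedQuery] = [
    RatedQuery("red running shoes", {"sku-101": 3, "sku-205": 2, "sku-330": 0}),
    RatedQuery("blue hat", {"sku-410": 3, "sku-411": 1}),
]

for rated in TEST_COLLECTION:
    print(rated.query, sorted(rated.relevant_ids()))
```

Keeping the ratings separate from the search configuration means the same collection can be reused to score every variation of the search you try.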

Once you have that, you can automate testing by enumerating your test queries and checking that the results came back in the order you’d expect. But here is where it gets tricky. Because the goal of search is relevant results, you often won’t be changing code; you will be changing the way searches are performed, by configuring your searches differently through the API of the search engine. As you change your search algorithm, your results will absolutely change in ways that are almost impossible to anticipate.
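The exact-order check described above might look like the following sketch. The functions search_v1 and search_v2 are hypothetical stand-ins for the same search before and after a tuning change, and they return canned results rather than calling a real engine.

```python
from typing import Callable, Dict, List


def check_exact_order(
    search: Callable[[str], List[str]],
    expected: Dict[str, List[str]],
) -> Dict[str, bool]:
    """Map each query to whether its top results match the expected order."""
    outcome = {}
    for query, expected_ids in expected.items():
        top = search(query)[: len(expected_ids)]
        outcome[query] = top == expected_ids
    return outcome


# Hypothetical expectations: query -> document ids in the order we expect.
EXPECTED = {"red shoes": ["sku-101", "sku-205"], "blue hat": ["sku-410"]}


def search_v1(query: str) -> List[str]:
    """Canned results standing in for the search before tuning."""
    return {
        "red shoes": ["sku-101", "sku-205", "sku-330"],
        "blue hat": ["sku-410", "sku-411"],
    }[query]


def search_v2(query: str) -> List[str]:
    """Canned results after tuning: the top two 'red shoes' hits swapped."""
    return {
        "red shoes": ["sku-205", "sku-101", "sku-330"],
        "blue hat": ["sku-410", "sku-411"],
    }[query]


print(check_exact_order(search_v1, EXPECTED))
print(check_exact_order(search_v2, EXPECTED))
```

In this sketch, swapping two results that may both be perfectly relevant is enough to turn a passing query into a failing one, which shows how sensitive position-by-position checks are to any change in ranking.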

The way to automate testing for search results, therefore, is not to use a deterministic approach where each query is expected to have each result in a predetermined position. Rather, it’s more effective and useful to measure your search results in aggregate and to set thresholds above which the tests “pass” and below which they “fail”. For instance, you can use a common search metric like MRR (Mean Reciprocal Rank), which is computed over all the queries in your test collection: for each query, take 1 divided by the position of the first relevant result, then average those values. As you improve your search algorithm, MRR should rise as well. You can set a minimum MRR and monitor it during testing. If your changes adversely affect MRR, you’ll know by measuring it. If that effect is enough to cross your threshold, your testing framework can catch it and “fail” the test.
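A threshold test on MRR can be sketched as follows. For each query, the reciprocal rank is 1 divided by the position of the first relevant result (or 0 if none appears), and MRR is the average over all queries. The judgments, the canned results in fake_search, and the MIN_MRR value of 0.6 are all assumptions chosen for illustration; in practice the search function would call your engine and the threshold would come from your own baseline.

```python
from typing import Callable, Dict, List, Set


def reciprocal_rank(ranked_ids: List[str], relevant: Set[str]) -> float:
    """Return 1/rank of the first relevant result, or 0.0 if none is found."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(
    search: Callable[[str], List[str]],
    judgments: Dict[str, Set[str]],
) -> float:
    """Average the reciprocal rank over every query in the test collection."""
    scores = [reciprocal_rank(search(q), rel) for q, rel in judgments.items()]
    return sum(scores) / len(scores)


# Hypothetical test collection: query -> ids judged relevant.
JUDGMENTS = {
    "red shoes": {"a"},
    "blue hat": {"y"},
    "green scarf": {"r"},
}

# Canned rankings standing in for a real search engine call.
CANNED_RESULTS = {
    "red shoes": ["a", "b", "c"],    # first relevant at rank 1
    "blue hat": ["x", "y", "z"],     # first relevant at rank 2
    "green scarf": ["p", "q", "r"],  # first relevant at rank 3
}


def fake_search(query: str) -> List[str]:
    return CANNED_RESULTS[query]


MIN_MRR = 0.6  # assumed threshold for this sketch


def test_mrr_stays_above_threshold():
    """Fails only when aggregate quality drops below the threshold."""
    mrr = mean_reciprocal_rank(fake_search, JUDGMENTS)
    assert mrr >= MIN_MRR, f"MRR {mrr:.3f} fell below {MIN_MRR}"
```

With these canned results the MRR is (1 + 1/2 + 1/3) / 3, roughly 0.611, so the test passes. An individual result can move without failing the test; only a drop in the aggregate score below the threshold does.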

If this sounds counterintuitive, that’s because it is. Working with non-deterministic systems is often new ground for programmers and devops teams. Search systems are special types of databases that require a different approach to testing.