When automatically generating test cases, there are three central problems:
- Do the automatically generated tests capture the intent of the programmer?
- How can input values be selected?
- How can the correct output be determined?
The answers to most of these questions are quite unsatisfying. Generated tests cannot infer programmer intent; they can only characterize the code as it is. The expected outputs will therefore simply be whatever the code currently produces, so existing mistakes in the code go undetected! This makes the approach useful for generating regression tests (e.g. prior to a refactoring), but not very useful as part of quality assurance.
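To illustrate, a generated characterization test might look like the following sketch (the `slugify` function is hypothetical, and the expected value is simply whatever the current implementation returns, not what the programmer intended):

```python
def slugify(title: str) -> str:
    # Hypothetical unit under test: current, possibly buggy, behaviour.
    return title.strip().lower().replace(" ", "-")

def test_slugify_characterization():
    # Generated regression test: the expected value was recorded from the
    # current implementation, so it pins the status quo rather than the intent.
    assert slugify("  Hello World ") == "hello-world"
```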
To derive interesting input values, consider two strategies.
First, you might consider symbolic execution of the unit under test. For example, when hitting a statement `if (x < 5) { ... }`, you get two input classes {x < 5, x ≥ 5} for which symbolic execution can continue separately. Afterwards, you can select representative values for each input class. Tricky aspects include dealing with the complex semantics of the target language, handling loops and recursion, handling relationships between variables, and coping with the exponential explosion of the state space. There is substantial academic literature on symbolic execution. As a simple strategy, consider sampling random paths through the control flow graph.
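As a toy illustration of the idea (not a real symbolic execution engine), the sketch below lists the path constraints of a small hypothetical function, derived by hand, together with one representative value per input class:

```python
def classify(x: int) -> str:
    # Hypothetical unit under test with three feasible paths.
    if x < 5:
        if x < 0:
            return "negative"
        return "small"
    return "large"

# Path constraints (obtained by manually "executing" classify symbolically)
# and one representative value chosen from each input class.
input_classes = [
    ("x < 5 and x < 0",   -1),
    ("x < 5 and x >= 0",   0),
    ("x >= 5",             5),
]

for constraint, representative in input_classes:
    print(f"{constraint:20} -> {classify(representative)}")
```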
Second, you could treat the unit under test as a black box and generate values without any feedback, for example purely at random. This is computationally expensive, but typically works quite well as long as the code doesn't have paths that only trigger for certain magic numbers. E.g. a branch like `if (x == 30752474) { ... }` is difficult to cover by chance. If coverage data can be used as feedback, this turns into a search/optimization problem. There is substantial academic literature on generating interesting inputs, for example using evolutionary algorithms. There are also existing tools to exercise the input space of some code, e.g. libraries like QuickCheck or fuzzing tools like AFL.
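The following hand-rolled sketch (not using any real fuzzing tool) contrasts the two: purely random inputs essentially never hit the magic-number branch, while a simple search that uses the distance to the comparison as a stand-in for coverage feedback finds it after a few thousand mutations. The constant, the distance metric, and the mutation scheme are all illustrative assumptions:

```python
import random

MAGIC = 30752474  # magic number guarding a hard-to-reach branch

def unit_under_test(x: int) -> bool:
    # Returns True only on the rarely taken branch.
    return x == MAGIC

def random_search(trials: int = 1_000_000) -> bool:
    # Black-box random generation: hitting one value out of 2**32 is very unlikely.
    return any(unit_under_test(random.randrange(2**32)) for _ in range(trials))

def guided_search(max_steps: int = 100_000) -> bool:
    # Treat |x - MAGIC| as feedback (a stand-in for branch-distance coverage)
    # and keep random mutations whenever they improve that feedback.
    x = random.randrange(2**32)
    for _ in range(max_steps):
        if unit_under_test(x):
            return True
        delta = (1 << random.randrange(32)) * random.choice([-1, 1])
        if abs(x + delta - MAGIC) < abs(x - MAGIC):
            x += delta
    return unit_under_test(x)

print("random search hit the branch:", random_search())   # almost always False
print("guided search hit the branch:", guided_search())   # almost always True
```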
As mentioned, the automatically generated tests will be largely useless for QA purposes since they can only verify that the code continues to do what it previously did. Things become more interesting if we define behaviour that should not occur, e.g. null pointer exceptions. Fuzz testing is a fairly brute-force approach for finding inputs that trigger such illegal states. With property-based testing (e.g. QuickCheck), the tester no longer provides specific examples to test, but states properties about the behaviour that should always hold. The testing library then uses a black-box strategy to generate values, with the goal of finding a counterexample to the stated property. For example, we might have a system under test and a property test as follows (pseudocode):
```
// doubles its argument
int SystemUnderTest(int a) { return a * 2; }

void Test(int a) { Assert(SystemUnderTest(a) / 2 == a); }
```
A property-based testing library that knows how to generate interesting values for an `int` type will quickly find counterexamples such as `a = int.MaxValue`.
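As a runnable sketch of the same idea, here is the property expressed with Hypothesis, a QuickCheck-style library for Python (chosen here only as an example; the 32-bit wrap-around is emulated explicitly because Python integers do not overflow):

```python
from hypothesis import given, strategies as st

def system_under_test(a: int) -> int:
    # Doubles its argument, emulating 32-bit signed wrap-around so that the
    # overflow behaviour of a fixed-width int is reproduced.
    return ((a * 2 + 2**31) % 2**32) - 2**31

@given(st.integers(min_value=-(2**31), max_value=2**31 - 1))
def test_double_then_halve(a: int) -> None:
    # Property: halving the doubled value yields the original argument.
    assert system_under_test(a) // 2 == a

test_double_then_halve()  # Hypothesis reports a shrunk counterexample (an overflowing a)
```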
In some cases, you might be able to heuristically infer the programmer's intent and then programmatically state interesting properties for which test inputs can be generated.
But in general, a cooperative approach between tooling and programmer might be more beneficial:
- make property-based testing easy
- use automated tools such as fuzzers to find example inputs that hit uncovered code
- if necessary, use automatically generated test cases to characterize the status quo behaviour of the software