Statistical significance: Is my new bot really better than the previous one?
Open Source Your Knowledge, Become a Contributor
Technology knowledge has to be shared and made accessible for free. Join the movement.
A test statistic (generically named T) indirectly measures the unlikeliness that some observed result can be explained by random deviation from a simple reference case where nothing special happens (examples: The coin is fair, the dice is fair, the two bots are as good as each other). In the use scope of this paper, the test statistic we’ll calculate is named , however we’ll use a signed version of it.
Use scope
- The formulas below are for comparing two bots only: Games with more than two bots (1 vs. 1 vs. 1 for instance) are not covered here.
- The formulas allows for draws, and you don’t need to know how the game behaves for a match between two identical bots.
Formula of statistical significance
with:
- W the number of matches won by the new bot,
- D the number of drawn matches,
- L the number of matches lost by the new bot,
- N = W + D + L the number of matches played.
Proof
A test statistic compares actual observation with what we would expect if nothing special happens: That expected outcome is the null hypothesis H₀, and T = 0 if the actual observation exactly fits H₀. As the outcomes to compare are triplets (values being labeled respectively “wins”, “draws” and “losses” for the new bot) for which we increment one of the values at each match result, we’re comparing trinomial distributions. We’ll thus use Pearson’s χ² test of goodness of fit, which is an asymptotic test statistic, not exact but precise enough for our need (calculating an exact test statistic would cost a lot of computing resources for a slight and useless gain in precision).
In our case the null hypothesis is that the two bots are as good as each other (if not identical), so the expected distribution is an equal number of wins and losses; The expected number of draws is undefined because it depends on the game rules and on the bot (especially if it does not play perfectly), this is a free parameter for H₀.
So, for a given number N of matches played, the actual observation is a (W, D, L) triplet while the expected outcome is a (α, N−2α, α) triplet with α ∈ [0, N/2] (otherwise the triplet would contain negative values).
We have 2 degrees of freedom for the actual observation (as N = W + D + L is fixed, we can choose freely only two of the three values) and 1 free parameter for H₀ (namely α), so the remaining number of degrees of freedom for the test statistic is 2 − 1 = 1: Pearson’s χ² test statistic is thus
Whatever its degrees of freedom, Pearson’s χ² test statistic is:
with Oᵢ actual observations and Eᵢ expected values. So, in our case:
The free parameter α of H₀ is estimated by minimizing
We have an extremum when the first derivative vanishes:
As we want α₀ ∈ [0, N/2], N−2α₀ and α₀ are both positive, so we use the positive root:
As we have taken the positive root, N−2α₀ and α₀ are either both positive or both negative, but that latter case is impossible as N > 0, so α₀ ∈ [0, N/2] as expected.
so
A χ² test statistic tells only how far we are from H₀, without a notion of direction: Its value is the same by exchanging W and L. As we have only 1 degree of freedom, we can introduce a direction by adding a sign to the test statistic (Note: It has no meaning for more than 1 d.f.). As we want to evaluate the new bot against the old one, we want the test statistic to be positive if W > L and negative if W < L (given H₀, it is already 0 if W = L):
Note: You don’t need to bother about sign(0), as the right-hand multiplier’s value is 0 if W = L.
So you can simply implement the sign function as sign(W, L) := W > L ? 1 : -1
.
Numerical interpretation
Below is a table giving the relation between the absolute value of T and the invert of the asymptotic one-tailed p-value, i.e., of the probability p₁ that mere luck could have given the observed result or something more extreme, for some values of T. Graphs above show that p₁ is not about “better than” but about “more extreme than”, i.e., “farther from 0”, hence the flip when W < L. Values for 1/p₁ are rounded to 2 significant digits only: The test statistic is asymptotic so the probability is not exact, and the precise probability is not important, its order of magnitude is enough for a good feeling of what it represents.
|T| | 1/p₁ |
---|---|
0 | 2 |
10 | 1300 |
20 | 260ᴇ3 |
30 | 46ᴇ6 |
40 | 7.9ᴇ9 |
50 | 1.3ᴇ12 |
60 | 210ᴇ12 |
70 | 34ᴇ15 |
80 | 5.3ᴇ18 |
90 | 840ᴇ18 |
100 | 130ᴇ21 |
Proof
|T| is
Note: The symbol “
By definition of Pearson’s χ² test statistic with 1 d.f. the asymptotic two-tailed p-value p₂ for |T| verifies:
with “erf” the error function. So
with “erfc” the complementary error function. As p₁ = p₂/2, we finally have 1/p₁ = 2/erfc(√(|T|/2)).
For instance, as the world population will soon reach 8 billion people, we can expect that, at any given time, someone on the planet is reaching a statistical significance of |T| ≈ 40 for some fair random process: Probably someone else, but possibly you while measuring your new bot performance (very unlikely, but still reasonably possible).
Formula of win rate
There are several ways to define a win rate taking drawn matches into account, the simplest one is to count draws as half wins and half losses:
Practical use
Use a run length limit of 5000 to 10,000 matches (depending on the time it would take) and a threshold for |T| around 50:
- no less than 40, as 1/p₁ would be below the world population,
- no more than 60, as the statistical significance is high enough.
Once the length limit or the threshold is reached, launch a new independent run in order to verify that the result is similar. Note: in case of win rate being precisely 50 % (identical bots) or extremely close to that, T will vary randomly in the [−10, 10] range (with possible temporary pikes outside that range), so the first run may end with T ≈ 10 when the length limit is reached (after a lot of movements in the range) and the second run with T ≈ −10, they will nevertheless be “similar” in the sense that |T| does not increase visibly on average.