Z Score Definition:
The z-score is a statistical measure of how far a value lies from the mean of a distribution. It is expressed as the number of standard deviations a test result is from the mean in the standard normal distribution. The larger the absolute z-score, the lower the probability that the two samples have been drawn from the same population, and the more likely the difference between them is statistically significant.
When you standardise a score (subtract the mean and divide by the standard deviation) the mean is always 0 and the standard deviation is 1. Although a z-score can in principle take any value, almost all values fall between -3 standard deviations (on the left of the normal distribution curve) and +3 standard deviations (on the right side of the normal distribution curve).
To use a z-score for a statistical test you will often be asked for the mean and the standard deviation of the population distribution. In practice you may have to use the sample mean and standard deviation instead if the statistics for the population are unknown.
The z score table above, also known as a standard normal distribution table, shows the area between the mean (0) and the z-score. This allows us to estimate the area within a certain number of standard deviations, which we use when calculating a confidence level or interval.
To estimate the percentage of the area which falls within a certain number of standard deviations of the mean (±) we need to multiply the table value by 2, because the above table only shows figures for one side of the normal distribution. If we want to know the percentage below (to the left of) a positive z-score for the whole normal distribution we need to add 0.5 to the table value.
When you see a z score table with values starting at 0.500 (see below in section 1 – How is the z score used), this is a positive z score table. The values in the positive table have been adjusted to represent the percentage of the distribution below (to the left of) a positive z score, i.e. everything from the far left tail up to the score. We will use this kind of z score table later to estimate the values to the left of our test score.
1. How is the Z Score Used?
The z-score and table allow you to compare the results of a test or survey to a “normal” population (i.e. the mean). To justify assuming a normal distribution your sample needs to be drawn randomly, to ensure it is representative, and it needs to be large (i.e. over 100 as an absolute minimum). The z-score indicates how your test result compares to the population’s mean for the success metric.
Consider we have a test result which generates a z score of 1.59. To find the percentage value to the left of a positive z score we use the table below which uses decimal figures to display the percentage of the area below the test score.
Look down the left-hand column to find the z-score to one decimal place (1.5) and then look along the top of the table for the second decimal place (.09). Where the row and column converge you will see a value of 0.94408.
Standard Normal Distribution: Area to the Left of Z score
This corresponds to 94.41% of the standard normal distribution being below (to the left of) the test score. We multiply the decimal value by 100 to get the percentage, rounded to two decimal places.
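The table lookup can be reproduced without a printed table. This is a sketch using only the Python standard library, building the cumulative distribution function of the standard normal from the error function:

```python
from math import erf, sqrt

def phi(z):
    """Cumulative standard normal distribution: area to the left of z."""
    return 0.5 * (1 + erf(z / sqrt(2)))

print(round(phi(1.59), 5))  # 0.94408
```

This matches the table value for a z-score of 1.59, i.e. roughly 94.41% of the distribution lies to the left of the test score.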
2. The 68-95-99.7 Rule:
There is an easy way of remembering what proportion of the area under the curve is accounted for by different numbers of standard deviations from the mean (0).
- Approximately 68% of the population are within ± one standard deviation of the mean.
- Approximately 95% of the population are within ± two standard deviations of the mean.
- Approximately 99.7% of the population are within ± three standard deviations of the mean.
You can calculate these figures using the z score table below, which shows the area between the mean (0) and the z-score. However, this only gives you the area for positive standard deviations. To convert the figure to both sides of the distribution (i.e. ±) you will need to multiply it by 2. For example, go to 1 standard deviation in the table (.3413); multiplying it by 2 gives you the area within ±1 standard deviation (.6826), or roughly 68%.
The most common z score used to determine a confidence interval is 1.96. Let’s use the table below to estimate the proportion of values accounted for by 1.96 standard deviations. As you can see this gives 0.4750 or 47.5%. However, this only represents one side of the curve, so we need to multiply it by 2, giving 0.95 or 95% for ±1.96 standard deviations.
You can’t do this with the other z table we use, as that gives the area to the left of the z-score all the way from –3 standard deviations. That’s why you need to be careful to use the appropriate z score table depending on what you are trying to calculate.
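The 68-95-99.7 rule and the 1.96 critical value can both be checked with the same cumulative-distribution sketch used earlier (standard library only, no table required):

```python
from math import erf, sqrt

def phi(z):
    """Area under the standard normal curve to the left of z."""
    return 0.5 * (1 + erf(z / sqrt(2)))

for k in (1, 2, 3):
    print(f"±{k} SD: {phi(k) - phi(-k):.4f}")
# prints 0.6827, 0.9545 and 0.9973 respectively

# The 1.96 critical value covers 95% of the distribution:
print(f"±1.96 SD: {phi(1.96) - phi(-1.96):.4f}")  # 0.9500
```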
3. A Negative Z Score Table:
When we want to estimate the area under the normal distribution to the left of a negative z score we can use a different table: the negative z score table. This gives the values in the shaded area of the curve to the left of the negative z score. We would use this if our test score was below the population mean and we wanted to test whether the score was significantly lower than the mean (i.e. a one-tail test).
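Because the normal curve is symmetrical, a negative z score table is not strictly necessary: the value for −z is simply 1 minus the value for +z. A quick check of this symmetry:

```python
from math import erf, sqrt

def phi(z):
    """Area under the standard normal curve to the left of z."""
    return 0.5 * (1 + erf(z / sqrt(2)))

# By symmetry, the negative-table value is 1 minus the positive-table value.
print(round(phi(-1.59), 5))     # 0.05592
print(round(1 - phi(1.59), 5))  # 0.05592
```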
4. The Z-Score Formula – Single Sample:
The formula to calculate the z-score for a single sample is as follows:
- Z = ( x – µ ) / σ
- Where x = test score
- µ = population mean
- σ = population standard deviation
For example, let’s say your experiment generated a score of 200, the mean is 140 and the standard deviation is 30.
- Z = (200 – 140) / 30 = 2
This shows that your test score is 2 standard deviations above the mean. In many instances we don’t know the population mean or standard deviation. For this reason we are often forced to use the sample mean and standard deviation. However, you should only use the sample statistics if you are confident the sample has been drawn using an appropriate random method and is sufficiently large to assume a normal distribution. Be careful not to fall into the trap of the law of small numbers.
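The single-sample formula translates directly into code. A minimal sketch using the worked example above:

```python
def z_score(x, mu, sigma):
    """Number of standard deviations the score x lies from the mean mu."""
    return (x - mu) / sigma

print(z_score(200, 140, 30))  # 2.0
```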
5. The Z-Score Formula – Two Samples:
With most experiments we have multiple samples because we want to compare our test result with that of the default (the existing experience). In such instances we need to measure the standard deviation of the different sample means (i.e. the standard error).
Z = ( x – µ ) / ( σ /√n )
Here the z-score describes how many standard errors the sample mean is from the population mean.
It is also used as the critical value in the calculation of the margin of error in an experiment. The margin of error is calculated using either of these equations.
Margin of error = Critical value x Standard deviation of the statistic
Margin of error = Critical value x Standard error of the statistic
When you know the standard deviation of the statistic you can use the first equation. Otherwise, use the second equation with the standard error.
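Either way the calculation is a simple product. A sketch, where the critical value and standard error used are hypothetical illustrative figures:

```python
def margin_of_error(critical_value, standard_error):
    """Margin of error = critical value x standard error of the statistic."""
    return critical_value * standard_error

# Hypothetical: 95% confidence (z = 1.96) and a 2% standard error
print(round(margin_of_error(1.96, 0.02), 4))  # 0.0392
```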
6. Calculating the Z-Score For Multiple Samples:
Let’s say we conduct an experiment involving a new registration form. The mean conversion rate for the existing form is 20%, with a standard deviation of 5%. What is the probability of the existing form achieving a mean conversion rate of 35% over a sample of 100 sessions?
- Z = ( x – µ ) / ( σ /√n )
- Z = (35 – 20) / (5 / √100) = 15 / 0.5 = 30
We already know that 99.7% of values in a normal distribution fall within 3 standard deviations of the mean, and a z-score of 30 lies far beyond even that. This tells us the probability that the existing registration form could achieve a mean conversion rate of 35% by chance is vanishingly small (well below 0.3%). We can, therefore, be confident that such an increase in conversion would likely be down to the new experience and not a random result.
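Note that the standard error in this example is 5 / √100 = 0.5, which makes the resulting z-score very large. A minimal sketch of the sample-mean formula:

```python
from math import sqrt

def z_for_sample_mean(x_bar, mu, sigma, n):
    """z-score of a sample mean, using the standard error sigma / sqrt(n)."""
    return (x_bar - mu) / (sigma / sqrt(n))

# Registration form example: standard error = 5 / sqrt(100) = 0.5
print(z_for_sample_mean(35, 20, 5, 100))  # 30.0
```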
Statistical significance depends upon more than a formula and its outputs. The formula we use for calculating the z-score is based upon certain assumptions being met. These relate to the sample size being large (i.e. over 100) and the sample being drawn using an appropriate random method (e.g. using a random number generator). Unless these assumptions are met, a high level of statistical significance can be meaningless and misleading.
7. Skewness and Kurtosis:
Despite the distribution of many phenomena following a bell-like curve, in reality the vast majority of data sets don’t create a perfect normal distribution. It’s more common for a data set to display some skewness to the left or the right and not to have a symmetrical peak.
Skewness is a measure of the asymmetry of the distribution of a variable. The skewness of a normal distribution is zero, indicating a symmetrical distribution. When the skewness is less than zero (negative skewness) the left-hand tail is longer than the right-hand tail, indicating the bulk of observations lie to the right of the mean. A positive skewness indicates that the right-hand tail is longer than the left side of the curve and the majority of observations lie to the left of the mean.
Kurtosis is an indication of the peakedness of a distribution and the thickness of its tails. A distribution with a large kurtosis will have tails heavier than the normal distribution (e.g. with observations five or more standard deviations from the mean). The normal distribution has a kurtosis of three, and so many statistical packages use the ‘excess kurtosis’, obtained by subtracting 3 from the kurtosis (proper). Thus the excess kurtosis for a standard normal distribution is zero.
A positive excess kurtosis distribution is referred to as a leptokurtic distribution, meaning a high peak. A negative excess kurtosis distribution is called a platykurtic distribution, indicating a flat topped curve.
8. Normality Test Using Skewness and Kurtosis:
We can conduct a normality test using skewness and kurtosis to identify if our data is close to a normal distribution. A z score can be calculated by dividing the skew value or excess kurtosis by their standard error.
Z = Skew value / SE skewness
Z = Excess kurtosis / SE excess kurtosis
However, because the standard error becomes smaller as the sample size increases, z-tests with a null hypothesis of normality are often rejected for large samples even though the distribution is not much different from normal. On the other hand, with small samples the null hypothesis of normality is more likely to be accepted than it should be. This means we should use different critical values for rejecting the null hypothesis according to the sample size.
- Small sample sizes (n < 50). When the absolute z-score for either skewness or kurtosis is greater than 1.96 (or 95% confidence level), we can reject the null hypothesis and assume the sample distribution is non-normal.
- Medium sized sample (50 ≤ n < 300). If the absolute z-score for either skewness or kurtosis is larger than 3.29 (equivalent to a 99.9% confidence level) we can reject the null hypothesis and decide the sample distribution is non-normal.
- Large sample size (n > 300). Here we can use the absolute values of skewness and kurtosis without consulting the z-value. An absolute skew value of greater than 2 or a kurtosis (proper) above 7 (4 for excess kurtosis) would indicate the sample distribution is non-normal.
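The skewness and kurtosis z-scores can be sketched as follows. This uses the simple population (biased) moment estimators and the large-sample standard-error approximations √(6/n) and √(24/n); statistical packages often use sample-adjusted versions, so treat this as illustrative:

```python
from math import sqrt

def skewness(data):
    """Population (biased) skewness: third standardised moment."""
    n = len(data)
    m = sum(data) / n
    s = sqrt(sum((x - m) ** 2 for x in data) / n)
    return sum((x - m) ** 3 for x in data) / (n * s ** 3)

def excess_kurtosis(data):
    """Population (biased) kurtosis minus 3 (zero for a normal distribution)."""
    n = len(data)
    m = sum(data) / n
    s = sqrt(sum((x - m) ** 2 for x in data) / n)
    return sum((x - m) ** 4 for x in data) / (n * s ** 4) - 3

def normality_z_scores(data):
    """z-scores for skewness and excess kurtosis, using the large-sample
    standard-error approximations sqrt(6/n) and sqrt(24/n)."""
    n = len(data)
    return (skewness(data) / sqrt(6 / n),
            excess_kurtosis(data) / sqrt(24 / n))

# Symmetric toy data: skewness is zero, so its z-score is too
z_skew, z_kurt = normality_z_scores([1, 2, 2, 3, 3, 3, 4, 4, 5])
```

Compare the absolute z-scores against the critical values listed above for your sample size.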
9. What to do when your data is non-normal:
When you discover that your data is non-normal the automatic response is to seek a nonparametric test. Before going down this route it’s important to consider that a number of tests that assume normality are in fact robust to moderate departures from it. This includes t-tests (1-sample, 2-sample and paired t-tests), which are often used by A/B testing software, Analysis of Variance (ANOVA), Regression and Design of Experiments (DoE).
You should also consider why your data is non-normal. Some causes of data not being normal include:
- Large samples will often diverge from normality.
- Outliers or mixed distributions are present.
- Skewness is observed in the data.
- The underlying distribution is non-normal.
- A low discrimination gauge is used with too few discernible categories.
Tests which are not robust to non-normality include:
- Capability Analysis and determining Cpk and Ppk.
- Tolerance Intervals
- Acceptance Sampling for variable data.
- Reliability Analysis for estimating high and low percentiles.
10. Non-Parametric Tests:
The most common non-parametric tests are Pearson’s chi-squared test, the Mann-Whitney U-test and Fisher’s exact test. Pearson’s chi-squared test is designed for categorical data (two or more categories) to determine if there is a significant difference between them. For example, this would be useful to look at differences between new and returning visitors.
Chi-square tests work by comparing the difference between observed (i.e. the data from the test) and the expected values in each cell in a table. Expected values are estimated based upon the null-hypothesis being correct (i.e. that there is no difference between the control and test groups).
Fisher’s exact test is frequently applied instead of the chi-square test for small samples or when the categories are imbalanced. Unlike the chi-square test, which relies on an approximation, Fisher’s exact test calculates the exact probability of observing the distribution seen in the table under the null hypothesis.
The Mann-Whitney U-test is a non-parametric test which requires continuous data. This means it’s necessary to be able to distinguish between values at the nth decimal place and data needs to be from an ordinal, interval or ratio scale of measurement. For example, a conversion rate is a ratio and so the Mann-Whitney U-test could be used to measure statistical significance.
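The chi-square comparison of observed and expected cell counts can be sketched for a 2x2 table. The conversion counts below are hypothetical, purely for illustration:

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic for a 2x2 contingency table laid out as:
        control: a conversions, b non-conversions
        variant: c conversions, d non-conversions
    Expected counts assume the null hypothesis of no difference.
    (No Yates continuity correction is applied.)"""
    n = a + b + c + d
    observed = [a, b, c, d]
    expected = [
        (a + b) * (a + c) / n, (a + b) * (b + d) / n,
        (c + d) * (a + c) / n, (c + d) * (b + d) / n,
    ]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical counts: control converts 20/100, variant 35/100
print(round(chi_square_2x2(20, 80, 35, 65), 2))  # 5.64
```

A statistic of 5.64 with one degree of freedom exceeds the 3.84 critical value for 95% confidence, so here we would reject the null hypothesis.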
11. Online Experiments (A/B Tests and MVTs):
With an online experiment, when the probability of an outcome under the null hypothesis is very low and we reject the null hypothesis (i.e. there appears to have been a significant improvement), one of the following is true:
- There has been a real improvement in our variant compared to the default experience.
- There has been no real improvement and we recorded a rare outcome that happens by chance (e.g. at 95% statistical significance we would expect this to be observed one in twenty times).
- Our statistical model is incorrect.
When we are measuring changes in proportions, such as comparing two conversion rates, we can usually ignore the third possibility because we will be testing a classical binomial distribution. If we tried to test a continuous variable, such as revenue, this would be problematic because the distribution would clearly not be binomial.
Statistical significance (e.g. a significance level of 0.05, or 95% confidence) is a measure of uncertainty. At a statistical confidence of 95% we know there is a one in twenty chance that what we observed was just a natural variation from the mean. This is not an error, but simply a rare event. The level of uncertainty you are willing to accept should be related to the risk of making a wrong decision (i.e. dismissing the null hypothesis when the null hypothesis is true).
The p-value you intend to use for your experiment should be agreed before proceeding with a test. Otherwise there is a danger that you will adjust the level of statistical confidence to obtain a positive outcome (i.e. dismiss the null hypothesis). It’s also important not to simply set a default level of statistical confidence (e.g. 95%) without considering the nature of the experiment and the risks involved.
The same is true for experiments relying on confidence intervals. They use the same logic as p-values, and it’s good practice to look at both p-values and confidence intervals to understand the level of uncertainty an A/B test or MVT generates.
As part of your preparations for any A/B test always estimate the required duration of an experiment. The VWO test duration calculator allows you to estimate the length of the test so that you won’t let your test run for longer than necessary.
This is important because it will improve your chances of getting a conclusive test result and help set expectations with your stakeholders. There is always an opportunity cost with running an A/B test because the time and resources could be used for something else (e.g. another experiment). If we fail to set the duration before an experiment begins there is a danger we will be subject to the sunk cost fallacy and allow the test to run for longer than it should.
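The internals of any particular duration calculator aren't specified here, but the standard two-proportion sample-size formula gives a rough sketch of the idea. The traffic figure and the 80% power assumption below are hypothetical:

```python
from math import ceil

def sample_size_per_variant(p1, p2, z_alpha=1.96, z_beta=0.84):
    """Approximate sessions needed per variant to detect a change from
    conversion rate p1 to p2 at 5% two-sided significance (z_alpha=1.96)
    with 80% power (z_beta=0.84)."""
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2)

n = sample_size_per_variant(0.20, 0.22)  # detect a 2-point uplift from 20%
print(n)  # 6500 sessions per variant
# At a hypothetical 1,000 sessions per day split across two variants,
# the test would need to run for roughly 2 * n / 1000 = 13 days.
```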
12. The Null Hypothesis:
We have briefly mentioned the null hypothesis and how it is normally based upon the variant not performing significantly better than the control. This is what we call a superiority test: we want to ensure we don’t implement an experience we think is better than the control when in fact it is not.
However, with A/B testing, we often just want to avoid implementing an experience which is significantly worse than the existing design. For example, we may believe the new experience offers improved usability or it better informs our users about a legal aspect of our service. We might not expect this to improve our conversion rate, but we will be prepared to implement it provided it doesn’t significantly reduce conversions.
This second type of null hypothesis is called a non-inferiority test. With this type of test we might set the null hypothesis at around –2% rather than at zero, as with a superiority test. This means that a small uplift in conversion might become statistically significant with a non-inferiority test even though it would have been dismissed as not significant had we set our null hypothesis at zero.
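Shifting the null hypothesis by a margin can be sketched as follows. The uplift, standard error and margin below are hypothetical illustrative figures:

```python
def non_inferiority_z(diff, se, margin=0.02):
    """z-statistic for a non-inferiority test.
    H0: the variant is worse than the control by at least `margin`.
    diff   = variant rate minus control rate
    se     = standard error of that difference
    margin = non-inferiority margin (the -2% mentioned above)"""
    return (diff + margin) / se

# Hypothetical example: a 0.5% uplift with a 1% standard error
print(round(non_inferiority_z(0.005, 0.01), 2))  # 2.5
```

A z of 2.5 clears the 1.645 one-sided critical value for 95% confidence, whereas the same 0.5% uplift tested against a null of zero (z = 0.5) would not be significant.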
13. When to use Non-Inferiority Tests?
You now understand the importance of setting the right kind of null hypothesis. That doesn’t mean it is always appropriate to use a non-inferiority test. When the stakes are high or you have a senior stakeholder who is dead against implementing an experience you will probably have to use a superiority test. We have to evaluate each test separately and consider some of the following factors:
- High risk of even a small decline in conversion having a relatively large impact on revenues or conversions (e.g. ecommerce checkout or cashier in gaming).
- High implementation costs and difficult to roll back change.
- Significant ongoing maintenance and related costs to support the new experience.
Non-inferiority tests tend to be more appropriate when we don’t see these issues and so we should consider them when:
- Implementation costs are low
- Little if any maintenance costs
- Easy to roll back the change if it needs to be removed
- Very little opposition to the new experience from internal stakeholders.
- Low risk to revenues or other important conversions
This means that you can use non-inferiority tests for many small and trivial changes such as button colour, image changes, copy updates, live chat, and email capture forms. We don’t expect many small changes to improve conversions, and so it is totally appropriate to set the bar lower and use a non-inferiority test.
14. History of Standard Normal Distribution:
The French mathematician Abraham de Moivre (1667 to 1754) is credited with being the first to propose the central limit theorem, which established the concept of the standard normal distribution. He was fascinated with gambling and was often employed by gamblers to estimate probabilities. De Moivre discovered a bell shaped distribution whilst observing the probability of coin flips. It is suggested he was trying to identify a mathematical expression for the probability of x number of tails out of 100 coin flips.
This turned out to be a breakthrough in helping to discover that a large number of phenomena (e.g. height, weight and strength) approximately form a bell shaped distribution. Lambert Quetelet (1796 to 1874), a Belgian astronomer, first noticed a ratio between weight and height which became the basis for the body mass index (BMI).
The normal distribution has often been used in the measurement of errors. When it was first discovered it was used to analyse errors of measurement in astronomical observations. Galileo observed that errors were systematic, and it was later noted that errors also follow the normal distribution. Today scientists and social scientists use the normal distribution extensively. In finance, traders have tried to use the normal distribution to model asset prices. However, real life data rarely, if ever, follows a perfect normal distribution.