Statistical significance tests: A statistical way to compare data populations (2024)

We often come across the need to compare samples of two data sets to derive some conclusions about their similarity. Some common examples include assessing similarity of groups of users based on their purchase history, assessing similarity of servers based on resource utilization, or assessing similarity of applications based on their usage behavior, among others.

But how does one measure similarity? Let’s find out!

To present the concepts, we will compare three pairs of data sets. These pairs are created such that one pair is very similar, another pair is partially similar, and the third pair is very dissimilar. Through these three pairs, you will see how to measure the similarity of such sets.

Data summary stats

One of the most common ways to measure similarity oftwosets is to compare their data summary via mean and median. Figure 1 shows two graphsthatcompare the means and medians of thethreepairs ofdata sets respectively. As we know, thethreepairs have different levels of similarity but that is not exactly reflected in their mean or median values. Here, the data spread around their mean and median seems very similar.Using these graphs, you cannot separatethe mean and medianofthe very similar pairs from the dissimilar pairs. By just looking at the graphs, you wouldobservethat all thethreepairs are very similar, but you know that is not the case! Thus, only measuring the similarity based on mean and/or median can be deceptive.

Statistical significance tests: A statistical way to compare data populations (1)

Visualization tools

Another commonly used method to measure data similarity is using various visualization tools such as box plots, histograms, andcumulative distribution functions(CDFs).

Box plotsprovide a graphical representation of the data by rendering minimum, first quartile, median, third quartile and maximum values. Figures 2 compares the box plot of the three pairs of data sets. Unlike the mean or median, the box plots provide a much more informed view of the data dissimilarities.

Statistical significance tests

While visualization tools are a very expressive way to understand the data, they seldom can be scaled to large data sets. Given hundreds of data sets, comparing them by visualizing charts is an impractical approach.Hence, there is an imperative need to have some mathematical formulations to measure the similarity ofdata sets.Also, note that in many scenariosthedataof the entire population is either unknown or difficult to obtain. In such cases, one needs to work with a sampleofthepopulation.This is where statistical significance testscanhelp.

Student’s t-test

One of the popular statistical significance testsis theStudent’s t-test.An independent two-sample t-test is used to compare the means of two independent groups.The t-value is computed as follows:

Different types of t-test

There are different variations of the Student’s t-test. For simplicity, we limited the discussion above to the independent two-sample t-test. Here are three different variations of this test:

Paired sample t-test:This test is used when the two sets are coming from the same population, e.g., compare the CPU utilization of Windows servers before and after a patch upgrade.
Independent two-sample t-test:This test is used when the two sets are coming from two different population, e.g., compare the CPU utilization of Linux and Windows servers.
One-sample t-test:This test is used when a set is compared against some standard value, e.g., compare the CPU utilization of Window servers with the standard acceptable limit of 80%

The example presented in Figure 5 refers to the independent two-sample t-test. This test works under the assumption that there is no relationship between the two groups, the data in each group is approximately normally distributed and the two groups are obtained via random sampling from the population.

Hypothesis testing

Statistical significance tests belong to the wider class of tests called hypothesis testing. Hypothesis testing is a statistical method used to determine if a certain hypothesis is true or not based on the available sample data.

First, state your null and alternate hypotheses. The null hypothesis assumes that there is no significant difference between the two samples that are being compared. The alternate hypothesis believes that there is a significant difference.
Next, use an appropriate statistical significance test to determine the test statistic score.
With the help of the test score, determine a probability score associated with the results obtained under the null hypothesis, known as the p-value. The p-values indicates how likely it is that similarity will occur by chance and not because of any significant cause.
Set a significance threshold (alpha), which will act like a threshold for the p-value. If the p-value isgreater than alpha, then we do not reject the null hypothesis and assume that there is no significant difference between the two samples.

Other commonly used statistical significance tests

Z-test:The Z-test is similar to the Student’s t-test that helps to compare means of two groups.It is used when we have large observations.
Chi-squared test:The Chi-squared test is kind of statistical hypothesis test that is used to compare observed results with expected results.The Chi-squared test is ideally used when we want to compare categorical data. Examples of categorical data include gender, education qualification, occupation, etc.
Anova test:The Anova test is also known as analysis of variance. It is helpful for testing three or more variables.Similar to multiple two-sample t-test, an Anova test allows you to compare more than two groups at the same time to determine the relationship between them.
KS test:The KS test is used to detect more complex patterns compared – to the t-test, because the t-test and many other classical statistical tests has limitations like normality in data, which means the data must form a bell curve when plotted.

Conclusion

In this blog we explored how to define similarity between two groups. We often face such scenarios even in our day-to-day lives. For example, in clinical trials you can use these tests to determine similarities between different medicines. These tests can also be used in businesses to determine whether a new marketing campaign increased sales or not.

Statistical significance tests or hypothesis testing are some of the most important topics in statistics that are widely used in real-world scenarios and across many sectors such as finance, industries, healthcare, research, and economics. They are even used to determine the strength of the predictions of machine learning algorithms. Thus, statistical significance tests have a very significant usage in real life.

Written by Uday Chandra