
How To Conduct A/B Testing In Machine Learning?

A step-by-step guide to help you A/B test your ML models

Harshil Patel

When working for a product-based, eCommerce, or media company, you might be unsatisfied with the engagement numbers. You might want to evaluate how customers would respond if you increased the price or changed the user interface. Many people believe that they know their customers, but things rarely turn out the way they expected. A/B testing is a way for businesses to test two or more versions of a feature simultaneously to see which yields the best results. In this post, we'll look at what A/B testing is and how to perform it.

What is A/B Testing?

A/B testing is a statistical approach for comparing two or more versions/features to evaluate not only which one works better but also if the difference is statistically significant.

A/B testing can be used for a variety of purposes, including:

  • Refining the messaging and design of marketing campaigns
  • Increasing conversion rates by improving the user experience
  • Optimizing assets such as web pages and ads based on how users engage with them
A/B test example | Image by Author 

Why is A/B testing important?

When doing an experiment or an A/B test, you may discover something new, and the results might be rather humbling. Companies frequently assume they understand their customers, but in reality, customers often behave very differently than expected. As a result, it's essential to conduct tests rather than rely on intuition.

In practice, user behavior is complicated and fluid:

  • Not all users are the same: they differ in age, gender, whether they are new or returning, and so on.
  • Users spend varying lengths of time on the website. Some people are in and out quickly, while others take their time.
  • Users follow many paths. They navigate the website, visiting various pages before reaching the target event or goal.

Modeling an A/B test in this environment without accounting for these differences can easily lead to a misreading of what actually happened.

Benefits of A/B testing:

  • Rapid Iteration
  • Data-driven decisions
  • Improved user engagement
  • Increased revenue & conversions
  • Tests are performed on actual users

What is A/B Testing in Machine Learning?

Machine learning models can be evaluated and improved using the A/B testing approach. It can be used to check whether a new model is better than the one that already exists. For this purpose, the organization should choose a metric for comparing the control model and the new model. This metric is used to assess deployment success and to differentiate between the two. Both models must be applied to a sample of data simultaneously for a predetermined period, with, for example, half of the users served by the control model and the other half by the new model.

Performing A/B test

Let’s see the step-by-step process to understand how to perform the A/B test.


Setting a goal for the experiment is the first stage. What do you believe will happen if you switch to version B? You may be hoping to increase:

  • Conversion rate
  • Product signups
  • User engagement and so on.

In simple terms, it's like outlining the test's goal or what you're hoping to achieve by the end.


You'll need a pool of subjects once you've established your criteria. These might be a group of users or clients. You might not be able to conduct A/B testing if you don't have enough subjects. For example, the dots in the figure below represent the subjects.


Next, assign the subjects to two groups, A and B. It doesn't have to be a 50-50 split. It can be 60-40 or 70-30. You need to figure out the split you need to run the A/B test. In this experiment, you'll also need to determine which population you're targeting, for example, users who search or users who visit.
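One common way to implement such a split is to hash each user ID into a bucket, so that the same user always lands in the same group. The sketch below is only an illustration under that assumption; the function name `assign_group`, the experiment label, and the `share_a` parameter are hypothetical and not part of any particular tool.

```python
import hashlib


def assign_group(user_id: str, experiment: str = "model-ab-test", share_a: float = 0.5) -> str:
    """Deterministically assign a user to group "A" or "B".

    share_a is the fraction of users that should land in group A
    (0.5 for a 50-50 split, 0.6 for 60-40, and so on).
    """
    if not 0.0 < share_a < 1.0:
        raise ValueError("share_a must be strictly between 0 and 1")
    # Hash the experiment label together with the user ID so that
    # different experiments get independent splits.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    # Map the first 32 bits of the hash to a number in [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    return "A" if bucket < share_a else "B"


# Hypothetical usage: a 60-40 split over made-up user IDs.
groups = [assign_group(f"user-{i}", share_a=0.6) for i in range(10_000)]
print("Share in A:", groups.count("A") / len(groups))
```

Because the assignment depends only on the user ID and the experiment label, a returning user keeps seeing the same variant without any stored state.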

Now, you need to define a sample size. The general formula is:

N = 16σ²/δ² 


σ is the sample standard deviation.

δ is the difference between the control and treatment that you want to detect.
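The formula can be evaluated directly. The sketch below assumes a binary conversion metric, for which the variance is σ² = p(1 − p); the 10% baseline rate and the 2 percentage point difference are made-up inputs. This rule of thumb is commonly read as the size of each group at roughly 80% power and a 5% significance level, which is an interpretation added here rather than something the formula itself states.

```python
import math


def sample_size(sigma: float, delta: float) -> int:
    """Rule-of-thumb sample size: N = 16 * sigma^2 / delta^2, rounded up."""
    if sigma <= 0 or delta <= 0:
        raise ValueError("sigma and delta must be positive")
    return math.ceil(16 * sigma**2 / delta**2)


# Hypothetical inputs: 10% baseline conversion, detect a 2-point change.
baseline_rate = 0.10
sigma = math.sqrt(baseline_rate * (1 - baseline_rate))  # std. dev. of a 0/1 outcome
n = sample_size(sigma, delta=0.02)
print(f"Subjects needed: {n}")
```

With these assumed inputs the rule asks for roughly 3,600 subjects. Halving δ quadruples the requirement, which is why small effects need large experiments.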

After you've decided on the sample size, you'll need to figure out the duration of the experiment. Usually, the duration is about 1-2 weeks. You should run the experiment for at least a week to see how users interact with the product on both weekdays and weekends. Finally, run the experiment.
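A rough duration estimate follows from the sample size and the daily traffic. The sketch below is a back-of-the-envelope calculation with hypothetical names and numbers: it assumes two equally sized groups, steady daily traffic, and it rounds up to whole weeks so that every weekday is covered equally (a design choice made here, not a rule from this guide).

```python
import math


def experiment_duration_days(n_per_group: int, daily_users: int,
                             share_in_experiment: float = 1.0,
                             min_days: int = 7) -> int:
    """Estimate how many days the experiment must run.

    n_per_group: required subjects in each of the two groups.
    daily_users: eligible users arriving per day.
    share_in_experiment: fraction of daily users enrolled in the test.
    """
    enrolled_per_day = daily_users * share_in_experiment
    if enrolled_per_day <= 0:
        raise ValueError("the experiment must enroll at least some users per day")
    days = math.ceil(2 * n_per_group / enrolled_per_day)
    days = max(days, min_days)          # never shorter than one week
    return math.ceil(days / 7) * 7      # round up to whole weeks


# Hypothetical traffic: 3,600 subjects per group, 1,000 eligible users per day.
print(experiment_duration_days(3600, 1000), "days")
```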

 In the graphic above, we split the subjects into two groups | Image by Author


In this step, expose each subject to option A or option B, measure the results, and calculate the test statistic. In the above example, we divided the subjects into two groups. Green dots indicate subjects who converted: group A reached a 70% conversion rate and group B reached 40%, and these are our results.

Hypothesis Testing

Now, we'll see whether the observed change is statistically significant. Hypothesis testing is a statistical methodology for drawing conclusions about a population parameter or probability distribution using data from a sample. Let's take the above example again.

Hypothesis testing can be summarized into four steps:

  1. State the hypothesis statements.
  2. Set the significance level.
  3. Set the statistical power.
  4. Set the minimum detectable effect.
A/B test observed results | Image by author

Sample sizes will be far larger in the real world; this is just for illustration. We got a 70% user conversion rate in A and 40% in B. Let's look at our test statistic, which we'll use to determine whether or not there is a real difference between A and B.

Test statistic: A - B = 70% - 40% = 30 percentage points
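The same arithmetic can be reproduced from raw outcomes. The sketch below assumes ten subjects per group, which is a made-up group size chosen only so that the rates come out to 70% and 40%; each 1 stands for a subject who converted and each 0 for one who did not.

```python
from typing import Sequence


def conversion_rate(outcomes: Sequence[int]) -> float:
    """Fraction of subjects who converted (outcomes are 1 or 0)."""
    if not outcomes:
        raise ValueError("a group needs at least one subject")
    return sum(outcomes) / len(outcomes)


# Hypothetical raw data: 7 of 10 converted in A, 4 of 10 in B.
group_a = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
group_b = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

rate_a = conversion_rate(group_a)
rate_b = conversion_rate(group_b)
observed_difference = rate_a - rate_b  # the test statistic, as a fraction

print(f"A: {rate_a:.0%}, B: {rate_b:.0%}, "
      f"difference: {observed_difference * 100:.0f} percentage points")
```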

So, this is our observed difference. The question now is whether it is statistically significant. To answer this, we must determine whether the 30 percentage point difference reflects a real difference between A and B or is just due to random chance. This is where hypothesis testing comes in.

Any observed difference between A and B is therefore due to either:

  • Null hypothesis (Ho): Random chance
  • Alternative hypothesis (Ha): Real difference

You can see how the test works in the graphic below; in our case, A had a higher conversion rate than B, suggesting that version A performed better than version B.

Hypothesis Observation | Image by author

We'll now look at the significance level. The significance level (alpha) is the decision threshold: it is the probability of rejecting the null hypothesis when it is actually true. A lower significance level means stronger evidence is required before concluding that there is an underlying difference between the control and the treatment.

The p-value is the probability of observing a difference at least as large as the one measured if the null hypothesis were true, that is, if only random chance were at work. A low p-value is evidence against the null hypothesis: the lower the p-value, the more likely Ho is to be rejected, because what you saw is unlikely to have happened by chance alone. In the majority of cases, alpha is set to 0.05.

With alpha = 0.05: if the p-value is less than 0.05, reject Ho and conclude Ha.
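One standard way to obtain a p-value for two conversion rates is a two-proportion z-test. The article does not say which test to use, so treat this as one reasonable choice rather than the prescribed method. The sketch below uses only the standard library; the function name and the group sizes (10 and 100 subjects per group) are hypothetical, and the normal approximation is rough for very small groups.

```python
import math


def two_proportion_z_test(conversions_a: int, n_a: int,
                          conversions_b: int, n_b: int) -> tuple:
    """Return (z, two-sided p-value) for the difference in conversion rates."""
    rate_a = conversions_a / n_a
    rate_b = conversions_b / n_b
    # Under Ho both groups share one rate, so pool the data to estimate it.
    pooled = (conversions_a + conversions_b) / (n_a + n_b)
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if standard_error == 0:
        raise ValueError("no variation in the data; the test is undefined")
    z = (rate_a - rate_b) / standard_error
    # Two-sided tail probability of a standard normal variable.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


ALPHA = 0.05
# Same 70% vs 40% rates at two hypothetical group sizes.
for conv_a, n_a, conv_b, n_b in [(7, 10, 4, 10), (70, 100, 40, 100)]:
    z, p = two_proportion_z_test(conv_a, n_a, conv_b, n_b)
    decision = "reject Ho" if p < ALPHA else "fail to reject Ho"
    print(f"n={n_a} per group: z={z:.2f}, p={p:.4f} -> {decision}")
```

Under these assumed counts, the same 30 percentage point gap is not significant with 10 subjects per group but is significant with 100, which is why the sample size has to be fixed before reading the result.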

Next, set the statistical power, which is the probability of detecting an effect when the alternative hypothesis is true. It is usually set at 0.80.
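Power can also be estimated for a planned experiment. The sketch below uses a normal approximation for two conversion rates and ignores the negligible chance of a significant result in the wrong direction; the rates and group sizes are hypothetical inputs, and `approximate_power` is an illustrative name rather than a library function.

```python
from math import sqrt
from statistics import NormalDist


def approximate_power(rate_a: float, rate_b: float,
                      n_per_group: int, alpha: float = 0.05) -> float:
    """Approximate power of a two-sided test comparing two conversion rates."""
    std_normal = NormalDist()
    z_critical = std_normal.inv_cdf(1 - alpha / 2)
    standard_error = sqrt(rate_a * (1 - rate_a) / n_per_group
                          + rate_b * (1 - rate_b) / n_per_group)
    if standard_error == 0:
        raise ValueError("rates of exactly 0 or 1 in both groups give no variation")
    # Probability that the observed z statistic clears the critical value.
    return std_normal.cdf(abs(rate_a - rate_b) / standard_error - z_critical)


# Hypothetical plan: 10% vs 12% conversion at two group sizes.
for n in (1000, 3600, 10000):
    print(f"n={n} per group: power is about {approximate_power(0.10, 0.12, n):.2f}")
```

If the estimated power falls well below 0.80, the experiment is likely to miss a real effect of that size, and the usual remedy is a larger sample.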

Finally, you have to set the minimum detectable effect (MDE), the smallest change that would matter in practice. For example, you might decide that the change is practically significant only if revenue per user per day increases by at least 1%. We cannot cover all aspects of hypothesis testing in this article, so here are some suggestions for further reading:

Statistical Significance Tests for Comparing Machine Learning Algorithms

Hypothesis testing

Hypothesis Test for Comparing Machine Learning Algorithms

Validity Checks

In this step, we'll sanity-check the experiment. A faulty experiment might lead to a poor decision. Look for external factors and sources of bias, such as instrumentation effects and selection bias. For example, if you ran the experiment during a holiday or a period of economic instability, the results may not reflect normal behavior, and you may make poor decisions.
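One widely used sanity check, not named above but in the same spirit, is a sample ratio mismatch test: it asks whether the group sizes you actually observed are consistent with the split you intended. A large mismatch often points to an assignment or logging problem. The sketch below is a chi-square goodness-of-fit test with one degree of freedom, written with the standard library only; the function name and the counts are hypothetical.

```python
import math


def sample_ratio_mismatch_p_value(n_a: int, n_b: int,
                                  expected_share_a: float = 0.5) -> float:
    """p-value for the observed group sizes under the intended split."""
    if not 0.0 < expected_share_a < 1.0:
        raise ValueError("expected_share_a must be strictly between 0 and 1")
    total = n_a + n_b
    expected_a = total * expected_share_a
    expected_b = total - expected_a
    chi_square = ((n_a - expected_a) ** 2 / expected_a
                  + (n_b - expected_b) ** 2 / expected_b)
    # Survival function of a chi-square distribution with 1 degree of freedom.
    return math.erfc(math.sqrt(chi_square / 2))


# Hypothetical counts for an intended 50-50 split.
for n_a, n_b in [(5020, 4980), (5300, 4700)]:
    p = sample_ratio_mismatch_p_value(n_a, n_b)
    print(f"A={n_a}, B={n_b}: p={p:.4g}")
```

A very small p-value here suggests investigating the experiment setup before trusting any conversion results.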


The final step is to make a decision based on the outcomes of your experiment, for example, whether to roll out the new version or feature.

When to do the A/B Test in ML?

A/B testing is a strategy for determining how a change in one variable impacts audience or user engagement. It's a systematic strategy for improving campaigns and target conversion rates in marketing, web design, product development, and user experience design. You can perform A/B testing if you want to:

  • Compare which product performs better
  • Identify which soil type supports better seed germination in agriculture
  • See which experiment generated the most user engagement in product and sales
  • Set the price for a product by finding which price yields higher profits or attracts more new customers

Let's take a real-world example:

  • Bing conducted an A/B test in which they changed the way ad headlines were shown in the Bing search engine.
  • This small experiment resulted in a revenue gain of 12%, or more than $100 million per year, in the United States alone.

A/B testing is less effective when testing large changes, such as new products, new branding, or altogether new user experiences. Such changes can trigger higher-than-normal engagement or emotional responses that cause people to behave differently than they normally would.

Common A/B Testing Mistakes You Should Avoid

When working with other professionals in an organization, there is a chance that certain concepts will be misunderstood. As a data scientist, you might want to educate or help others understand how to handle data properly. Let's take a look at some of the most common A/B testing mistakes:

Incorrect hypothesis: The entire experiment is based on the hypothesis. What has to be changed? What is the reason for the change? What is the intended effect? If you start with the wrong hypothesis, the likelihood of the test succeeding diminishes. Make sure the hypothesis is well founded before moving on to the next step.

Testing multiple elements simultaneously: This can happen when you run an A/B test with multiple metrics or one metric with various treatment groups. When you test too many things at once, it's tough to determine which one caused the success or failure. As a result, prioritizing tests is critical for successful A/B testing. 

To overcome this problem, you can separate all the metrics into three groups: first, those you expect to be impacted; then, those that may be impacted; and finally, those that are unlikely to be affected.
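When several metrics or treatment groups are tested at once, the chance of at least one false positive grows. A simple and conservative adjustment, offered here as one option rather than something this guide prescribes, is the Bonferroni correction: divide alpha by the number of comparisons. The metric names and p-values below are made up for illustration.

```python
from typing import Dict


def bonferroni_significant(p_values: Dict[str, float],
                           alpha: float = 0.05) -> Dict[str, bool]:
    """Flag which comparisons stay significant after a Bonferroni correction."""
    if not p_values:
        raise ValueError("at least one p-value is required")
    # Each comparison must clear a stricter threshold: alpha / number of tests.
    threshold = alpha / len(p_values)
    return {name: p < threshold for name, p in p_values.items()}


# Hypothetical p-values for three metrics tested in the same experiment.
results = bonferroni_significant({
    "conversion_rate": 0.010,
    "signups": 0.040,
    "engagement": 0.001,
})
for metric, significant in results.items():
    print(f"{metric}: {'significant' if significant else 'not significant'}")
```

In this made-up case, a p-value of 0.04 would pass the usual 0.05 threshold on its own but not the corrected threshold for three comparisons.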

Ignoring the importance of statistics: It makes no difference how you feel about the test. Let the test run its full course, whether it looks like it is passing or failing, so that it can reach statistical significance. Stopping early or ignoring the statistics could result in poor decision-making and product failure.

Not validating: It's critical to double-check that the results are correct. An A/B test can be faulty if it is run under conditions that are likely to distort the results, so validate both the setup and the outcome before acting on them.


Companies will find it easy to run the test and use the data to improve user experience and performance. A/B testing may be done using various technologies, but as a data scientist, you must understand the aspects that go into it.

To validate the test and demonstrate its statistical significance, you must also be familiar with statistics. A/B testing can help you enhance your results in a variety of ways. I hope you enjoyed the article. Happy experimenting!
