ML Model Testing: Refine Your ML Models, One Test At A Time

An introduction to different aspects of Machine Learning model testing

What is Model Testing?

The widespread use of Machine Learning (ML) in domains as diverse as entertainment and healthcare has shaped how applications evolve. This has created a demand for trustworthy ML applications that are tested against the rigors of the real world. Be it self-driving cars or diagnostic health tests, applications need to be predictable, robust, and efficient.

Of course, it helps to gain the end users' trust if the application has been hammered and tinkered with to expose any future issues. ML model testing is that hammering: it tries to reveal any deviation of the observed model behavior from what is expected.

Testing is worth the cost and time since you would not want to dip both feet into the water without knowing the potential pitfalls. While conventional software testing has been around for decades, ML model testing is gaining ground of its own. For example, an automated white-box testing framework named DeepXplore caught thousands of instances of anomalous handling by some autonomous driving systems in how they negotiated corners. Another popular tool, Themis, caught bias based on race, marital status, and gender in ML algorithms.

Why is ML Model Testing Important?

Machine learning systems are driven by statistics and are expected to make independent decisions. Systems that churn out valid decisions need to be tested for the demands of the target environment and user expectations. Good ML testing strategies aim to reveal any potential issues with design, model selection, and programming to ensure reliable functioning. While rife with challenges, ML testing also offers these advantages:

  • ML is a data-driven programming domain where model behavior depends on the training and testing data. ML testing can expose data inconsistencies, including:
      • Presence of noise
      • Biased or incorrect labels
      • Skew between the training and test data
      • Presence of poisoned data
      • Incorrect assumptions about post-deployment data
  • ML testing can uncover issues that may arise from an ill-planned framework, especially if scalability requirements were ignored.
  • The core of the ML system, the learning program, could either be a legacy part of the framework or code written by your team. The learning programs that realize, deploy, and configure the ML system could also contain defects that can be caught by model testing.
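Some of the data issues listed above can be probed with very simple checks. The sketch below uses only the Python standard library. It compares label proportions between a training and a test split to flag skew, and it collects identical feature rows that carry different labels, which is a cheap signal of noisy or incorrect labels. The function names, the toy data, and the 0.2 threshold are illustrative assumptions and do not come from any particular framework.

```python
from collections import Counter, defaultdict


def label_proportions(labels):
    """Return {label: fraction of examples carrying that label}."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}


def label_skew(train_labels, test_labels):
    """Largest absolute gap in label proportion between the two splits."""
    train = label_proportions(train_labels)
    test = label_proportions(test_labels)
    all_labels = set(train) | set(test)
    return max(abs(train.get(l, 0.0) - test.get(l, 0.0)) for l in all_labels)


def conflicting_labels(features, labels):
    """Feature tuples that appear with more than one label."""
    seen = defaultdict(set)
    for x, y in zip(features, labels):
        seen[tuple(x)].add(y)
    return {x: ys for x, ys in seen.items() if len(ys) > 1}


# Toy data (hypothetical): a balanced training split and a skewed test split.
train_y = ["cat", "dog", "cat", "dog"]
test_y = ["cat", "cat", "cat", "dog"]
skew = label_skew(train_y, test_y)

train_x = [(1, 0), (0, 1), (1, 0), (0, 1)]
noisy_y = ["cat", "dog", "dog", "dog"]  # (1, 0) is labeled both cat and dog
conflicts = conflicting_labels(train_x, noisy_y)

SKEW_THRESHOLD = 0.2  # assumed tolerance; tune per project
print(f"label skew = {skew:.2f}, flagged = {skew > SKEW_THRESHOLD}")
print(f"conflicting examples: {conflicts}")
```

Checks like these are deliberately crude. They will not detect poisoned data or subtle distribution shift, but they are cheap enough to run on every new data drop.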

How is Model Testing Different from Application Testing? 

Some major points of difference between the testing of conventional software and ML models are:

  • Focus of the test: Application testing is generally focused on the code, while model testing also aims to uncover potential issues due to data and the learning algorithm.
  • Expected behavior during tests: The test suites for conventional software are driven by expected program outputs that stay consistent over time. Conversely, the expected behavior of an ML model can change over time, for example when the model is retrained on new data.
  • Test inputs: The inputs for application testing are generally data to simulate different conditions. For model testing, test inputs may include data as well as another learning program. 
  • Test oracle: The oracle used to verify test outputs against expected behavior is pre-defined for traditional testing. For model testing, on the other hand, defining oracles is time-consuming and tricky, even for domain-specific problems. Test oracles are explained in the section on offline testing.
  • Acceptance criteria: The criteria for adequacy of application tests are uniform across the industry and follow metrics such as line or flow coverage. The logic encoded in an ML model cannot be gauged against such criteria.
  • Probability of false positives: ML testing has been found to report higher instances of false positives when compared to application testing.
  • Executor: The potential defects in ML modeling are not limited to the code and include the choice of learning algorithm and datasets. Therefore, the tester's role in ML testing could involve data scientists or framework designers in addition to the developers.
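The difference in expected behavior and acceptance criteria can be illustrated with a small sketch. A conventional function is checked against exact expected outputs, while a model is checked against an aggregate metric over a labeled evaluation set. The threshold classifier, the evaluation data, and the 0.8 accuracy bar below are hypothetical stand-ins for a trained model and project-specific criteria.

```python
def celsius_to_fahrenheit(celsius):
    """Conventional code: the output for a given input is fixed and known."""
    return celsius * 9 / 5 + 32


def toy_model(x, threshold=0.5):
    """Stand-in for a trained classifier (hypothetical): 1 if x exceeds threshold."""
    return 1 if x > threshold else 0


def accuracy(model, inputs, labels):
    """Fraction of inputs for which the model's prediction matches the label."""
    correct = sum(model(x) == y for x, y in zip(inputs, labels))
    return correct / len(labels)


# Application-style test: exact expected outputs.
assert celsius_to_fahrenheit(100) == 212
assert celsius_to_fahrenheit(0) == 32

# Model-style test: individual errors are tolerated, the aggregate must clear a bar.
eval_inputs = [0.1, 0.4, 0.45, 0.6, 0.9, 0.55, 0.2, 0.8, 0.7, 0.3]
eval_labels = [0, 0, 1, 1, 1, 1, 0, 1, 1, 0]  # the 0.45 example is a hard case
MIN_ACCURACY = 0.8  # assumed acceptance threshold
model_accuracy = accuracy(toy_model, eval_inputs, eval_labels)
assert model_accuracy >= MIN_ACCURACY
print(f"accuracy = {model_accuracy:.2f}")
```

The model misclassifies one evaluation example and the test still passes. That would count as a failure in a conventional unit test, which is why ML acceptance criteria are usually stated as thresholds.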

Summarizing the above points for you:

A table showcasing the difference between Application testing and ML model testing

How to Test ML Models?

Before taking you through various model testing strategies, let us show you where it figures in the scheme of things.

The role of model testing in the system development flow. Source: Machine Learning Testing: Survey, Landscapes and Horizons

As seen from the image, an ideal model testing scenario would include offline as well as online testing. But what goes on under the hood? Let us break apart the two hexagons and see the underlying processes.

The components of ideal offline and online model testing. Source: Machine Learning Testing: Survey, Landscapes and Horizons

Offline testing

The various components of offline testing include:

  • The initial step of requirement gathering, which also defines the testing procedure.
  • Test inputs could be samples extracted from the training dataset or synthetic data.
  • Let us now introduce you to the Oracle Problem. ML systems are developed to be oracles, i.e., answer the questions that do not have existing answers. After all, why develop, train, and test a system if the correct answer was already available? Test oracles are methods that decide if a deviation is an issue in the ML system. Common methods like model evaluation and cross-referencing are used in this step.
  • The test execution may then be done on a subset of training or testing data, checking for test oracle violations.
  • In case of any violations, the generated reports would help the team locate and address the issues.
  • Regression tests can be used to validate the debugging and fixing of the issues.
  • Successful cycles of offline tests culminate in a model fit for deployment.
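The offline loop described above (generate test inputs, apply a test oracle, report violations) can be sketched in a few lines. Here the oracle is a form of cross-referencing: two independent implementations of the same function are run on the same synthetic inputs, and any disagreement beyond a tolerance is reported. Both implementations, the deliberately injected bug, and the tolerance are hypothetical and exist only to make the sketch self-contained.

```python
import math
import random


def sigmoid_reference(z):
    """Reference implementation of the logistic function."""
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_under_test(z):
    """Implementation under test, with a deliberately injected bug for z < -5."""
    if z < -5:
        return 0.0  # bug: truncates small probabilities to zero
    return 0.5 * (1.0 + math.tanh(z / 2.0))


def run_offline_test(test_inputs, tolerance=1e-9):
    """Return a report of inputs on which the two implementations disagree."""
    violations = []
    for z in test_inputs:
        expected = sigmoid_reference(z)
        observed = sigmoid_under_test(z)
        if abs(expected - observed) > tolerance:
            violations.append({"input": z, "expected": expected, "observed": observed})
    return violations


rng = random.Random(0)  # fixed seed so the test run is repeatable
synthetic_inputs = [rng.uniform(-10, 10) for _ in range(200)]
report = run_offline_test(synthetic_inputs)

print(f"{len(report)} violations out of {len(synthetic_inputs)} inputs")
for violation in report[:3]:
    print(violation)
```

Because the inputs are seeded, the same run can be repeated after a fix and so doubles as a regression test: the report should then come back empty.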

Online testing

Online testing comes to the fore when the ML system is exposed to new data and user behavior after the deployment. Depending on the purpose, it may include the following components:

  • A common method is runtime monitoring, where the collected performance metrics are checked for violations.
  • Monitoring user responses is another common method, and A/B testing is a widely used technique to achieve this. It is particularly useful to compare two versions of the system and test the efficacy of model improvements. You may also like to read this extensive piece on how to conduct A/B testing in machine learning.
  • The findings of A/B testing can feed a Multi-Armed Bandit (MAB) approach to choose the best candidate model.
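When two model versions are compared in an A/B test, a difference in the observed metric only means something if it is unlikely to be due to chance. The sketch below applies a standard two-proportion z-test using only the Python standard library. The conversion counts and the 0.05 significance level are made-up illustrative values.

```python
import math


def two_proportion_z_test(successes_a, total_a, successes_b, total_b):
    """Two-sided z-test for the difference between two proportions.

    Returns (z statistic, p-value).
    """
    rate_a = successes_a / total_a
    rate_b = successes_b / total_b
    pooled = (successes_a + successes_b) / (total_a + total_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    z = (rate_b - rate_a) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical A/B results: conversions out of users served by each model.
z, p_value = two_proportion_z_test(480, 4000, 560, 4000)
ALPHA = 0.05  # assumed significance level
print(f"z = {z:.2f}, p = {p_value:.4f}, significant = {p_value < ALPHA}")
```

A fixed-horizon test like this assumes the sample size was chosen up front. Checking the p-value repeatedly while traffic is still arriving inflates the false positive rate.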

Challenges in Model Testing

Some common issues that plague model testing:

  • While it sounds simple, defining and evaluating the test oracle is a minefield of controllability and observability issues. 
  • The expected behavior of an ML system can be best evaluated when it is viewed on the whole. This abstracted view of the system can hinder testing strategies where breaking the system into components and unit testing them helps isolate issues at a finer level.
  • This purview can move testing challenges from the component level to the system level. For instance, the effects of an inefficient library would translate to a low precision value of the model. This composite view can slow down debugging and delay catching the issue. 
  • Since most ML systems are black-box or grey-box, errors may be amplified further down the development pipeline.

Best Practices to Test Your Models

In school, you are taught a lesson and then given a test. In life, you are tested, and it teaches you a lesson.

We would like to wrap up this post with some practices that can help you avoid hard lessons: 

  • Limiting yourself to a single metric is not enough to see the bigger picture. Behavioral tests, where specific performance aspects are evaluated using targeted tests, hold better potential to uncover system issues. Additionally, bug reports from targeted tests make debugging faster.
  • Not every improvement measured by the metrics is a success. Before you green-light an update, evaluate the extent of the improvement when testing the enhancements.
  • An initial sanity check using a smaller or familiar dataset can help sniff out early bugs. Some industry practitioners either use a small percentage of the training data or a commonly used dataset with known outputs and check if the model could overfit them. The inability of the model to overfit the expected behavior is an early indication of bugs.
  • Use tools for data validation and settings management, such as Pydantic, which web frameworks like FastAPI rely on to validate incoming data. This practice should minimize issues due to data and settings inconsistencies.
  • Analogous to conventional programs, ML models are prone to semantic bugs too. Visualizations and interactive dashboards can help uncover deviant behavior by the models. The plots generated by the ML code can be reloaded after fixes to get quick feedback. An app framework like Streamlit, which turns Python scripts into interactive dashboards, is well suited to this purpose.
  • Automate the testing and monitoring functions when possible:
      • While the CI/CD aspect of MLOps ensures relevant data and code in the repositories, automated CI/CD platforms would ensure testing of each change and smooth collaboration among teams. Here you can learn more about CI/CD in the paradigm of an ML Lifecycle.
      • Many tools, such as Functionize and Appvance, are available to automate unit testing for machine learning and checks on code coverage.
      • Automated monitoring such as provided by the Censius AI Observability Platform would help your team stay a step ahead with customizable monitors and alerts. All of your online testing needs will be taken care of with minimal effort.
  • Testing is vital for any application. Write unit tests for non-learning code, such as data pre-processing or augmentation functions.
  • As shared by Krittin Kalra, founder of Writecream, A/B testing and multivariate testing are the most preferred methods of testing models. A/B testing is favored because it is the quickest way to compare models and see which one is better. Multivariate testing is not as effective as A/B testing but is still a good method to test models.
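Two of the practices above lend themselves to a short sketch: unit tests for non-learning code, and the sanity check that a model can overfit a tiny dataset. The example below uses only the Python standard library. The `min_max_scale` helper, the hand-written logistic regression, the four-point dataset, and the hyperparameters are all illustrative assumptions and not a recommended training setup.

```python
import math


def min_max_scale(values):
    """Non-learning pre-processing code: rescale values to the range [0, 1]."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]  # constant column: avoid division by zero
    return [(v - low) / (high - low) for v in values]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic_regression(features, labels, learning_rate=1.0, epochs=1000):
    """Full-batch gradient descent on log loss; returns (weights, bias)."""
    n_features = len(features[0])
    weights = [0.0] * n_features
    bias = 0.0
    n = len(features)
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(features, labels):
            error = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) - y
            for j, xi in enumerate(x):
                grad_w[j] += error * xi
            grad_b += error
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n
    return weights, bias


def predict(weights, bias, x):
    return 1 if sum(w * xi for w, xi in zip(weights, x)) + bias > 0 else 0


# Unit tests for the pre-processing function.
assert min_max_scale([2, 4, 6]) == [0.0, 0.5, 1.0]
assert min_max_scale([3, 3, 3]) == [0.0, 0.0, 0.0]

# Overfit sanity check on a tiny, linearly separable dataset (logical OR).
tiny_x = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
tiny_y = [0, 1, 1, 1]
weights, bias = train_logistic_regression(tiny_x, tiny_y)
train_accuracy = sum(
    predict(weights, bias, x) == y for x, y in zip(tiny_x, tiny_y)
) / len(tiny_y)
assert train_accuracy == 1.0, "model cannot fit 4 points: likely a bug"
print(f"training accuracy on tiny dataset: {train_accuracy}")
```

If the final assertion fails on data this simple, the fault is far more likely to sit in the training loop or the data pipeline than in the model's capacity.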


The author would like to acknowledge the insightful community discussions that helped frame best practices for model testing:
