
Caught Off-Guard In The Production? Top Reasons Why Your Model Failed

Our take on the common reasons ML models perform poorly in production, and how to tackle them.


It is a campfire story for those involved with ML model development. Irrespective of your role in the ML development lifecycle, the biggest bogeyman is the model's failure in production. The model fulfilled the metrics during development and passed the validation with flying colors. Yet, in the field, it fell face-first into the hard ground of a live environment.

If you pass around the flashlight and let teams explain why their model had failed, they will offer numerous reasons. The reasons may include not considering the volatility of data collected in the live environment, not considering the variety of user demography, or even thinking that deployment was the endgame. 

Spoiler alert: Deployment is not the endgame. It is a beginning.

A 2019 report on global AI adoption trends by International Data Corporation (IDC) found that, among 2,473 respondent organizations, 25% reported failure rates of up to half of their AI projects. In the same year, The State of Development and Operations of AI Applications, a study by DotScience, found that although 63.2% of respondents were spending between $500,000 and $10 million, 60.6% of them experienced a variety of operational challenges.

The challenges that ML development faces differ from those of conventional software and are shaped by the dynamics of a trio of factors:

  • The data: The need for maintenance and storage of data used for training during development and re-training after the model deployment.
  • The model: The need for re-training and, in some cases, changing the ML algorithm.
  • The code: The need for maintenance and enhancements similar to conventional software.
The three factors that contribute to challenges faced by a Machine learning process. Source: The author.

The common reasons ML models fail in production

Let us now take each of the above three factors in turn and look at the common machine learning problems associated with the relevant steps. Fear not, we have also analyzed possible countermeasures that can help you avoid them.

Failure due to the data

You may have developed a cutting-edge algorithm, but the importance of data quality for a successful ML project should not be underestimated. Organizations therefore dedicate effort and capital to ensure that quality datasets enter their ML pipelines. The following issues may cause failures downstream in the pipeline.

Issues with data collection

Acquiring data and understanding its structure is a challenge in itself, and processing it is an equally major task. Take the example of the single responsibility principle used in Twitter's architecture. Each of the multiple services that processed Twitter user data had a single responsibility, which facilitated scalability. Yet, at a large scale, it could become difficult to track which service stored the data and in what form. This may result in stored data that cannot be parsed or, worse, a failure to save it to the relevant databases.

Issues with data preprocessing

The preprocessing stage involves data cleaning, reduction and transformations, and effective methods like feature engineering. You may read our take on how feature engineering aids ML development.

A relatively lesser-known issue of data dispersion can occur during this stage.

An ML project might draw data from several sources, each with its own dictionaries and schema. Data integration is therefore a sensitive process. A historical instance of potential failure arising from data dispersion is the Firebird advisory system. The system, which aimed to identify target areas for fire inspections, had to be trained and tested on 12 different datasets. Since spatial information for the buildings came in different formats and under different names, joining the data turned out to be a painfully time-consuming process.
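To make the joining problem concrete, here is a minimal sketch of harmonizing field names and a join key before merging two sources. It is not how the Firebird team did it; the alias table, field names, and sample records are all hypothetical and only illustrate the idea of mapping every source onto one canonical schema first.

```python
# Hypothetical alias table: each source names the same field differently.
COLUMN_ALIASES = {
    "addr": "address",
    "street_address": "address",
    "bldg_id": "building_id",
    "id": "building_id",
}

def normalize_record(record):
    """Rename known aliases to canonical names and normalize the join key."""
    out = {}
    for key, value in record.items():
        name = key.strip().lower()
        out[COLUMN_ALIASES.get(name, name)] = value
    if out.get("address") is not None:
        # Collapse case and whitespace so "12 Main St " matches "12 main st".
        out["address"] = " ".join(str(out["address"]).lower().split())
    return out

def join_on_address(left_rows, right_rows):
    """Inner-join two lists of dicts on the normalized address.

    Returns (joined, unmatched_left) so that rows lost in the join stay visible.
    """
    right_index = {}
    for row in map(normalize_record, right_rows):
        if row.get("address") is not None:
            right_index[row["address"]] = row
    joined, unmatched = [], []
    for row in map(normalize_record, left_rows):
        match = right_index.get(row.get("address"))
        if match is None:
            unmatched.append(row)
        else:
            joined.append({**match, **row})
    return joined, unmatched

# Hypothetical records from two sources with different schemas.
inspections = [{"Addr": "12 Main St ", "violations": 3},
               {"Addr": "9 Oak Ave", "violations": 0}]
permits = [{"street_address": "12 main st", "bldg_id": "B-1"}]

joined, unmatched = join_on_address(inspections, permits)
print(joined)
print(unmatched)
```

Returning the unmatched rows alongside the joined ones is a deliberate choice: silent row loss during integration is easy to miss and hard to trace later.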

Issues with data augmentation

The most intensive task of augmentation is data labeling. Factors like high volumes of data, especially if collected in real-time, the absence of experts, and high variance can inhibit the availability of labeled datasets.

Issues with data analysis

A 2017 research survey of data scientists working at Microsoft revealed that data issues could raise concerns about the overall quality of a project. Data profiling and visualization are therefore required to discover discrepancies in the data and to verify assumptions.

The solutions to a failure caused by the data

  • In data labeling, the contribution of human experts depends directly on the user interface of the annotation tools. The choice of applications should therefore focus on usability.
  • Tools that visualize data profiles help gauge data quality and reveal discrepancies early.
  • Preprocessing should ensure that the dataset is balanced for the target variables.
  • It is crucial to know how to validate data for ML models in production.
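The last two points can be sketched in a few lines of plain Python. The schema, field names, bounds, and sample rows below are assumptions made up for illustration; a real project would more likely rely on a dedicated validation library, but the checks have the same shape.

```python
from collections import Counter

# Hypothetical schema: field name -> (expected type, (min, max)); None = unbounded.
SCHEMA = {
    "age": (int, (0, 120)),
    "income": (float, (0.0, None)),
    "label": (int, (0, 1)),
}

def validate_row(row, schema=SCHEMA):
    """Return a list of human-readable problems found in one record."""
    problems = []
    for field, (expected_type, (low, high)) in schema.items():
        value = row.get(field)
        if value is None:
            problems.append(f"{field}: missing")
            continue
        if not isinstance(value, expected_type):
            problems.append(f"{field}: expected {expected_type.__name__}, "
                            f"got {type(value).__name__}")
            continue
        if low is not None and value < low:
            problems.append(f"{field}: {value} below minimum {low}")
        if high is not None and value > high:
            problems.append(f"{field}: {value} above maximum {high}")
    return problems

def class_balance(rows, target="label"):
    """Return the share of each target class among rows that carry a label."""
    counts = Counter(row[target] for row in rows if row.get(target) is not None)
    total = sum(counts.values())
    return {cls: n / total for cls, n in counts.items()} if total else {}

# Hypothetical sample: one out-of-range age and one missing income.
rows = [
    {"age": 34, "income": 52000.0, "label": 0},
    {"age": 150, "income": 41000.0, "label": 0},
    {"age": 29, "income": None, "label": 1},
    {"age": 41, "income": 38000.0, "label": 0},
]

report = {}
for index, row in enumerate(rows):
    problems = validate_row(row)
    if problems:
        report[index] = problems
balance = class_balance(rows)

print(report)
print(balance)
```

Running the same checks on training data and on incoming production data is one way to catch a mismatch before the model does.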

Failure due to the model

The practical reasons that can cause failure in this respect are as follows.

Issues with model selection

The team's eagerness to develop an awe-inspiring solution can lead to the selection of complex models. Take the example of applying deep learning to AirBnB search, where the team started with a complex deep learning model. The black-box nature of neural networks overwhelmed the team and cost multiple development cycles, causing many failed deployment attempts.

Issues with model training

An overfitted model, or a model trained on an unbalanced dataset, is likely to fail when it meets data collected from the target environment.

Issues with hyper-parameter selection

Hyper-parameter optimization techniques are computation-heavy, more so when used for deep learning models. Additionally, the requirements set by the development environment should reflect the target environment. For instance, real-life wireless networks work with constrained energy sources and memory. A development environment that is not in sync with such restrictions is a gateway to failure in production.

Issues with model verification

The choice of performance metrics is important for post-deployment success. One reported instance shows that an improvement in model performance does not always translate to business gains. The deployment of about 150 models could not convert the proxy metric of clicks into the desired business outcome. Downstream effects, such as the conversion of clicks to sales, cancellations, and customer service tickets, had been overlooked during model verification.

The solutions to a failure caused by the model

  • A safe approach for model selection is to start with a simple model. A proof of concept can be developed and tested for the target solution. This will additionally expose setup issues in the early phase of the project. The feedback and learnings of the proof of concept can then be applied to more complex models.
  • Regularization can help in guarding your model against possible overfitting.
  • The cost of the model training procedure is governed by the training dataset size and the number of model parameters. This cost should be weighed when designing scalable systems.
  • The development environment should reflect the resources and capabilities of the production environment.
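As a toy illustration of the regularization point, the sketch below fits a one-feature linear model with an L2 (ridge) penalty and compares training and validation error. The data points and penalty values are invented for the example, and a single-feature model without an intercept is a deliberate simplification.

```python
def fit_ridge_1d(xs, ys, l2=0.0):
    """Slope w minimizing sum((w*x - y)**2) + l2 * w**2 (no intercept).

    Setting the derivative to zero gives w = sum(x*y) / (sum(x*x) + l2).
    """
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + l2)

def mse(w, xs, ys):
    """Mean squared error of the model y = w * x."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Hypothetical data: a small noisy training set and a held-out validation set.
train_x, train_y = [1.0, 2.0, 3.0], [1.2, 1.9, 3.6]
val_x, val_y = [4.0, 5.0], [4.0, 5.1]

for l2 in (0.0, 1.0, 10.0):
    w = fit_ridge_1d(train_x, train_y, l2)
    print(f"l2={l2:<5} w={w:.3f} "
          f"train_mse={mse(w, train_x, train_y):.3f} "
          f"val_mse={mse(w, val_x, val_y):.3f}")
```

On this toy data, a moderate penalty lowers the validation error relative to the unregularized fit, while a large penalty shrinks the slope too far and underfits. The penalty strength is itself a hyper-parameter to be chosen on held-out data.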

Failure due to the code

The core of ML development is coding, and therefore ML development faces challenges similar to those experienced by conventional software.

Issues with integration

Integration gives the model its final form for consumption by users. Configuration debt might arise because ML systems require ML-specific settings on top of the configuration that any software needs.

Additionally, further enhancements are needed with post-deployment feedback. ML code is just a part of the whole bundle, and supportive code is required to accommodate the data for use by ML code. This practice often results in messy glue code. Also, the incremental addition of new sources to the model can lead to code full of joins, thus causing a pipeline jungle.

Such undesirable side effects would result in code that is prone to bugs and unsuitable for collaboration.

Issues with initial deployment

While monitoring your newly deployed model is a good practice, it needs careful consideration to bear fruit. Failing to understand the key metrics and triggers for performance evaluation can lead to issues going undetected for a long time. Moreover, a feedback loop that adjusts model behavior over time poses a significant challenge when the model's behavior changes inadvertently.
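One simple form of metric plus trigger is a rolling accuracy check that raises a flag once recent performance falls below a threshold. The class name, window size, and threshold below are assumptions chosen for illustration, and the sketch presumes that ground-truth labels eventually arrive for production predictions.

```python
from collections import deque

class RollingAccuracyMonitor:
    """Track accuracy over the most recent labeled predictions and flag drops."""

    def __init__(self, window=100, threshold=0.8, min_samples=20):
        self.outcomes = deque(maxlen=window)  # True where prediction was correct
        self.threshold = threshold
        self.min_samples = min_samples

    def record(self, prediction, actual):
        """Store whether one prediction matched its ground-truth label."""
        self.outcomes.append(prediction == actual)

    def accuracy(self):
        """Accuracy over the current window, or None if nothing was recorded."""
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def should_alert(self):
        """Alert only once enough samples exist to make the estimate meaningful."""
        if len(self.outcomes) < self.min_samples:
            return False
        return self.accuracy() < self.threshold

# Hypothetical stream: five correct predictions followed by four wrong ones.
monitor = RollingAccuracyMonitor(window=10, threshold=0.8, min_samples=5)
for prediction, actual in [(1, 1)] * 5:
    monitor.record(prediction, actual)
alert_before = monitor.should_alert()

for prediction, actual in [(1, 0)] * 4:
    monitor.record(prediction, actual)
alert_after = monitor.should_alert()

print(alert_before, alert_after, monitor.accuracy())
```

The `min_samples` guard is there to avoid alerts driven by a handful of early predictions; which metric and threshold matter is a business decision that the code cannot make for you.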

Issues with code update

An ML model keeps up with the dynamics of the data and the workings of the production environment through updates. While CI/CD in the ML lifecycle has borrowed the strengths of DevOps, the problem of drift is specific to ML projects. Changes in production data can degrade how well the model generalizes, a phenomenon known as concept drift.

The solutions to a failure caused by the code

  • The practice of code reuse borrowed from DevOps can save on efforts and infrastructure. Teams at Pinterest migrated to using universal embeddings for different models. This reuse resulted in simpler deployment pipelines. 
  • Some preprocessing libraries like TensorFlow Transform can help avoid glue code.
  • Adoption of a clean slate approach can help avoid a pipeline jungle. The pipeline can be developed afresh after the model development is stable.
  • Services such as the Censius AI observability platform can help you evaluate model performance and seek explanations, equipping you with the knowledge to make more informed decisions.
  • The monitors configured on observability platforms also help teams detect drifts.
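To give a feel for what a drift monitor computes, here is a minimal sketch of the Population Stability Index (PSI), a common way to compare the distribution of a feature in production against a reference sample. This is not how any particular platform implements its monitors. Note also that PSI measures a shift in the input distribution, which may hint at concept drift but does not measure it directly. The bin count, the `eps` floor, and the 0.25 cut-off (a widely used rule of thumb) are assumptions.

```python
import math

def psi(reference, production, bins=10, eps=1e-6):
    """Population Stability Index between two non-empty samples of one feature.

    Bins are equal-width over the reference range; production values outside
    that range are clamped into the first or last bin.
    """
    if not reference or not production:
        raise ValueError("both samples must be non-empty")
    low, high = min(reference), max(reference)
    width = (high - low) / bins or 1.0  # fall back if the reference is constant

    def shares(values):
        counts = [0] * bins
        for value in values:
            index = int((value - low) / width)
            counts[min(max(index, 0), bins - 1)] += 1
        # Floor empty bins at eps so the logarithm stays defined.
        return [max(count / len(values), eps) for count in counts]

    ref, prod = shares(reference), shares(production)
    return sum((p - r) * math.log(p / r) for r, p in zip(ref, prod))

# Hypothetical feature values: a reference sample and a shifted production sample.
reference = [i / 100 for i in range(100)]
shifted = [value + 0.5 for value in reference]

score_same = psi(reference, reference)
score_shifted = psi(reference, shifted)
print(score_same, score_shifted)

DRIFT_THRESHOLD = 0.25  # rule-of-thumb cut-off, not a universal constant
print("drift suspected:", score_shifted > DRIFT_THRESHOLD)
```

A score like this is cheap to compute per feature on a schedule, which makes it a reasonable first signal before investing in label-based checks.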

Failure due to other factors

Issues due to bias

The training data might contain hidden biases that the model could learn. For instance, some facial recognition systems were biased due to an imbalance of skin color in the training images. Apart from the infringement of ethics, such oversight can lead to misuse of the model in sensitive domains.

Issues due to privacy non-compliance

ML models are driving recommenders and decision-making. Furthermore, they are integral to critical domains like healthcare and finance. With regulations like GDPR, HIPAA, and CCPA ensuring the privacy of individual data, non-compliance may result in your project running into legal tangles. 

The solutions to a failure caused by ethical or legal non-compliance

  • Compliance is necessary not just for ethical acceptance but also to gain the end user’s trust. 
  • Guidelines for quality assurance and compliance can be set to guard against the above issues.
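A quality-assurance guideline for bias can be as simple as reporting a metric per group and flagging large gaps. The sketch below does this for accuracy; the group names, sample records, and the idea of using the largest accuracy gap as the signal are illustrative assumptions, and real fairness audits usually look at several metrics.

```python
from collections import defaultdict

def accuracy_by_group(records):
    """Compute accuracy per group from (group, prediction, actual) tuples."""
    hits, totals = defaultdict(int), defaultdict(int)
    for group, prediction, actual in records:
        totals[group] += 1
        hits[group] += int(prediction == actual)
    return {group: hits[group] / totals[group] for group in totals}

def largest_gap(per_group):
    """Difference between the best and worst group accuracy."""
    values = list(per_group.values())
    return max(values) - min(values) if values else 0.0

# Hypothetical evaluation records for two demographic groups.
records = [
    ("A", 1, 1), ("A", 0, 0), ("A", 1, 1), ("A", 1, 0),
    ("B", 1, 0), ("B", 0, 1), ("B", 1, 1), ("B", 0, 0),
]

per_group = accuracy_by_group(records)
gap = largest_gap(per_group)
print(per_group, gap)
```

A single aggregate accuracy would hide the difference between the two groups here, which is exactly the kind of oversight the section warns about.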


The deployment of an ML model is the start of a journey rife with pitfalls. The trio of influencing factors, namely the data, the model, and the code, governs the quality of your ML project. In this blog, we discussed the challenges these three aspects can throw at you and possible solutions for each.
