The Need for Model Serving
Every ML engineer aims to train high-accuracy models that add value to their organization. However, training an ML model constitutes roughly 20% of the ML lifecycle. The most challenging aspect of the ML pipeline is deploying the model to a production environment, which differs significantly from the environment where the model is trained. Consequently, the deployment phase surfaces most of the real-world complexity.
By some estimates, over 87% of models are never successfully deployed to production. Complexities associated with production include:
- Bulk issues: Training datasets can run to thousands of gigabytes, making them extremely hard to move between environments, especially when data quality is poor. Scalability is another major issue if the model outgrows the available infrastructure. Moreover, large configuration codebases that act as dependencies of the ML codebase occupy a great deal of space unnecessarily.
- Code complexities: Multi-platform experimentation degrades code quality, rendering it unsuitable for production. With poor engineering practices, the configuration codebase can grow even larger than the ML codebase itself. Multiple languages like Python, Scala, and SQL are often used side by side, further complicating the codebase.
- Data drift and real-world bias: Real-world data is dynamic, and static models are unfit to deal with drifting data distributions. A model built on a given data sample may become useless once the underlying distribution shifts.
- Priority identification: Organizations want resilient ML models that perform exceptionally well across diverse conditions, but building and deploying models with that kind of computational muscle incurs massive expenditure. The business landscape demands the highest-performing models at minimal cost, so ML engineers need to rank their priorities before they deploy.
All of the aforementioned production problems can be mitigated by introducing model serving. Model serving is the workflow of deploying a model as a web service, so that other services can communicate with it and use its predictions for decision-making. Model serving simplifies large-scale deployment, enhances scalability, enables multiple models to be deployed simultaneously, and makes billing more cost-effective. One of the best-known model serving tools is Seldon.
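As a concrete sketch of the idea, the handler below shows the request/response shape such a web service exposes. The payload format and the dummy model are illustrative assumptions, not any particular framework's API:

```python
import json

# Hypothetical stand-in for a trained model; a real service would load
# the model artifact from a registry or object store at start-up.
def dummy_model(features):
    return sum(features) / len(features)

def handle_predict(request_body: str) -> str:
    """Parse a JSON request, run the model, and return a JSON response.
    In a real service this function would sit behind an HTTP POST endpoint."""
    payload = json.loads(request_body)
    prediction = dummy_model(payload["features"])
    return json.dumps({"prediction": prediction})
```

Any downstream service that can issue an HTTP request and parse JSON can now consume the model's predictions without knowing anything about how it was trained.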
Model Serving Strategies
There are three main model serving strategies:
Offline Serving (Batch Inference)
Offline serving, or batch inference, generates predictions for a batch of observations on a schedule. These predictions are cached in a database, from which they are served to end-users; the end-users never interact directly with the model. If the end-users belong to the same organization, the high-performance database can be replaced by packets of data and metadata (S3 objects/blobs) that point users to the designated files. The main disadvantage of offline serving is that it is a cold workflow: predictions are only as fresh as the last batch run, which results in longer lead times.
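The batch workflow can be sketched as follows; `score` is a hypothetical stand-in for a trained model, and a plain dictionary stands in for the prediction cache or blob store:

```python
# Hypothetical stand-in for a trained model.
def score(features):
    return 1.0 if sum(features) > 10 else 0.0

def run_batch_job(observations):
    """Offline step (e.g. a nightly job): score every observation
    and cache the results, keyed by user id."""
    return {user_id: score(feats) for user_id, feats in observations.items()}

def serve(cache, user_id):
    """Online step: end-user requests never touch the model,
    only the precomputed cache."""
    return cache.get(user_id)

# One batch run populates the cache for all known users.
cache = run_batch_job({"alice": [4, 9], "bob": [1, 2]})
```

The split between `run_batch_job` and `serve` is the defining property of the strategy: serving latency is a cache lookup, but any user who appears after the last batch run has no prediction until the next one.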
Edge Deployment
Client-server architectures are susceptible to network attacks. Edge deployment takes a step toward serverless computation by serving the model directly on the end user's device (on the edge), thus curtailing the risk of network compromise. It is a relatively new deployment technology that is gradually making headway through frameworks such as Apple's Core ML and Google's TensorFlow.js, but it needs more improvement before it can substitute for traditional model serving. Heterogeneous device and network configurations can cause an edge deployment to fail without prior notice, and it is very difficult to replicate the production landscape in the development phase for testing purposes. Unexpected downtime and the lack of visibility into whether a deployment succeeded are a few of the reasons edge deployment is not yet preferred.
Model as a Service Deployment
The most common method of model serving is deployment as a microservice. An API endpoint of the model is exposed to the client, which communicates with it via POST/GET requests. This is a highly efficient practice for building fully flexible, responsive, and scalable model services. A model is considered online if it can accept user inputs and update itself automatically; online models are more responsive and handle concept drift remarkably well. Most DevOps teams prefer to deploy their microservices on a Kubernetes cluster, which offers immense scalability, minimal downtime, and resilience. The most popular tool for Kubernetes-based model serving is Seldon.
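A minimal end-to-end sketch of this pattern, using only the Python standard library: a tiny HTTP microservice exposes a stand-in model behind a POST endpoint, and a client retrieves a prediction. The `/predict` path, the payload shape, and the `model` function are illustrative assumptions:

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import Request, urlopen

# Hypothetical stand-in for a trained model.
def model(features):
    return max(features)

class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the JSON body of the POST request, run the model,
        # and write the prediction back as JSON.
        length = int(self.headers["Content-Length"])
        payload = json.loads(self.rfile.read(length))
        body = json.dumps({"prediction": model(payload["features"])}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # silence per-request logging for the sketch

# Serve on an ephemeral localhost port in a background thread.
server = HTTPServer(("127.0.0.1", 0), PredictHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# The client only needs the endpoint URL and the payload contract.
url = f"http://127.0.0.1:{server.server_port}/predict"
req = Request(url, data=json.dumps({"features": [3, 1, 4]}).encode(),
              headers={"Content-Type": "application/json"})
prediction = json.loads(urlopen(req).read())["prediction"]
server.shutdown()
```

In production, the handler's role is played by a serving framework and the server runs in a container behind a load balancer, but the client-side contract, a POST with features in and a prediction out, stays the same.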
Seldon Core is an open-source model serving tool that leverages Kubernetes to deploy ML models in production at scale. It is an MLOps tool offering a diverse spectrum of advanced functionality, such as A/B testing, outlier detection, and advanced metrics, that helps transform model files into APIs or microservices. Seldon automates the ML pipeline by packaging only the updated version of the trained model into the appropriate microservice and relaying it to the end-user, with no change required in the configuration codebase. Seldon's robust architecture supports fast, efficient deployment of thousands of models in production. Here are a few of the solutions Seldon Core provides for dealing with production complexities:
- Platform Independence
Seldon Core is available both in the cloud and on-premise. Popular managed Kubernetes services like Azure AKS, DigitalOcean, AWS EKS, and Google Cloud's GKE support Seldon Core for all ML/DL libraries and toolkits, including PyTorch and TensorFlow. Machine learning models are containerized using language wrappers or reusable and non-reusable inference servers.
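With Seldon's Python language wrapper, a model is exposed as a plain class with a `predict()` method, which Seldon packages into a REST/gRPC microservice. Below is a minimal sketch under that convention; the class name is hypothetical, a trivial threshold stands in for a real trained model, and containerization plus the SeldonDeployment resource are omitted:

```python
# Sketch of a Seldon Core Python language-wrapper class. Seldon wraps a
# class like this into a microservice; the predict() hook receives an
# array-like of feature rows.
class IrisClassifier:
    def __init__(self):
        # In practice, the trained model artifact would be loaded here
        # (e.g. with joblib) from the container image.
        self.threshold = 2.5  # illustrative stand-in for a real model

    def predict(self, X, features_names=None):
        # Return one prediction row per input row.
        return [[1 if sum(row) > self.threshold else 0] for row in X]
```

Because the wrapper is a plain class, the same code can be unit-tested locally before it is containerized and deployed to the cluster.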
- Advanced Inference Graphs
Seldon Core supports complex inference graphs composed of routers, combiners, ensembles, and transformers.
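A combiner node in such a graph can be sketched as a class exposing an `aggregate()` hook, following Seldon's Python combiner convention (the exact signature is an assumption here); the element-wise averaging below is an illustrative ensemble, not Seldon's own implementation:

```python
# Sketch of a combiner node: it merges the outputs of several child
# models in the inference graph into a single response.
class AverageCombiner:
    def aggregate(self, Xs, features_names=None):
        # Xs holds one prediction array per child model;
        # average them element-wise to form the ensemble output.
        n = len(Xs)
        return [sum(vals) / n for vals in zip(*Xs)]
```

Routers and transformers follow the same pattern: each graph node is a small class with a well-known hook, and Seldon wires the nodes together at deployment time.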
- Metrics and Audits
Seldon features an Elasticsearch-based workflow to audit all input-output requests to and from the model. Seldon Core provides a fully scalable suite of services for monitoring and logging the model's performance metrics, and integrates with monitoring and observability platforms for richer analysis. Its monitoring capabilities trigger alerting workflows to detect and troubleshoot discrepancies. Model-explanation platforms can also be integrated with Seldon Core to provide quality insights into model behavior.
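As a sketch of how custom metrics can surface from a Seldon Python wrapper: alongside `predict()`, the class can expose a `metrics()` hook whose values Seldon feeds into its monitoring stack. The metric key below is a hypothetical example:

```python
# Sketch of custom metrics from a Seldon Python-wrapper class.
class MonitoredModel:
    def __init__(self):
        self.request_count = 0

    def predict(self, X, features_names=None):
        self.request_count += len(X)
        return [[0] for _ in X]  # stand-in for real predictions

    def metrics(self):
        # Each entry names a metric type, key, and current value;
        # the serving layer scrapes these alongside each request.
        return [{"type": "COUNTER",
                 "key": "predict_requests",
                 "value": self.request_count}]
```

Counters like this are what downstream dashboards and alerting rules consume, so drift in request volume or error rates becomes visible without touching the model code again.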
- Open Tracing
Seldon Core provides infrastructure to trace the model's performance and report average latency metrics. Built-in transparency allows Seldon Core to trace all API calls, and each model can be traced back to its training data. Minimal code is needed to roll out updated models without rewriting complex legacy code.
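The latency bookkeeping behind such tracing can be illustrated with a small wrapper that times each prediction call and reports the running average; in production, a tracing backend aggregates per-call spans like these for you:

```python
import time

# Minimal latency-tracking sketch: wrap any callable, record each
# call's duration, and expose the running average.
class LatencyTracker:
    def __init__(self, fn):
        self.fn = fn
        self.durations = []

    def __call__(self, *args):
        start = time.perf_counter()
        result = self.fn(*args)
        self.durations.append(time.perf_counter() - start)
        return result

    def average_latency(self):
        return sum(self.durations) / len(self.durations)

# A lambda stands in for a real predict function.
traced_predict = LatencyTracker(lambda x: x * 2)
```

Wrapping at the call boundary, rather than inside the model, is the same design choice tracing systems make: the model code stays untouched while every request is measured.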
Recommended Reading: Seldon Core Tutorial
Integrating Seldon Core with MLOps Infrastructure
Seldon is a model serving platform. It integrates with other MLOps components: a model registry for cataloging and managing models, a feature store for streamlining data from warehouses, and an experimentation hub for testing models in a notebook environment before deployment.
Seldon Core is an open-source model serving initiative that deploys ML models to the production environment using Kubernetes as a medium. It offers solutions to the problems posed by real-world aberrations in production: data portability and transmission, configuration-code dependencies, data drift, and cost-effectiveness issues are all addressed by model serving platforms like Seldon. The model can then be served in the production environment, packaged as an API or microservice, using Seldon Core.