Introducing MLflow and DVC
MLflow is a framework that plays an essential role in the end-to-end machine learning lifecycle. It helps you track your ML experiments, including your models, datasets, parameters, and hyperparameters, and reproduce them when needed. MLflow also provides a packaging format for reproducible runs on any platform and can hand models off to your choice of deployment tools.
You can also record runs, organize them into experiments, and log additional data using the MLflow Tracking API and UI.
Recommended Reading: How to use MLflow to Track and Structure ML Projects?
It offers several valuable components that support the processes of the lifecycle: training models, storing them, loading them into production, and creating pipelines.
These components include:
- Tracking: This allows you to track experiments to record and compare parameters and results.
- Models: This allows you to manage and deploy models from various ML libraries to various model serving and inference platforms.
- Projects: This allows you to package ML code in a reusable, reproducible form to share with other data scientists or transfer to production.
- Model Registry: This allows you to centralize a model store for managing models’ entire lifecycle stage transitions: from staging to production, with capabilities for versioning and annotating.
- Model Serving: This allows you to host MLflow Models as REST endpoints.
Data Version Control (DVC) is an open-source version control system for machine learning projects, often described as Git for ML. It deals with data versions rather than code versions and helps you handle large models and data files that Git cannot manage. DVC stores information about the different versions of your data so you can track your ML data properly and relate it to your model's performance later. You can define a remote repository to push your data and models to, enabling easy collaboration across team members.
Users do not have to remember manually which model uses which dataset or what actions were performed; DVC handles all of this. It consists of a bundle of tools and processes that track changing versions of data and collections of previous data. DVC repositories contain the files that are under version control, and a distinct state is recorded for each change committed to a data file.
MLflow and DVC usage
The machine learning project lifecycle has evolved over time. Previously, the main focus was on improving the predictive power of ML algorithms, but now developers also pay attention to managing their ML project lifecycle effectively. This includes sharing projects outside the data science team, where the users are not the people who developed them, ensuring the reproducibility of results, and closing the gap between data scientists and the operations team.
So, who can handle all this at once? MLflow is the answer. This MLOps tool brings structure to your ML lifecycle. It provides a straightforward way to simplify and scale the deployment of machine learning models by tracking, reproducing, managing, and deploying them during software development. MLflow takes care of packaging models irrespective of the framework and programming language used, and it handles model management and experiment tracking effectively.
It brings transparency and standardization to training, tuning, and deploying your ML models. The following MLflow components each play their part in the ML process:
- MLflow Tracking: It enables teams to record and query training runs, giving a clear overview of their performance, since all runs are stored centrally in an experiment.
- MLflow Projects: It enables teams to quickly reproduce model versions with their corresponding code on any platform.
- MLflow Model Registry: It enables teams to manage ML model versions, modify lifecycle stages and deploy models to production.
- MLflow Models: It does the packaging of models in a standard format to be served as an endpoint through a REST API. The MLflow models are also compatible with the essential ML libraries.
Recommended Reading: MLflow Best Practices
Data Version Control (DVC)
Data tracking is necessary in any data science workflow, yet managing and tracking datasets is difficult for data scientists. This creates a need for data versioning, which DVC provides. DVC is one of the most convenient tools for data science projects. Here are some of the reasons to use it:
- It makes ML models reproducible and lets you share results among the team.
- It helps to manage the complexity of ML pipelines so that you can train the same model repeatedly.
- It allows teams to maintain version files for referencing ML models and their results quickly.
- It has the full power of Git branches.
- Team members can get confused when datasets are not labeled according to convention; DVC helps label dataset versions properly.
- Users can work on desktops, laptops with GPUs, and cloud resources if they need more memory.
- It aims to eliminate the need for spreadsheets, ad hoc scripts, and shared documents as tools for communicating about data.
- You use push/pull commands to move consistent bundles of ML models, data, and code into production, remote machines, or a colleague's computer.
Weighing the pros and cons
Following are the advantages of MLflow:
- It is easy to set up a model tracking mechanism in MLflow.
- It offers very intuitive APIs for serving.
- It supports the lifecycle stages of data collection, data preparation, model training, and taking the model to production.
- It provides standardized components for each ML lifecycle stage, easing the development of ML applications.
- It can easily integrate with the most popular tools that data scientists use.
- You can deploy MLflow models to various existing tools, such as Amazon SageMaker, Microsoft’s Azure ML, and Kubernetes.
- It helps you save the model along with its parameters and analysis.
- MLflow models give a standard format for machine learning model packaging.
Following are some of the disadvantages of MLflow:
- You can’t easily share experiments or collaborate on them.
- MLflow does not have a multi-user environment.
- Role-based access is not present.
- It lacks advanced security features.
- Adding extra components to models is not automatic and requires manual work.
- It is not easy or ideal for deploying models to different platforms.
Data Version Control (DVC)
Following are the advantages of Data Version Control:
- Along with data versioning, DVC also allows model and pipeline tracking.
- With DVC, you don't need to rebuild previous models or data modeling techniques to achieve the same past state of results.
- It is easy to learn.
- You can share your models via cloud storage, making it easier for teams to perform experiments and optimize the utilization of the shared resources.
- It becomes easy to work with tons of models and data metrics as DVC reduces the effort to differentiate which model was trained with which data version.
- It maintains distinct reports for each iteration.
- DVC works with local files, so it solves the problem of file naming for multiple versions.
Recommended Reading: The importance of Version Control in ML
Following are some of the disadvantages of Data Version Control:
- DVC is tightly coupled with pipeline management, which means that if the team is already using another data pipeline tool, there will be redundancy in maintaining data.
- DVC is lightweight, which means your team might need to manually develop extra features to make it easy to use.
- There is a risk of incorrect pipeline configuration in DVC if your team forgets to add the output file.
- Checking for missing dependencies in DVC is quite challenging.
How best to use MLflow and DVC?
We have already discussed the features, pros, and cons of DVC and MLflow; now the question arises: what is the best way to use them? DVC and MLflow are not mutually exclusive. DVC handles dataset versioning, while MLflow handles ML lifecycle tracking.
The flow goes like this: your code lives in a Git repository, and you initialize the local repository with both Git and DVC. DVC tracks your dataset, Git tracks the small metafiles that DVC produces, and the dataset itself is pushed to remote storage. If you want to access an exact version of the data from your code, you can use the DVC API. Meanwhile, you track the details of the dataset along with your model's metrics in MLflow. Used together, they give you reproducibility for the project as a whole.
Recommended Reading: Why is DVC Better Than Git and Git-LFS in Machine Learning Reproducibility
Here are some more tips for working with MLflow:
Be sure to make MLflow logging optional by building a simple logging switch into your code. This avoids cluttering your MLflow project with incomplete or empty runs while debugging. Every MLflow run captures the Git hash to keep the code version tracked, so it is best to commit all code updates before tracking an experiment to ensure consistency.
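One possible shape for such a switch (the environment variable name and wrapper helpers here are hypothetical, not an MLflow convention):

```python
import os

def mlflow_enabled():
    """Read the logging switch; export MLFLOW_ENABLED=0 while debugging."""
    return os.environ.get("MLFLOW_ENABLED", "1") == "1"

def log_metric(key, value):
    """Log a metric only when the switch is on."""
    if mlflow_enabled():
        import mlflow  # imported lazily so debug runs need no tracking setup
        mlflow.log_metric(key, value)

def log_param(key, value):
    """Log a parameter only when the switch is on."""
    if mlflow_enabled():
        import mlflow
        mlflow.log_param(key, value)
```

With the switch off, the wrappers become no-ops, so debugging sessions never create half-finished runs in your tracking store.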