What is Version Control?
Version control is the process of maintaining records or snapshots of past versions of software code, tools, platforms, data, or any other digital asset. Software teams use version control tools to track changes in the code and record the evolution of their development process. With version control, developers can easily roll back to earlier, more successful iterations or recover from errors without causing major disruption.
Version control is especially beneficial for team collaboration since it tracks the changes made by individual contributors and allows modifications to be made, and smoothly reversed, without disrupting the workspaces of other contributors.
Why do we need Version Control?
Version control wasn't always the norm. In fact, it was created to ease the growing friction that emerged when large groups of engineers wanted to collaborate. Version control grew up alongside open-source software (OSS), where engineers across the globe had to collaborate on solutions that could absorb many concurrent changes without major disruption to the source code. Version control has seen its own share of evolution. Initially, version control was centralized, meaning the version data was stored on a single server or local domain, making it challenging for developers to work from outside that network.
As a solution to this, distributed version control was established where no single machine was responsible for maintaining the version data. Instead, a host of distributed machines had a copy of the data and the entire version history, making it possible for developers across various networks and locations to successfully sync with the core source code. Today, version control is the norm across all application development processes.
How does machine learning benefit from version control?
Version control has been a phenomenal catalyst in the growth of software applications and is poised to be an even more significant contributor to the growth of machine learning (ML). Unlike traditional software development, the typical ML pipeline is littered with experimental loops, which can get messy if the tracking methods are not vigilant. Here are a few ways in which version control contributes to high-performing ML projects:
- Selecting the best iteration: Out of the thousands of experimental routes an ML project can take, the top-performing one is chosen for production. For this, the developer has to revert to the exact iteration that produced the highest scores. It is critical to note that even a slight change in any ML route can send the results off track, which is one of the biggest drawbacks of manual or non-automated tracking. This is where version control comes in.
Through a version control tool, developers can not only access the metadata of the most rewarding iteration but also re-run similar iterations to compare and analyze why certain routes are more rewarding. Such metadata helps to reproduce the exact datasets, hyperparameter combinations, model sequences, feature sets, and any other detail that can bring back the same performance scores.
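As a minimal sketch of this idea, the winning iteration can be labeled with a Git tag and its exact state restored later. The file name and parameter values below are purely illustrative:

```shell
# Hypothetical sketch: tag the commit behind a top-scoring run,
# then restore that exact state later.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email "dev@example.com"
git config user.name "Dev"

echo "learning_rate: 0.01" > params.yaml       # hyperparameters for run 1
git add params.yaml && git commit -qm "experiment: lr=0.01"

echo "learning_rate: 0.001" > params.yaml      # run 2 scores best
git add params.yaml && git commit -qm "experiment: lr=0.001"
git tag best-run                               # label the winning iteration

echo "learning_rate: 0.1" > params.yaml        # a later, worse run
git add params.yaml && git commit -qm "experiment: lr=0.1"

git checkout -q best-run                       # reproduce the winning state
cat params.yaml                                # -> learning_rate: 0.001
```

Tagging costs nothing at commit time but makes the best iteration addressable by name instead of by a hash buried in the log.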
- Handling disruptions: A machine learning pipeline can be long and take weeks to ideate and build. One bug in the pipeline or failure in a module can wipe out potentially high-performing results. Moreover, failures in the production environment can be fatal to the end-customer experience and disrupt customer relations.
To avoid this, version control tools can be used to record and version every stage in the pipeline so that the developer can pick up from where the pipeline started to underperform. A root-cause analysis will identify the failing module in the pipeline, and the developer can simply mark it and roll back the data and the pipeline state to the previous stage. This also lets the solution fall back to a less disruptive iteration until the issues in the latest iteration are resolved.
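A hedged sketch of such a rollback, assuming the pipeline lives in a Git repository (the module name `pipeline.py` and its contents are hypothetical): the failing file is restored from the last known-good commit while the rest of the working tree is untouched.

```shell
# Illustrative sketch: restore one pipeline module from the last
# known-good commit without disturbing the rest of the tree.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email "dev@example.com"
git config user.name "Dev"

echo "def preprocess(): return 'v1'" > pipeline.py
git add pipeline.py && git commit -qm "pipeline: stable preprocessing"

echo "def preprocess(): raise RuntimeError" > pipeline.py   # bad update
git add pipeline.py && git commit -qm "pipeline: new preprocessing (buggy)"

# Root-cause analysis points at pipeline.py; restore it from the
# previous commit and record the rollback as its own commit.
git checkout -q HEAD~1 -- pipeline.py
git commit -qam "pipeline: roll back preprocessing to stable version"
```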
- Better organization and governance: Any developer or data scientist will attest that the number of iterations in any ML project can get overwhelming within weeks of initiation. Just half a decade ago, when most tracking of data, dependencies, and results in startups and PoC projects was manual, inexperienced teams routinely lost top results under a pile of under-managed experiments.
Version control tools automate most of the tracking and administrative tasks around experiment management. Simple features of such tools can instantly highlight the top-performing assets and supply the details needed to exactly reproduce any experiment out of thousands. Such transparency also makes it easier to set up and maintain AI/ML governance.
- Low impact on production: ML solutions are extremely dynamic and undergo constant changes. This keeps the production or deployment team busy, since they have to roll out minor to major changes every other week. As a result, ML solutions carry a high risk of underperformance and significant downtime when failures or bugs appear in source code updates.
Version control helps to track old and new changes so that the best versions of new features can roll out in planned phases that minimize downtime and have a low impact on model performance. In case of failure, the developer can simply roll back to a more stable iteration in minimal time. This also helps undo unstable changes quickly when the code needs to go live in production as soon as possible. Version control benefits Continuous Integration and Continuous Deployment (CI/CD) by bringing stability to phased releases, with the ability to roll back on failures.
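One safe way to undo an unstable change in code that is already live is `git revert`, which adds a new commit reversing the bad one instead of rewriting published history. A minimal sketch, with an illustrative config file:

```shell
# Sketch: roll back an unstable change with `git revert`, which is
# safe for code that has already been shared or deployed.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email "dev@example.com"
git config user.name "Dev"

echo "threshold = 0.5" > model_config.py
git add model_config.py && git commit -qm "release: stable config"

echo "threshold = 0.99" > model_config.py      # unstable change
git add model_config.py && git commit -qm "release: aggressive threshold"

git revert --no-edit HEAD                      # new commit undoing the bad one
cat model_config.py                            # -> threshold = 0.5
```

Because the rollback is itself a commit, the audit trail stays intact for later analysis.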
Now that we have covered some of the high-level benefits of version control, let’s look at how you can set up version control for your own machine learning projects.
What are the Different Ways of Executing Version Control?
Version control can be carried out through two broad types of systems.
- Distributed Version Control System (DVCS)
DVCS is suited to smaller, lighter projects whose entire version history and source code files can sit on a developer’s local system. The advantage of this system is that it allows the developer to disconnect from a central system and work in an isolated environment outside the internal network. The changes made by the developer can later be merged into the common codebase, which the other members of the team can pull from.
DVCS naturally requires extra storage since a copy of the source code is replicated on every system that needs to make changes. One key benefit of this setup is that the distributed copies reduce the risk of losing code due to a fault in any one machine. An example of a DVCS is Git, one of the most widely adopted version control platforms among startups and teams with small-scale projects.
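The full-history property of a DVCS can be seen in a small sketch: every Git clone replicates the complete version history, so past versions remain inspectable with no network connection. Directory names below are illustrative:

```shell
# Sketch: a Git clone is a full replica, history included.
set -e
work=$(mktemp -d)
cd "$work"
git init -q central
cd central
git config user.email "dev@example.com"
git config user.name "Dev"
echo "v1" > model.py && git add model.py && git commit -qm "first version"
echo "v2" > model.py && git commit -qam "second version"
cd ..

git clone -q central local-copy     # replicates the entire repository
cd local-copy
git log --oneline | wc -l           # counts both commits: history is local
```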
- Centralized Version Control System (CVCS)
CVCS is suited for larger projects that cannot be loaded entirely onto local systems. The entire source code and version history are located in a single centralized storage space, which is more storage-efficient, though it introduces a single point of failure. CVCS is especially useful if several teams need to work on a common project since they can refer to a single source of truth that is space-efficient and easy to locate.
CVCS has possible disadvantages such as team-wide outages, slow networks, and the need for constant connectivity. However, it serves as a much simpler way of storing and updating versions than DVCS.
Some common version control best practices that can be implemented in your ML projects are below:
- Repositories - Different repositories should be created for different experiments and different models. This keeps the version history for each experiment clean and helps future contributors easily follow the steps taken.
- Branches - For evaluation, it is best to create separate branches for different pipeline details, such as a feature or a hyperparameter. To gather all the evaluated features, an integration branch is ideal and can be further used to analyze the model on the feature set.
- Descriptive updates - Each commit/merge should be well labeled and described so that the version history is detailed and legible to other users. In fact, descriptive labels are useful for future reference even to the developer updating the code.
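The branching practice above can be sketched with plain Git: one branch per evaluated feature, later merged into an integration branch, with descriptive commit messages throughout. Branch and file names are hypothetical:

```shell
# Hypothetical sketch: branch per feature, merged into an
# integration branch with descriptive commit messages.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email "dev@example.com"
git config user.name "Dev"
echo "base" > features.txt
git add features.txt && git commit -qm "init: baseline feature set"

git checkout -q -b feature/tfidf      # one branch per evaluated feature
echo "tfidf" >> features.txt
git commit -qam "feature: add TF-IDF vectorizer to feature set"

git checkout -q -                     # back to the default branch
git checkout -q -b integration        # gathers all evaluated features
git merge -q --no-edit feature/tfidf
```

The integration branch then holds the combined feature set, ready for evaluating the model against it.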
Tools for Version Control
One good practice that is often overlooked is to closely analyze and understand your internal infrastructure and align it with a version control tool that matches your requirements. Version control tools allow development teams to keep track of the source code as well as versions of assets such as data, models, and feature sets. The version control tool must also fit well with your existing ecosystem of MLOps tools, if any. Below are some options to choose from:
Git is one of the most widely adopted open-source version control platforms. It is apt for small-scale projects and teams. It has a great UI and high readability, making it user-friendly even for beginners on the team. However, Git lacks some enterprise-level features, which makes it only a partially suitable option despite its wide adoption. Several extensions can be added to Git to customize it; however, the core platform doesn’t support some functionality, such as data tracking or dependency management.
Recommended Reading: Why is DVC Better Than Git and Git-LFS in Machine Learning Reproducibility
Among the Git extensions mentioned above, DVC, or Data Version Control, is one of the most efficient and popular choices for data versioning. As a data version tracker, DVC acts as a good add-on to Git, which on its own only manages code versions. It can integrate with any type of storage and streamlines data files and binary models into a single Git environment. DVC can also track metrics and failed iterations, and it can be automated through DAGs that connect every module in the ML pipeline. DVC offers extensive documentation, which serves as great user support.
Recommended Reading: DVC vs MLflow
Pachyderm is a version control tool with both open source and enterprise editions. It offers a range of features for data version control, pipeline management, and governance. Some such features include enabling support for a range of file types, centralized commits, continuous updates in the master branch, and maintaining a history of updates from contributors for reproducing results.
- Sandbox environments
A sandbox environment is an isolated environment that enables updates and experiments outside the production environment. One such sandbox-type environment is the Jupyter notebook, which allows code to be separated into cells and logs details such as execution time and results in a visually appealing way. However, even though Jupyter notebooks can version code early in an experiment, they are not suited to long experiment runs or projects with multiple experiments, since they cannot track minute changes and their version histories can easily be overwritten.
lakeFS is an open-source version control tool similar to Git in terms of commits and isolated branching. However, it comes with the added ability to use storage backends such as S3 and GCS to scale to petabytes of data. lakeFS offers isolated environments to analyze and experiment with data segments while the rest of the data is unaffected. Additionally, it supports continuous integration and continuous deployment of data to optimize data pipelines and provide high fault tolerance.
The importance of version control in machine learning cannot be denied. The industry has come a long way from a few years ago, when model and data versioning were foreign concepts to corporate teams just beginning to dabble in ML projects. The academic sphere, by contrast, has been much more vigilant about versioning, having experimented with ML for decades. Versioning not only helps retain the best-performing results but also cushions failures by providing rollback options. Moreover, the detailed audit trail of edit history benefits compliance and governance tasks.
Versioning as a task is not complex and can be managed with minimal resources. The challenge arises, however, when the data volume and the number of models grow beyond a manageable limit. At that point, the simple task of logging history and versioning risks getting distorted across multiple collaborators, experiments, and solutions, and it ends up costing more resources than initially estimated. This is where enterprise-quality tools come in: they plug into your ML projects and take care of version control with optimized resources.