What is DVC?
Data Version Control (DVC) is a Git-based version control tool for data and models, launched by Iterative.ai under the Commercial Open Source Software (COSS) model.
DVC simplifies data versioning for ML projects and aims to make experiments reproducible and easy to share. It helps organize and access large datasets efficiently. Its Git-like workflow and code-to-data provenance help track how each ML model evolved.
How Does DVC Help?
DVC addresses the following ML experiment challenges:
- Ensures consistency of all files and metrics to reproduce the experiment or apply it as a baseline for a new iteration
- Uses Git to keep metafiles, making version control of data sets and models easier. DVC supports several external storage options as a remote cache for large files.
- Establishes norms and processes for effective team collaboration and efficient code sharing.
- Aims to replace general-purpose document-sharing tools, such as Excel or Google Docs, and the ad-hoc scripts often used to track and manage model versions.
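The metafile idea in the list above can be illustrated with a short sketch. It is a simplified model of the general technique (content-addressed caching), not DVC's actual implementation or file format: the function names, the `.meta.json` suffix, and the cache layout are all assumptions made for illustration. The small metafile is what would be committed to Git, while the cached copy is what would live in external storage.

```python
import hashlib
import json
import shutil
import tempfile
from pathlib import Path


def content_hash(path: Path) -> str:
    """Return the hex SHA-256 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def cache_path(cache_dir: Path, checksum: str) -> Path:
    """Map a checksum to its location in the cache (hypothetical layout)."""
    return cache_dir / checksum[:2] / checksum[2:]


def track(data_file: Path, cache_dir: Path) -> Path:
    """Copy data_file into the cache and write a small metafile next to it."""
    checksum = content_hash(data_file)
    cached = cache_path(cache_dir, checksum)
    cached.parent.mkdir(parents=True, exist_ok=True)
    if not cached.exists():
        shutil.copy2(data_file, cached)
    # Hypothetical metafile name and schema, chosen for this sketch only.
    metafile = data_file.with_name(data_file.name + ".meta.json")
    metafile.write_text(json.dumps({"path": data_file.name, "sha256": checksum}))
    return metafile


def checkout(metafile: Path, cache_dir: Path) -> Path:
    """Restore the data file described by the metafile from the cache."""
    meta = json.loads(metafile.read_text())
    target = metafile.with_name(meta["path"])
    shutil.copy2(cache_path(cache_dir, meta["sha256"]), target)
    return target


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        data = root / "train.csv"
        data.write_text("a,b\n1,2\n")
        meta = track(data, root / "cache")
        data.unlink()                      # simulate a fresh clone without the data
        restored = checkout(meta, root / "cache")
        print(restored.name, "restored from cache")
```

Because the metafile holds only a file name and a checksum, it stays small however large the dataset is, which is what makes it practical to version in Git.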
Key Features of DVC
Storage and language agnostic
DVC can store data in Microsoft Azure Blob Storage, Amazon S3, Google Cloud Storage, Google Drive, SSH/SFTP, HTTP, or on local disk. Pipelines can be defined with R, Python, notebooks, Scala/Spark, TensorFlow, PyTorch, and other languages and frameworks.
Reproducible
DVC assures reproducibility by consistently maintaining the configuration, input data, and the code used to run an experiment. With a single ‘dvc repro’ command, users can reproduce experiments end-to-end.
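One way to picture this kind of reproducibility check is the sketch below. It is a minimal model under stated assumptions, not DVC's own code: `repro_stage`, the in-memory `lock` dictionary, and the use of SHA-256 are hypothetical choices made for illustration. The idea is that a stage reruns only when the recorded hashes of its dependencies no longer match the files on disk.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Callable, Dict, List


def digest(path: Path) -> str:
    """Hash a dependency's contents so that changes can be detected."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def repro_stage(
    name: str,
    deps: List[Path],
    run: Callable[[], None],
    lock: Dict[str, Dict[str, str]],
) -> bool:
    """Run the stage only if a dependency changed since the last recorded run.

    Returns True if the stage was executed, False if it was skipped.
    """
    current = {str(p): digest(p) for p in deps}
    if lock.get(name) == current:
        return False          # nothing changed, reuse the previous results
    run()
    lock[name] = current      # record the state this run was based on
    return True


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        params = Path(tmp) / "params.txt"
        params.write_text("lr=0.1")
        lock: Dict[str, Dict[str, str]] = {}
        train = lambda: print("training...")
        print(repro_stage("train", [params], train, lock))  # first run executes
        print(repro_stage("train", [params], train, lock))  # unchanged, skipped
        params.write_text("lr=0.01")
        print(repro_stage("train", [params], train, lock))  # changed, executes
```

Recording the dependency state alongside the code is what lets a later checkout of the same commit reproduce the same experiment.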
Metric tracking
DVC manages experiments through Git tags and branches, and tracks metrics so that users can compare runs, follow progress, and pick the best version.
ML pipeline framework
DVC offers a built-in way to connect ML steps into a directed acyclic graph (DAG) and execute the end-to-end pipeline, covering data cleaning, loading, feature engineering, and training.
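The DAG idea can be sketched with Python's standard library. This is an illustration of dependency-ordered execution in general, not of DVC's pipeline engine: the stage names and the `run_pipeline` helper are assumptions, and `graphlib` requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter
from typing import Callable, Dict, List


def run_pipeline(
    stages: Dict[str, Callable[[], None]],
    deps: Dict[str, List[str]],
) -> List[str]:
    """Execute stages so that every stage runs after the stages it depends on.

    `deps` maps a stage name to the names of the stages it depends on.
    TopologicalSorter raises graphlib.CycleError if the graph is not acyclic.
    """
    order = list(TopologicalSorter(deps).static_order())
    for name in order:
        stages[name]()
    return order


if __name__ == "__main__":
    # Hypothetical four-stage pipeline; each stage just reports that it ran.
    stages = {
        name: (lambda n=name: print(f"running {n}"))
        for name in ("load", "clean", "featurize", "train")
    }
    deps = {"clean": ["load"], "featurize": ["clean"], "train": ["featurize"]}
    run_pipeline(stages, deps)
```

Declaring dependencies instead of a fixed script order means a stage can be added or rewired without editing the code that runs the pipeline.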
Compatibility with Git
DVC runs on top of any Git repository and is compatible with GitHub or GitLab. It provides the advantages of a distributed version control system:
- Lock-free
- Local branching
- Versioning