Streamline ML Workflow with DVC & DVCLive

Mastering Data Versioning and Model Experimentation using DVC & DVCLive

ronilpatil
Towards AI


Photo by 夜 咔罗 on Unsplash

The journey may be complex, but each line of code carries the potential to transform ideas into reality.

Repository

Table of Contents:
Introduction
What is DVC & Why do we need it?
Create a project template using cookiecutter
Update the cookiecutter template
Git & DVC initialization
Understand params.yaml & dvc.yaml
Implement logger
Stage I– load data from Google Drive
Stage II– perform train-test split
Stage III– train the model
Stage IV– evaluate the model
The dvc.lock file
Stitch the stages
Run the experiments
Important key points
Conclusion

Introduction 🎬

In this blog, I’m going to cover a partial MLOps workflow (from project initialization to model experimentation). There are a lot of steps in an e2e MLOps workflow; I won’t cover all of them here, but I’ll address them in subsequent blogs. So stay tuned!
Here I won’t go into great detail on the ML itself, otherwise this would get lengthy. My major focus is data versioning and model experimentation using DVC & DVCLive. Grab your coffee, it’s time to dive into the good stuff!

What is DVC & Why do we need it? 📑

DVC stands for Data Version Control. It is a tool designed to manage the complexity of machine learning projects. It focuses on version-controlling large files such as datasets and models, and it is designed to work seamlessly with Git to manage both code and data. While Git is excellent for versioning and tracking changes in source code, it struggles to handle large files efficiently. DVC integrates with Git to overcome this bottleneck.

Git & DVC Collaboration

Git tracks changes and manages the evolution of our machine learning project’s codebase. DVC stores lightweight metafiles in the Git repository. These files contain information about the large datasets, models, and other binary files associated with the project. The actual data files live outside the Git repository, and DVC handles the versioning and linking of these large files. In short, DVC uses Git for managing code and lightweight metadata, while the actual data is stored efficiently in separate remote locations such as Google Drive, Amazon S3, and Azure Blob Storage.

Create a project template using cookiecutter 🌱

A cookie cutter is a template that helps us to create cookies of the same shape quickly. In the programming world, a Cookiecutter template is similar. It’s like a reusable mold for setting up new projects. Instead of starting a project from the beginning every time, we can use a Cookiecutter template to quickly create a basic structure with the files and folders that we need. It saves time and ensures consistency across our projects, just like a cookie cutter helps us to make uniform cookies 🍪.
Activate your virtual env, go to the terminal, and just execute the command below:

cookiecutter -c v1 https://github.com/drivendata/cookiecutter-data-science

It’ll ask for basic details about the project, and that’s it. I’ve attached a snapshot for your reference.

Configuring cookiecutter template (Image by author)

This template creates an awesome folder structure. Take a look at the snapshot:

Cookie-cutter template (Image by author)

Update the cookiecutter template 💠

Based on our project requirements, we need to add a few more files & directories, such as:
params.yaml — to store all the parameters that we’ll use further in our project.
dvc.yaml — here we’ll write the e2e execution workflow in the form of stages.
temp dir — DVC will use it as local remote storage for storing our datasets and ML models. We can give the dir any name.

Once you create the temporary directory, it’s essential to set it up as remote storage using DVC. This enables DVC to use it for storing datasets and models. Instead of relying solely on local storage, remote storage options like Google Drive, Amazon S3, and Azure Blob Storage are also available. Just run the command below to configure local remote storage:
dvc remote add -d any_name temp/
Here,
any_name: the name you want to give to this remote. Replace it with a meaningful name.
temp/: the URL or path to the remote storage. In this case, it’s a local directory named temp/ located under the root directory.

Note — the cookiecutter template adds the data folder to .gitignore by default so that it isn’t part of the version-controlled history. But we want to track it through Git, hence remove it from there.

And we don’t want to track the temp folder (DVC’s local remote storage) or include any of its contents in version control, therefore we’ll add temp/ to .gitignore.

# create yaml file for parameters
touch params.yaml

# create yaml file for writing stages
touch dvc.yaml

# create temp folder, can use any name instead of temp
mkdir temp

# my_remote is remote name used to identify the remote. Must be unique.
dvc remote add -d my_remote directory/path/

Git & DVC initialization 🔄

Once you’ve completed the preceding steps, let’s initialize Git and DVC.

git init     # initialize git
dvc init     # initialize dvc

Now create a repo on GitHub and run the below commands to link this local repo with the remote repo.

git remote add origin https://github.com/user_name/remote_repo_name.git
git branch -M main
git push -u origin main

How DVC works internally has its own complexity, so I’ll cover it in another blog. Stay tuned!

Understand “params.yaml” & “dvc.yaml” 🕵️

Look, params.yaml is a simple configuration file containing parameters and settings that are used by the code to control various aspects of the machine-learning project. We may put hyperparameters, file paths, data sources, configuration flags, and experiment identifiers in it. These parameters can be easily modified without changing the code, providing flexibility and reproducibility in your experiments.

In dvc.yaml we’ll define the complete machine-learning workflow, from data fetching to model evaluation. First, we’ll divide the workflow into multiple stages, then work on each of them individually and stitch them together so that they form a complete pipeline. I’d suggest you go through the DVC documentation; it explains everything very clearly.

Note: I’ve attached params.yaml and dvc.yaml for your reference. Don’t copy-paste the whole files at once and run the experiment, otherwise you won’t be able to understand the workflow and you may get errors. Build them up stage by stage instead.

Let’s create stages 🔗

Implement logger

Look, logging contributes to better code maintainability, faster issue resolution, and overall project reliability. It helps in debugging, troubleshooting, understanding the flow, performance monitoring, and much more. Because we are dealing with several files in this project, it would be difficult to trace down any errors or bugs, therefore it is best practice to log the workflow. I’ve implemented a logger to log the workflow and save the logs at a fixed location.
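For illustration, here’s a minimal sketch of such a logger. The module path, log directory, and log format are my assumptions, not the exact code from the repository:

# src/logger.py -- minimal logging setup (illustrative sketch)
import logging
from pathlib import Path

def create_logger(name: str, log_dir: str = "logs") -> logging.Logger:
    """Return a logger that writes to the console and to logs/<name>.log."""
    Path(log_dir).mkdir(parents=True, exist_ok=True)

    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)

    fmt = logging.Formatter("%(asctime)s - %(name)s - %(levelname)s - %(message)s")

    file_handler = logging.FileHandler(f"{log_dir}/{name}.log")
    file_handler.setFormatter(fmt)
    logger.addHandler(file_handler)

    console_handler = logging.StreamHandler()
    console_handler.setFormatter(fmt)
    logger.addHandler(console_handler)

    return logger

# usage inside a stage script, e.g. make_dataset.py
# logger = create_logger("make_dataset")
# logger.info("train-test split completed")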

Below I’ve added a few snapshots of log files for quick reference.

Log of make_dataset.py (Image by author)
Log of train_model.py (Image by author)

Stage I– load data from Google Drive

Let’s implement code to load the dataset from Google Drive and save it in the data/raw dir. I’ve used the credit card dataset; you can download it from here, save it in Google Drive, and make it public.

# params.yaml

base:
  project: creditcard-project
  target_col: Class

data_source:
  drive: https://drive.google.com/...

load_dataset:
  raw_data: /data/raw
  file_name: creditcard

# dvc.yaml

stages:
  load_dataset:
    cmd: python ./src/data/load_dataset.py
    deps:
      - ./src/data/load_dataset.py
    params:
      - data_source.drive
      - load_dataset.raw_data
      - load_dataset.file_name
    outs:
      # way to read parameters from params.yaml
      - .${load_dataset.raw_data}/${load_dataset.file_name}.csv
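Below is a rough sketch of what load_dataset.py could look like: it reads the Drive link and output paths from params.yaml and dumps the CSV into data/raw. The function name and the assumption that the Drive link is a direct-download URL readable by pandas are mine; check the repository for the actual implementation:

# src/data/load_dataset.py -- illustrative sketch, not the exact repo code
from pathlib import Path

import pandas as pd
import yaml

def load_data(params_path: str = "params.yaml") -> None:
    # read parameters from params.yaml
    with open(params_path) as f:
        params = yaml.safe_load(f)

    drive_url = params["data_source"]["drive"]       # public Google Drive link
    raw_dir = params["load_dataset"]["raw_data"]     # e.g. /data/raw
    file_name = params["load_dataset"]["file_name"]  # e.g. creditcard

    # works if the link is a direct-download URL of a publicly shared file
    df = pd.read_csv(drive_url)

    # mirror the ".${load_dataset.raw_data}" convention used in dvc.yaml
    out_dir = Path("." + raw_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    df.to_csv(out_dir / f"{file_name}.csv", index=False)

if __name__ == "__main__":
    load_data()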

Keep adding the parameters in params.yaml and the stages in dvc.yaml step by step. It’ll help you understand the workflow and stitch the stages together in a better way. Follow the steps below:
– First, push the updated codebase to Git.
– Run dvc repro to execute the stage.
– After successful execution, push the updated dvc.lock and .gitignore files to Git. Please provide relevant commit messages.
– If any changes happen in the datasets or models, track the changes through DVC using dvc push.
Below I’ve added a snapshot of the output for quick reference.

Executing load_dataset stage (Image by author)

Additionally, provide a suitable commit message so that you don’t get into trouble in the future.

Stage II– perform train-test split

Let’s implement code to fetch the data from the data/raw dir, perform a train-test split on it, and store the results in the data/processed dir.

# params.yaml

# add below section in params.yaml
make_dataset:
  test_split: 0.3
  random_state: 42
  processed_data: /data/processed

# dvc.yaml

# add below stage just after load_dataset stage under "stages:" \
# section of dvc.yaml
  make_dataset:
    cmd: python ./src/data/make_dataset.py
    deps:
      - ./src/data/make_dataset.py
      - .${load_dataset.raw_data}/${load_dataset.file_name}.csv # depend on raw data dumped in previous stage
    params:
      - make_dataset.test_split
      - make_dataset.random_state
      - make_dataset.processed_data
      - load_dataset.raw_data
      - load_dataset.file_name
    outs:
      - .${make_dataset.processed_data}/train.csv
      - .${make_dataset.processed_data}/test.csv
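As a reference, here’s a hedged sketch of what make_dataset.py might contain: it reads the raw CSV, splits it with the test_split and random_state values from params.yaml, and writes train.csv and test.csv to data/processed. Names and structure are my assumptions, not the repo’s exact code:

# src/data/make_dataset.py -- illustrative sketch, not the exact repo code
from pathlib import Path

import pandas as pd
import yaml
from sklearn.model_selection import train_test_split

def make_dataset(params_path: str = "params.yaml") -> None:
    with open(params_path) as f:
        params = yaml.safe_load(f)

    raw_dir = "." + params["load_dataset"]["raw_data"]
    file_name = params["load_dataset"]["file_name"]
    processed_dir = Path("." + params["make_dataset"]["processed_data"])

    df = pd.read_csv(f"{raw_dir}/{file_name}.csv")

    # split using the parameters tracked by DVC in params.yaml
    train, test = train_test_split(
        df,
        test_size=params["make_dataset"]["test_split"],
        random_state=params["make_dataset"]["random_state"],
    )

    processed_dir.mkdir(parents=True, exist_ok=True)
    train.to_csv(processed_dir / "train.csv", index=False)
    test.to_csv(processed_dir / "test.csv", index=False)

if __name__ == "__main__":
    make_dataset()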

Again, repeat the same steps:
– First, push the updated codebase to Git.
– Run dvc repro.
– After successful execution, push the updated dvc.lock, .gitignore, and dvclive (if present) folder to Git. Add a meaningful commit message.
– Track the changes through DVC using dvc push. That’s it.
Below I’ve added a snapshot of the output for quick reference.

Executing make_dataset stage (Image by author)

Stage III– train the model

We extracted the data, split it, and now we’re all set to train the model. Let’s train the model and see the magic of DVC.

# params.yaml

# add below code in params.yaml
train_model:
  seed: 42
  n_estimators: 15
  max_depth: 8
  model_loc: /models

# dvc.yaml

# add below stage just after make_dataset stage under "stages:" \
# section of dvc.yaml
  train_model:
    cmd: python ./src/models/train_model.py
    deps:
      - .${make_dataset.processed_data}/train.csv # depend on train.csv from the previous stage
      - ./src/models/train_model.py
    params:
      - train_model.seed
      - train_model.n_estimators
      - train_model.max_depth
      - train_model.model_loc
      - make_dataset.processed_data
      - base.target_col
    outs:
      - .${train_model.model_loc}/model.joblib

In train_model.py you’ll notice that I’ve used dvclive. Let me explain its magic:
– dvclive allows us to track and record the metrics of machine learning experiments. This includes metrics like accuracy, loss, precision, recall, the confusion matrix, or any other custom metrics you want to monitor.
– It integrates seamlessly with DVC, which ensures that the experiment metrics are stored and versioned along with your code and data.
– It provides real-time visualization of the metrics during the training or evaluation process. This can be useful for monitoring the progress of your model and identifying trends or issues as they occur.
– It offers a web-based interface or dashboard where we can view and analyze the recorded metrics. This can enhance the interpretability of our experiments.
– It also helps in maintaining consistent experiment logging practices across different runs, making it easier to compare results and understand the impact of changes to your code or data.

When we execute this code, it’ll store the logged data under the directory (dir) passed to Live(). If not provided, a directory named dvclive will be used by default.
Look, there are multiple methods available in the dvclive package to track parameters, metrics, or plots; here I’ve used two of them, log_param() and log_metric(). I’ve added a snapshot of the dvclive folder for quick reference.

Folder Structure of dvclive (Image by author)
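To make this concrete, here’s a hedged sketch of how train_model.py might wire dvclive in. I’ve assumed a RandomForestClassifier (since params.yaml carries n_estimators and max_depth) and training-set accuracy as the logged metric; the repository’s actual model, metrics, and structure may differ:

# src/models/train_model.py -- illustrative sketch showing the dvclive calls
from pathlib import Path

import joblib
import pandas as pd
import yaml
from dvclive import Live
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

def train(params_path: str = "params.yaml") -> None:
    with open(params_path) as f:
        params = yaml.safe_load(f)

    target = params["base"]["target_col"]
    cfg = params["train_model"]

    train_df = pd.read_csv("." + params["make_dataset"]["processed_data"] + "/train.csv")
    X, y = train_df.drop(columns=[target]), train_df[target]

    model = RandomForestClassifier(
        n_estimators=cfg["n_estimators"],
        max_depth=cfg["max_depth"],
        random_state=cfg["seed"],
    )
    model.fit(X, y)

    # log parameters & metrics; the files land under the "dvclive" folder
    with Live("dvclive") as live:
        live.log_param("n_estimators", cfg["n_estimators"])
        live.log_param("max_depth", cfg["max_depth"])
        live.log_metric("train_accuracy", accuracy_score(y, model.predict(X)))

    model_dir = Path("." + cfg["model_loc"])
    model_dir.mkdir(parents=True, exist_ok=True)
    joblib.dump(model, model_dir / "model.joblib")

if __name__ == "__main__":
    train()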

log_param() will create a params.yaml file under the dvclive folder and track all the parameters logged by log_param(). We can run the dvc params diff command to compare the parameter changes between the current workspace (non-committed) and the previous experiment (committed). We can even compare the changes between any two experiments using their respective commit IDs; just use the command dvc params diff commit_id1 commit_id2. Once you commit the changes to Git, dvc params diff won’t give any output, as it only compares changes between the previously committed experiment and the latest non-committed one.

Parameters Difference (Image by author)

log_metric() will create a metrics.json file under the dvclive folder and track all the metrics logged by log_metric(). Here we get some commands to compare the metrics:

# show metrics of the current experiment
dvc metrics show

# compare metrics of the current workspace and the previous commit
# only shows metrics that have changed
dvc metrics diff

# compare metrics of different commits
dvc metrics diff commit_id1 commit_id2

# show all metrics irrespective of their changes
dvc metrics diff --all
Current Metrics (Image by author)
Metrics Difference (Image by author)

DVCLive saves the parameters/metrics in yaml & json files along with .tsv files. params.yaml and metrics.json store only the latest logs, while the .tsv files store all the logs, appending data as we perform experiments. The yaml & json files, on the other hand, are overwritten after every experiment.

Note: We’ll track the dvclive folder using Git; if we do so, then the parameters, metrics, datasets, models, and codebase stay in sync and are tracked in parallel. Look, Git is tracking our codebase, while large files such as datasets, models, and binary files are tracked by DVC. DVC also stores metadata about the large files in the Git repository, so the large files and the codebase get linked via these metadata files. With dvclive, the model’s parameters and metrics are stored in the Git repository as well, so Git now has all the information: the codebase, dataset location, model location, and their metrics and parameters.

Once we are done with the code, follow the above steps to track the changes in the codebase as well as the data.
Below I’ve added a snapshot of the output for quick reference.

Executing train_model stage (Image by author)

Stage IV– evaluate the model

Let’s evaluate our model and see where it stands.

# dvc.yaml

# add below stage just after train_model stage under "stages:" \
# section of dvc.yaml
  predict_model:
    cmd: python ./src/models/predict_model.py
    deps:
      - ./src/models/predict_model.py
      # my thought is, if train.csv changes then we'll run train_model, but if test.csv changes we'll do prediction
      - .${make_dataset.processed_data}/test.csv
      - .${train_model.model_loc}/model.joblib # depend on model.joblib generated by previous stage
    params:
      - train_model.model_loc
      - make_dataset.processed_data
      - base.target_col
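For reference, a hedged sketch of predict_model.py: it loads the test split and the trained model, then logs evaluation metrics via dvclive. The choice of metrics and the dvclive directory are my assumptions; see the repository for the actual code:

# src/models/predict_model.py -- illustrative sketch, not the exact repo code
import joblib
import pandas as pd
import yaml
from dvclive import Live
from sklearn.metrics import accuracy_score, precision_score, recall_score

def evaluate(params_path: str = "params.yaml") -> None:
    with open(params_path) as f:
        params = yaml.safe_load(f)

    target = params["base"]["target_col"]
    test_df = pd.read_csv("." + params["make_dataset"]["processed_data"] + "/test.csv")
    X, y = test_df.drop(columns=[target]), test_df[target]

    model = joblib.load("." + params["train_model"]["model_loc"] + "/model.joblib")
    preds = model.predict(X)

    # log evaluation metrics so DVC can show/compare them across experiments
    with Live("dvclive") as live:
        live.log_metric("test_accuracy", accuracy_score(y, preds))
        live.log_metric("test_precision", precision_score(y, preds))
        live.log_metric("test_recall", recall_score(y, preds))

if __name__ == "__main__":
    evaluate()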

We’re not adding any new parameters, hence no update to params.yaml. To push the changes to Git & DVC, follow the above-mentioned steps.
Below I’ve added a snapshot of the output for quick reference.

Executing predict_model stage (Image by author)

The “dvc.lock” file

DVC also creates its own lock file, called dvc.lock. This file looks very similar to our dvc.yaml file; the only difference is that after every file name it also stores the file’s hash code and size so that Git and DVC can stay in sync. I’ve added one snapshot for quick reference:

dvc.lock & temp folder-remote storage (Image by author)

Look, the dvc.lock file stores a hash code, and that same hash code is tagged to a file. Every time anything changes, DVC stores a copy of the changed file in remote storage (here it’s the temp folder). Since we track the dvc.lock file after every change/experiment, checking out any commit takes you back to that snapshot of dvc.lock, and running dvc pull fetches exactly the files whose hash codes are recorded in that snapshot. This is how things work.

Stitch the stages 🪡

Look, here I’ve added the actual code of dvc.yaml. In the stages: section you’ll notice that each stage depends (deps:) on the output (outs:) of the previous stage. The stages are connected via deps & outs; this is the way we stitch the stages together and build a complete workflow. While executing the experiment, a stage runs only if any of its deps: have changed, otherwise DVC skips it. This is the beauty of DVC.

Run the experiments️ 🧪

Once we’re done with the codebase, we can easily change any parameter in params.yaml and execute the pipeline. Either compare the metrics through the terminal, or click on DVC and select Show Experiments. It will open the Experiments tab, where you’ll get all the information, from the parameters to the hash code (assigned by DVC) of each file. Below I’ve added a snapshot of the Experiments tab for quick reference.

Image by author

Just explore the panel, and you’ll learn super interesting things.

Important key points 🗝️

– I’ve pushed the codebase to GitHub. Just fork it & start experimenting with it.
– If you want to go back to any experiment, simply check out that commit with "git checkout commit_id" and pull the changes with "dvc pull". That’s it!
– If any changes happen in large files that are under DVC’s eye, DVC will create a separate copy, assign a hash code to it, and store it in remote storage. You may think it’s utilizing more memory; don’t worry about it!
– After every experiment, don’t forget to track the changes through Git and DVC.
– We can also assign a tag to each experiment or major update so that we can easily switch between them and pull the changes.

Conclusion 🤔

Let’s wrap up our exploration of data versioning and model experimentation, knowing that the journey of learning never ends. If this blog has sparked your curiosity or ignited new ideas, follow me on Medium and GitHub, connect with me on LinkedIn, and let’s keep the curiosity alive.

Your questions, feedback, and perspectives are not just welcomed but celebrated. Feel free to reach out with any queries or share your thoughts.

Thank you🙌 &,
Keep pushing boundaries🚀
