Data Versioning for Efficient Workflows with MLFlow and LakeFS
Building resilient, atomic and versioned data lake operations
Introduction
Version Control Systems, such as Git, are essential tools for versioning and archiving source code. Version control keeps track of changes in a codebase: when a change introduces an error, developers can roll back to a known working state and compare it against the broken code. This minimizes disruption to other team members who are likely working on the same code and helps them collaborate efficiently.
Apart from code, data changes too.
Usually, Data Scientists need to access a range of datasets to complete a specific task. From feature engineering to model training, model selection, and hyper-parameter optimization, data is processed and changes along the way. On top of that, experimentation and the observation of results are a prevalent part of a Data Scientist's day-to-day work, which means switching back and forth between specific data formats or versions. As you can imagine, this makes the whole process time-consuming and error-prone.
In this article, we will explore lakeFS, a powerful tool that helps teams manage data the same way they manage code. We will discuss how it tackles data versioning and how it can help data teams build workflows that are more efficient and less prone to errors. Additionally, we will see how it can be used along with MLflow to build a workflow that versions both the data and the trained Machine Learning models.
What is lakeFS
lakeFS is an open-source platform offering a Git-like model that helps teams version and manage data. From complex ETL tasks to data science and analytics, lakeFS provides resilience, atomicity, and manageability over your data lakes. It works seamlessly with all modern data frameworks such as Apache Kafka, Apache…
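To make the Git-like model concrete, here is a minimal, purely illustrative sketch in Python. This is not the lakeFS API; `MiniVersionStore` is a hypothetical toy that identifies every commit by a hash of its contents, so branches are just cheap pointers to commits and any past state can be checked out again, which is the core idea behind Git-style data versioning.

```python
import hashlib
import json

class MiniVersionStore:
    """Toy content-addressed store illustrating the Git-like idea:
    each commit is named by a hash of its data, and a branch is
    merely a pointer to a commit."""

    def __init__(self):
        self.objects = {}               # commit_id -> data snapshot
        self.branches = {"main": None}  # branch name -> latest commit_id

    def commit(self, branch, data):
        # Hash the serialized data to get a stable, content-based id.
        payload = json.dumps(data, sort_keys=True).encode()
        commit_id = hashlib.sha256(payload).hexdigest()[:12]
        self.objects[commit_id] = data
        self.branches[branch] = commit_id
        return commit_id

    def branch(self, name, source="main"):
        # Branching copies no data; it just points at an existing commit.
        self.branches[name] = self.branches[source]

    def checkout(self, branch):
        # Retrieve the snapshot the branch currently points to.
        return self.objects[self.branches[branch]]

store = MiniVersionStore()
store.commit("main", {"rows": 100})
store.branch("experiment")
store.commit("experiment", {"rows": 120})
print(store.checkout("main"))        # main is untouched by the experiment
print(store.checkout("experiment"))
```

Because a branch is only a pointer, experimenting on `experiment` never mutates `main`: this is what lets a Data Scientist try a new transformation and roll back instantly if it goes wrong.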