Towards AI

The leading AI community and content platform focused on making AI accessible to all. Check out our new course platform: https://academy.towardsai.net/courses/beginner-to-advanced-llm-dev

Machine Learning, Software Engineering

Data Versioning for Efficient Workflows with MLflow and lakeFS

Building resilient, atomic and versioned data lake operations

Giorgos Myrianthous
Published in Towards AI
11 min read · Apr 29, 2021

Photo by Hannes Egler on Unsplash

Introduction

Version control systems such as Git are essential tools for versioning and archiving source code. They keep track of every change made to the code. Any change can introduce an error, but with version control developers can roll back to a working state and compare it against the broken code. This minimizes disruption to other team members working on the same codebase and helps them collaborate efficiently.

Apart from code, data changes too.

Data Scientists usually need to access a range of datasets to complete a specific task. From feature engineering to model training, model selection, and hyperparameter optimization, data is processed and changes along the way. On top of that, experimenting and observing results is a routine part of a Data Scientist's day-to-day work, which means they need to switch back and forth between specific data formats or versions. Without tooling to track those versions, the whole process becomes time-consuming and error-prone.

In this article, we will explore lakeFS, a tool that helps teams manage data the same way they manage code. We will discuss how it tackles data versioning and how it can help data teams build workflows that are more efficient and less error-prone. Additionally, we will see how it can be used alongside MLflow to build a workflow that versions both the data and the trained Machine Learning models.
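To make the idea of managing data the same way as code more concrete, here is a minimal toy sketch of Git-like versioning applied to data. It is not the lakeFS API: the `ToyDataRepo` class, its method names, and the `features.csv` contents are all hypothetical, and everything lives in memory purely for illustration. It shows the three operations the analogy relies on: committing a snapshot, branching to experiment in isolation, and reverting to a known-good version.

```python
import copy
import hashlib
import json


class ToyDataRepo:
    """Toy in-memory model of Git-like data versioning (NOT the lakeFS API)."""

    def __init__(self):
        self.commits = {}               # commit id -> snapshot {path: content}
        self.branches = {"main": None}  # branch name -> head commit id
        self.staged = {"main": {}}      # branch name -> uncommitted working state

    def create_branch(self, name, source="main"):
        # A new branch starts from the source branch's last commit.
        head = self.branches[source]
        self.branches[name] = head
        self.staged[name] = copy.deepcopy(self.commits[head]) if head else {}

    def write(self, branch, path, content):
        # Changes stay isolated on this branch's working state.
        self.staged[branch][path] = content

    def commit(self, branch, message):
        # Freeze the working state into an immutable, addressable snapshot.
        snapshot = copy.deepcopy(self.staged[branch])
        payload = json.dumps([self.branches[branch], message, snapshot], sort_keys=True)
        commit_id = hashlib.sha256(payload.encode()).hexdigest()[:12]
        self.commits[commit_id] = snapshot
        self.branches[branch] = commit_id
        return commit_id

    def read(self, ref, path):
        # `ref` may be a branch name or a commit id.
        commit_id = self.branches[ref] if ref in self.branches else ref
        return self.commits[commit_id][path]

    def revert(self, branch, commit_id):
        # Point the branch back at an earlier snapshot.
        self.branches[branch] = commit_id
        self.staged[branch] = copy.deepcopy(self.commits[commit_id])


repo = ToyDataRepo()
repo.write("main", "features.csv", "id,age\n1,34\n")
v1 = repo.commit("main", "raw features")

# Experiment on a branch without touching what other people read from main.
repo.create_branch("experiment")
repo.write("experiment", "features.csv", "id,age_scaled\n1,0.42\n")
v2 = repo.commit("experiment", "scaled features")
```

In this sketch, `main` still serves the raw features after the experiment is committed, and either version can be read back by its commit id. A real system would store data in object storage and track metadata rather than deep-copying contents.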

What is lakeFS

lakeFS is an open-source platform offering a Git-like model that helps teams version and manage data. From complex ETL tasks to data science and analytics, lakeFS provides resilience, atomicity, and manageability over your data lakes. It works with modern data frameworks such as Apache Kafka, Apache…

Written by Giorgos Myrianthous

I strive to build data-intensive systems that are not only functional, but also scalable, cost effective and maintainable over the long term.
