Replicate: Version control for machine learning

A machine learning model is the combination of code and training data, so knowing which data a model was trained on is essential.

There are two ways to do this:

  1. Store training data in Replicate. This is recommended if your training data is small (<100MB).
  2. Point at data in another system. This is recommended if your training data is large or you already have somewhere to store it.

Store training data in Replicate

If your training data is small, then we recommend storing it with Replicate in each experiment.

For example, if your training data is in a directory training-data/ alongside your training code, then you might write this in your code:

experiment = replicate.init(
    path="training-data/",
    params={...}
)

If you want to store both your training script and training data, you can just save everything:

experiment = replicate.init(
    path=".",
    params={...}
)

Then, to copy this data back into your current directory, use replicate checkout:

$ replicate checkout <experiment ID>

The downside of this approach is that Replicate makes a complete copy of your training data on each experiment. So, this approach only works if your training data is small.

How small "small" is depends on your storage costs and bandwidth, but we'd typically recommend this approach if your data is less than 100MB.

Point at data in another system

If your training data is large, or you already have a system for storing your training data, then we recommend putting a pointer to your training data in the params dictionary.

For example, if your training data is on S3, you might put the URL to your training data in params:

training_data_url = "s3://hooli-training-data/hotdogs-2020-05-03.tar.gz"
experiment = replicate.init(
    path=".",
    params={
        "training_data_url": training_data_url
    }
)
# ... download training_data_url and run training
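Your training script then needs to fetch that archive before training starts. As a minimal standard-library sketch (the helper name is our own, and the commented boto3 call is one possible client, not something Replicate provides), you could split the s3:// URL into a bucket and key for whatever S3 client you use:

```python
from urllib.parse import urlparse

def parse_s3_url(url):
    """Split an s3:// URL into (bucket, key) for use with an S3 client."""
    parsed = urlparse(url)
    if parsed.scheme != "s3":
        raise ValueError(f"expected an s3:// URL, got {url!r}")
    return parsed.netloc, parsed.path.lstrip("/")

bucket, key = parse_s3_url("s3://hooli-training-data/hotdogs-2020-05-03.tar.gz")
# With boto3, for example:
#   boto3.client("s3").download_file(bucket, key, "training-data.tar.gz")
```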

This assumes you are disciplined about versioning your data and that the contents of that URL never change. If the data at the URL might change, you might also want to calculate its shasum and record that in params.

Then, if the data changes, you will see a different shasum in replicate diff, and you will know an experiment was trained on different data.
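Computing that shasum can be done with Python's standard library. Here's a minimal sketch; the helper name and the "training_data_sha256" params key are our own choices, not part of Replicate's API:

```python
import hashlib

def sha256_of_file(path, chunk_size=8192):
    """Hash a file in chunks so large archives aren't read into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# digest = sha256_of_file("hotdogs-2020-05-03.tar.gz")
# experiment = replicate.init(
#     path=".",
#     params={
#         "training_data_url": training_data_url,
#         "training_data_sha256": digest,
#     },
# )
```

Because the digest is stored alongside the URL, two experiments that point at the same URL but different data will show different checksums in replicate diff.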

Note: This documentation is incomplete. We'd love to hear about ways that you are versioning data. See this GitHub issue or chat to us in Discord.

Let’s build this together

Everyone uses version control for software, but it’s much less common in machine learning.

This causes all sorts of problems: people are manually keeping track of things in spreadsheets, model weights are scattered on S3, and results can’t be reproduced. Somebody who wrote a model has left the team? Bad luck – nothing’s written down and you’ve probably got to start from scratch.

So why isn’t everyone using Git? Git doesn’t work well with machine learning. It can’t handle large files, it can’t handle key/value metadata like metrics, and it can’t record information automatically from inside a training script. There are some solutions for these things, but they feel like band-aids.

We spent a year talking to people in the ML community about this, and this is what we found out:

  • We need a native version control system for ML. It’s sufficiently different from normal software that we can’t just put band-aids on existing systems.
  • It needs to be small, easy to use, and extensible. We found people struggling to migrate to “AI Platforms”. We believe tools should do one thing well and combine with other tools to produce the system you need.
  • It needs to be open source. There are a number of proprietary solutions, but something so foundational needs to be built by and for the ML community.

We need your help to make this a reality. If you’ve built this for yourself, or are just interested in this problem, join us to help build a better system for everyone.

Join our Discord chat or get involved on GitHub.



Core team

Ben Firshman

Product at Docker, creator of Docker Compose.

Andreas Jansson

ML infrastructure and research at Spotify.

We also built arXiv Vanity, which lets you read arXiv papers as responsive web pages.
