Scott Stevenson

Jupyter notebooks and collaboration

Git has become the de facto standard for sharing and collaborating on code, and Jupyter notebooks have become the standard environment for interactive data exploration and modelling. Herein lies a problem: Git was designed to version plain text files containing source code, not structured data such as the JSON source of Jupyter notebooks, nor binary data such as embedded images.

Without extra tooling and processes, this makes best practices, such as making small, self-contained patches on topic branches and submitting them for code review, difficult to follow, and it leaves the resulting diffs messy. In this talk at PyCon UK, I demonstrate the tools and practices that make working with Jupyter notebooks for machine learning more collaborative, more productive, and more fun.

I first show how to use built-in Git features, such as incremental staging of changed files, to avoid introducing noise from changed cell execution counts. I then show how simple tooling can automatically clear output cells from notebooks before new changes are committed to Git, which keeps binary data out of the repository.
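To make the idea of clearing outputs concrete, here is a minimal sketch, not the tooling shown in the talk. It assumes the notebook is nbformat 4 JSON, where each code cell carries an `outputs` list and an `execution_count`; the function name `strip_outputs` is hypothetical.

```python
import copy


def strip_outputs(notebook):
    """Return a copy of a notebook dict with code-cell outputs cleared.

    Assumes nbformat 4 layout: a top-level "cells" list whose code cells
    have "outputs" and "execution_count" keys.
    """
    cleaned = copy.deepcopy(notebook)  # leave the caller's data untouched
    for cell in cleaned.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []            # drops embedded images and text output
            cell["execution_count"] = None  # removes execution-count noise
    return cleaned
```

A function like this could be run from a Git clean filter or a pre-commit hook, so that what gets committed is only the source of each cell. Existing tools such as nbstripout handle the same job more robustly.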

Finally, I introduce nbdime, a set of tools from the Jupyter project for diffing and merging Jupyter notebooks. I demonstrate how to install them and how to configure Git to use them, integrating Jupyter with version control for collaborative machine learning research.
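The sketch below illustrates what a content-aware notebook diff means; it is not how nbdime is implemented. It compares only the source of each cell using the standard library's `difflib`, so changes confined to outputs or execution counts produce no diff at all. The function names are hypothetical, and the notebooks are assumed to be nbformat 4 JSON loaded into dicts.

```python
import difflib


def cell_sources(notebook):
    """Return the source text of every cell, one string per cell."""
    sources = []
    for cell in notebook.get("cells", []):
        src = cell.get("source", "")
        # nbformat stores source either as a string or as a list of lines.
        sources.append("".join(src) if isinstance(src, list) else src)
    return sources


def diff_notebook_sources(old, new):
    """Unified diff of cell sources only, ignoring outputs and metadata."""
    old_lines = [line for s in cell_sources(old) for line in s.splitlines()]
    new_lines = [line for s in cell_sources(new) for line in s.splitlines()]
    return list(
        difflib.unified_diff(old_lines, new_lines, "old", "new", lineterm="")
    )
```

nbdime goes much further than this, since it also diffs outputs and metadata and can merge notebooks. It is typically installed with `pip install nbdime` and hooked into Git with `nbdime config-git --enable --global`.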