by Chasz Griego, Open Science Postdoctoral Associate
Jupyter & Google Colab
Are you writing Python code in a Jupyter notebook and looking for ways to easily share the data and results that accompany it? Are you teaching with Jupyter notebooks and having trouble distributing your teaching material? Maybe you want to collaborate on one notebook as a team? Or maybe you want to publish and disseminate notebooks with reproducible results? If you answered "yes" to any of these questions, I recommend you continue reading. Here I share my findings from a search for literate programming platforms that promote collaborative and reproducible computational work, for both scientific research and educational material.
Literate programming platforms, like Jupyter notebooks, are useful tools for communicating computational science because they let you annotate your findings between code cells, using markdown cells to format text. They are also effective for teaching, because a notebook can act as lecture notes that students follow as you teach while writing their own code in their own copies of the notebook. However, there is one major challenge to using Jupyter notebooks in the classroom: the entire class needs to learn how to install Python, Jupyter, and any libraries you plan to use.
This issue lessens when using Jupyter notebooks through Google Colaboratory (i.e., Colab). Colab is a free literate programming platform from Google Research with free access to computing resources (including GPUs). Colab notebooks are similar to Google Docs or Sheets in that they are saved in your Google Drive and shareable with a link. With Colab, there is no need to install Python or Jupyter, and many common libraries are already installed (you can also list the installed libraries from within a Colab notebook). When it comes to sharing your notebooks and collaborating, this platform is definitely a step up from a basic Jupyter setup.
However, there are some drawbacks to Colab. While some features mirror Google Docs or Sheets, you can't collaborate on a notebook in real time. When sharing a project, you have to make sure all of the necessary data files are shared along with your notebooks. And finally, you don't have as much freedom to build and share a custom Python environment.
If you are planning to disseminate accessible and reproducible notebooks and hope to avoid the limitations of Jupyter and Colab, read on for a short list of platforms that support collaborative and reproducible literate programming. I should note that this is nowhere near a comprehensive list; it highlights only a handful of platforms that are free for academic users.
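To check which libraries a hosted notebook (such as the Colab notebooks mentioned above) already provides, you can query the running Python environment directly. The following is a minimal sketch using only the standard library; the helper name `installed_packages` is my own, and running `pip list` in a notebook cell is an equally reasonable route.

```python
from importlib.metadata import distributions


def installed_packages():
    """Return a dict mapping each installed distribution name to its version."""
    packages = {}
    for dist in distributions():
        name = dist.metadata.get("Name")
        version = dist.metadata.get("Version")
        # Skip entries whose metadata is incomplete (e.g., a broken install).
        if name and version:
            packages[name] = version
    return packages


if __name__ == "__main__":
    # Print the environment in a requirements-style "name==version" format.
    for name, version in sorted(installed_packages().items(), key=lambda kv: kv[0].lower()):
        print(f"{name}=={version}")
```

The output depends entirely on the environment in which the cell runs, so the same code gives different lists on Colab, on another hosted platform, and on a local installation.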
Kaggle is a free online platform and community that publicly hosts code and data from many users. Kaggle is well known for hosting competitions, typically centered around a dataset, where participants compete to build the best machine learning models that may, for instance, predict which passengers survived the Titanic shipwreck. Kaggle may be viewed more as a place for data enthusiasts to work on independent or hobby projects, but the platform makes it easy for anyone to upload and find datasets, share analyses in notebooks, and preserve coding environments. You can find interesting datasets and notebooks from other users, and if you are curious about a particular analysis, you can easily create a copy from the project page and start experimenting.
The notebook interface is also very user-friendly and offers several nice features. You can write code in either R or Python and save up to 20 GB of output from a notebook, such as images or data files. You can also share notebooks with any other user on Kaggle, who can then view or edit the content. Your programming environment is saved with your project, so any user who accesses your notebook will be able to reproduce your results. Like Colab, your notebook environment is already built with the most recent versions of libraries, and Kaggle ensures that these libraries are updated every other week. You can also customize your environment further by installing packages from within your notebook (using pip, for instance), and if you wish to replicate your environment elsewhere, Kaggle writes Dockerfiles for each project.
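Installing a package from within a notebook, as described above, usually amounts to calling pip against the interpreter that is running the notebook. Here is a minimal sketch of that idea; the function names are hypothetical, the package and version in the usage comment are purely illustrative, and in most notebooks a `!pip install` or `%pip install` cell does the same job.

```python
import subprocess
import sys


def pip_install_command(package, version=None):
    """Build the pip command for the interpreter running this notebook."""
    # Pinning a version ("name==version") makes the environment easier to reproduce.
    requirement = f"{package}=={version}" if version else package
    return [sys.executable, "-m", "pip", "install", requirement]


def install(package, version=None):
    """Run pip; raises subprocess.CalledProcessError if the install fails."""
    subprocess.check_call(pip_install_command(package, version))


# Example usage (not run here because it needs network access):
# install("seaborn", "0.13.2")
```

Using `sys.executable` rather than a bare `pip` ensures the package lands in the same environment as the notebook kernel.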
Deepnote is an excellent platform for collaboratively creating literate programming projects. In Deepnote, users have workspaces that store notebooks, data, and any other files within separate project folders. The notebook interface in Deepnote presents a gentler learning curve for users who are new to literate programming. As you add cells to your notebook (which are called blocks in Deepnote), the platform offers many straightforward choices to guide you through your workflow, which is more structured than the typical, open-ended choice between a code or markdown cell. Speaking of which, you don't need to know markdown syntax to create formatted text blocks: Deepnote provides options for a paragraph, heading, list, etc. In addition to Python code blocks, you can select SQL blocks to run queries and see the results displayed in the notebook as a Pandas DataFrame. From there, you can select a chart block that opens a simple toolbox to produce visualizations from a DataFrame. And rest assured, you can still do everything manually, like customizing markdown cells or creating plots with Matplotlib.
In Deepnote, you can also build your environment with a Docker image, either by choosing a file locally or pointing to a hosted Docker image. Many up-to-date libraries are automatically included here as well, and if you install new libraries from your notebook, they are recorded in a file (requirements.txt) that is read by an initialization script (init.ipynb) each time you launch your notebook session. Finally, one of the nicest features of Deepnote, in my opinion, is real-time collaboration. Very much like the apps in Google Workspace, you can invite collaborators to your Deepnote project, grant edit access, and allow others to write comments on individual code blocks or edit a notebook at the same time as you.
Overall, Deepnote is great for facilitating teamwork among individuals with varying levels of expertise, so I'd highly recommend this platform for a classroom setting.
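The requirements.txt pattern described above, where installed libraries are recorded in a file that is read again at the start of each session, can be sketched in a few lines. This is only an illustration of the general idea under my own assumptions, not the platform's actual implementation; the function names and the example package pins are hypothetical.

```python
from pathlib import Path


def record_requirement(requirement, path="requirements.txt"):
    """Append a requirement line to the file unless it is already recorded."""
    req_file = Path(path)
    existing = req_file.read_text().splitlines() if req_file.exists() else []
    if requirement not in existing:
        existing.append(requirement)
        req_file.write_text("\n".join(existing) + "\n")
    return existing


def read_requirements(path="requirements.txt"):
    """Return the requirement lines, ignoring blank lines and comments."""
    lines = Path(path).read_text().splitlines()
    return [ln.strip() for ln in lines if ln.strip() and not ln.strip().startswith("#")]
```

An initialization step could then pass the file to pip (for example, `pip install -r requirements.txt`) so that every new session starts with the same libraries.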
The final platform in this list is my personal favorite, Code Ocean. Code Ocean creates an excellent foundation for accessible and reproducible computational research material. With this platform, project materials are contained in "capsules." These capsules contain four compartments dedicated to metadata, data, code, and environment. In the metadata compartment, authors can edit the details that describe their capsule, especially if the data belongs to a publication. The data compartment contains all data files that are read by your code, and the code compartment contains your scripts or notebooks. You can write code in many different languages, including Python, R, MATLAB, C/C++, Perl, Julia, and Java, and you can launch cloud workstations to do your work in JupyterLab or RStudio, for instance. In the environment compartment, you will find a Dockerfile that updates as you install libraries or make other changes to your environment. Unlike the other platforms mentioned, you start with a clean slate, and your newly created capsules don't have pre-installed libraries. Instead, you tell Code Ocean how you want your environment set up by specifying the programming language, libraries, and library versions. (The system will default to the latest version if you make no specification.)
Much like Kaggle, you can browse through hundreds of public capsules to find an analysis that interests you. From there, you can either perform a reproducible run, which executes the code and reproduces the exact results originally produced by the authors, or you can copy the capsule to make your own modifications and take a deep dive. You can also share a capsule with collaborators, granting them access to view and/or edit the project. Multiple collaborators may work in a capsule at once, but when someone starts editing a component, that component is put into view-only mode for the other participants.
While Code Ocean may not have features for live collaboration, it is gaining much recognition as a platform to host computational research materials that accompany scientific publications. Several journals have partnered with Code Ocean, encouraging authors to prepare digital research material in a capsule. Some of these partners are Nature Methods, Nature Biotechnology, Nature Machine Intelligence, BMC Bioinformatics, Scientific Data, and Genome Biology.
This completes my brief list of useful platforms for collaborative and reproducible literate programming. Thank you for reading, and I hope you were able to learn something new about the places you can go to analyze data in notebooks!