Data Analysis with Docker

Peter Humburg

19th May 2015

Overview

Introduction
Building a Docker image for data analysis
Using the Docker image
Summary

Introduction

What is Docker?

Platform designed for developing, distributing and running applications.
A Docker image contains an application and all its dependencies.
Can be run anywhere with minimal setup and configuration.

What is Docker? (cont.)

Similar to a Virtual Machine but lightweight, more portable and efficient.
A new Docker image can be based on an existing one, inheriting all its content and functionality.
Multiple images can be linked to create more complex environments.

Using Docker

An application and its dependencies are bundled as a Docker image.
Docker images are distributed via DockerHub or other repositories.
Images are downloaded to host machines (local or in the cloud).
Running a Docker image produces a Docker container that contains the running application.
Multiple containers can be created from the same image.

How is this relevant for data analysis?

Provides stable environment during analysis
- Full control over all software versions used.
- Independent of host system.
Helps to make analysis reproducible
- Docker provides a way to distribute the complete environment used for the analysis.
- This can include all the software, a script to run the analysis and even the data.

Building a Docker image for data analysis

Differential expression / eQTL analysis

Want to create an image that

runs a specified analysis on start-up
generates a report and serves it as a web page (including pdf download)^*
provides interactive access to allow further investigation

^* Using techniques described here

Example

Set up Docker image with support for differential expression and eQTL analysis using R.
Will show parts of relevant Docker file throughout. Complete file is available online.

Dockerfile

A Dockerfile is a plain test file that contains a series of instructions to build a Docker image.
Can use an existing image as base.
Lists instructions to
- install additional software
- configure the software inside the image
- copy files
- …

Choosing a base image

Can either build upon an existing image or start from scratch.
DockerHub provides a large number of images that can serve as a starting point.
For an analysis that relies mostly on R the Bioconductor images are a good starting point.
A minimal Docker file also requires a maintainer field.

FROM bioconductor/release_microarray:latest
MAINTAINER Peter Humburg <peter.humburg@gmail.com>

The Bioconductor images

Bioconductor provides a series of images with different collections of Biocoductor packages installed.
The base image contains R, RStudio and the BiocInstaller package.
The core image contains the base image and basic BioC infrastructure.
Application specific images build on core: flow, microarray, proteomics, sequencing.
All images are available for the development and release versions of Bioconductor.

Customising the image

Install LaTeX, pandoc and a web server.
Add analysis code.
Configure the image to run the analysis and provide access to HTML and PDF versions of resulting report.

Downloading software

Can install software using apt-get or through other channels.

RUN apt-get update -y && apt-get install -y haskell-platform nginx \ 
    lmodern texlive-full libssh-dev  
RUN cabal update && cabal install pandoc

Software not available through a package management system can be downloaded and installed directly.
Install PLINK by downloading the executable.

RUN cd /tmp && wget -q https://www.cog-genomics.org/static/bin/plink150507/plink_linux_x86_64.zip && unzip plink_linux_x86_64.zip && cp plink /usr/local/bin/

Install R packages

Can execute any command in the shell.
Everything installed by previous commands is available.

RUN Rscript -e "biocLite(c('dplyr', 'tidyr', 'devtools', \
        'illuminaHumanv3.db', 'pander', 'ggdendro'))"
RUN Rscript -e "devtools::install_github('hadley/readr')"

Setting-up the analysis

Analysis uses two main files
- Actual analysis contained in RMarkdown file
- Driver script that runs knitr and pandoc, handles errors, copies the completed report to the web server, etc.

Include analysis scripts

Add data and code to the image

COPY data/ /analysis/data/
COPY heatshock_analysis.* default.pandoc /analysis/

Running the analysis

Bioconductor uses Supervisor to run RStudio
Adapt Supervisor config file to also run the analysis on startup.

[program:analysis]
command=/usr/bin/Rscript /analysis/analysis.r
autostart=true
autorestart=false
stdout_logfile=/analysis/log/%(program_name)s.log
stderr_logfile=/analysis/log/%(program_name)s.log

COPY config/supervisored.conf /tmp/
RUN cat /tmp/supervisored.conf >> /etc/supervisor/conf.d/supervisord.conf

Displaying results

Generate report using knitr and pandoc.
Use nginx to serve resulting web page.
May want to restrict access prior to publication.

Using the Docker image

Starting the container

Share host directory with container to store knitr cache to allow cache to persist beyond lifetime of container.
Map ports for RStudio and web site to host ports.

docker run -v /path/to/project/analysis/cache/:/analysis/cache/ \
    -p 9999:80 -p 8787:8787 analysis

Viewing results

Access the HTML report at http://<docker.host.address>:9999
Further explore the data and results interactively through RStudio running at http://<docker.host.address>:8787.

Data Analysis with Docker

Peter Humburg

19th May 2015

Overview

Introduction

What is Docker?

What is Docker? (cont.)

Using Docker

How is this relevant for data analysis?

Building a Docker image for data analysis

Differential expression / eQTL analysis

Example

Dockerfile

Choosing a base image

The Bioconductor images

Customising the image

Downloading software

Install R packages

Setting-up the analysis

Include analysis scripts

Running the analysis

Displaying results

Using the Docker image

Starting the container

Viewing results

Summary

Advantages

Limitations