ssgoe2017

A repository with the material we'll use at the Data Science Summer School Workshop

Further Reading

This is a list of material for further study.

It contains links for:

  1. Reproducible Science
  2. Programming
  3. Scientific Python
  4. Git and GitHub
  5. Linux and Linux Command Line
  6. Containers
  7. Machine Learning
  8. General Material
  9. Books

Reproducible Science

The following links focus on Reproducible Research.

Reproducible Science Curriculum

This is the curriculum on which the workshop was based.

iDigBio, Reproducible Science Workshop

This workshop is also based on the Reproducible Science Curriculum. However, it uses an older version of the program, based on R. The “theoretical” topics are still the same, though.

Reproducible Research: Walking the Walk

This is a hands-on tutorial that took place at SciPy 2014. It includes examples of using Docker to create the same environment, and Dexy to automate document creation. All the material is available online, and there is also a recording of the entire workshop.

Reproducible Research – YouTube playlist

A number of presentations on Reproducible Research.

Programming

An introduction to programming using Python

Programming Exercises:

Scientific Python

A number of great videos and tutorials, on Python and on Data Science in general, can also be found on the Enthought and PyData channels on YouTube:

For example:

Git and GitHub

To learn more about Git and GitHub, you can check the following.

Git for Scientists: A Tutorial

An introduction to Git, structured specifically for scientists.

It uses tryGit, so you can follow along, even if you don’t have Git installed on your system.

Atlassian Git Tutorial

A longer tutorial on Git, from Atlassian. The examples use Bitbucket rather than GitHub. However, the commands are the same.

Udacity Git Course

A free course on Git, from Udacity.

ProGit

The standard textbook for getting more familiar with Git and GitHub.

Linux and Linux Command Line

Introduction to Linux, edX course

A complete (free) course that will give you a good working knowledge of the Linux system and basic command line operations.

Containers

Here are some resources on general-purpose containers (Docker) and on containers specialized for scientific computing (Singularity).

Docker

Docker is a platform based on container virtualization. To try Docker, you can either install it from the Docker Website or use the Play with Docker Website. Play with Docker provides access to a test virtual machine with Docker installed. The machine is destroyed after a few hours. You can use it to follow the workshops presented here.

Introduction to Docker

A very good workshop from Jérôme Petazzoni, at PyCon 2016, showing the basic Docker functionality.

Advanced Docker Workshop

An advanced workshop from Jérôme Petazzoni, at PyCon 2017.

Lecture on Container Virtualization

This is a single lecture that goes into more detail on container virtualization.

Singularity

Singularity containers are specialized for HPC and scientific use cases. They are similar to Docker containers, but better optimized for these use cases.

Singularity: Scientific containers for mobility of compute

An article explaining what Singularity containers are and their basic usage.

Singularity: Containers for Science, Reproducibility, and HPC

A YouTube presentation on Singularity containers.

Machine Learning

The following two (free) books are considered to be the best books on Machine Learning, even though “Machine Learning” is not in their titles.

General Material

Books

Some general (non-technical) book recommendations:



Revised Agenda

Due to the issues with the Jupyter Notebooks (and the lack of time), we’ll focus more on the other parts of the workshop that are not so heavily dependent on Jupyter.

16:00 – 16:15

Discussion on the Forensic Exercise (Organization Exercise 01 html)

16:15 – 16:45

Project Organization

16:45 – 17:15

Organization Exercise 02 html

17:15 – 18:15

Publication & Sharing Exercise html


Workshop Agenda

Join the chat at https://gitter.im/ssgoe2017/Lobby

Etherpad Link

S21 Introduction, 11:30 – 12:30

This is the introductory session on the concept of Reproducible Research.

S22 Organization & Data Exploration, 14:00 – 15:30

In this part we’ll talk about how to better organize a project. Your first (practical) task is to follow the instructions from the Organization Exercise 01.

S23 Automation, 16:00 – 17:30

Here, we’ll automate the analysis from the previous step by implementing functions.
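As a rough illustration of what "automating by implementing functions" can look like, here is a minimal sketch. It assumes a small, made-up CSV table; the data, the column names, and the function names (`load_rows`, `clean_rows`, `mean_by_group`, `run_analysis`) are all hypothetical and are not taken from the workshop exercises. The idea is only that each notebook step becomes a small function, and one top-level function chains them together.

```python
import csv
import io
from statistics import mean

# Hypothetical sample data standing in for a real dataset.
SAMPLE_CSV = """country,year,life_exp
A,2000,70.0
A,2010,72.0
B,2000,60.0
B,2010,
"""


def load_rows(text):
    """Parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def clean_rows(rows, column):
    """Drop rows with a missing value in `column` and convert it to float."""
    cleaned = []
    for row in rows:
        if row[column]:
            cleaned.append({**row, column: float(row[column])})
    return cleaned


def mean_by_group(rows, group, column):
    """Return the mean of `column` for each distinct value of `group`."""
    groups = {}
    for row in rows:
        groups.setdefault(row[group], []).append(row[column])
    return {key: mean(values) for key, values in groups.items()}


def run_analysis(text):
    """Chain the steps so the whole analysis is one repeatable call."""
    rows = clean_rows(load_rows(text), "life_exp")
    return mean_by_group(rows, "country", "life_exp")


if __name__ == "__main__":
    print(run_analysis(SAMPLE_CSV))
```

Once the steps are wrapped like this, rerunning the analysis on new data is a single call to `run_analysis`, instead of re-executing notebook cells by hand.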

S24 Publication & Sharing, 17:30 – 18:30

Here we’ll see how to export a notebook, share it, and publish our results.

References

The majority of the material is based on the Reproducible Science Curriculum, which was in turn based on Data Carpentry material for reproducible research.

Disclaimer

The opinions expressed in this article are the committer’s own and do not necessarily reflect the views of GWDG.