A repository with the material we'll use at the Data Science Summer School Workshop
This is a list with material for further study.
It contains links for:
The following links focus on Reproducible Research.
This is the curriculum on which the workshop was based.
This workshop is also based on the Reproducible Science Curriculum. However, it uses an older version of the program, based on R. The “theoretical” topics are still the same, though.
This is a hands-on tutorial that took place at SciPy 2014. It includes examples of using Docker to create the same environment and Dexy to automate document creation. All the material is available online, and there is also a recording of the entire workshop.
A number of presentations on Reproducible Research
An introduction to programming using Python
Programming Exercises:
A number of great videos and tutorials, for Python and for Data Science in general, can also be found on the Enthought and PyData channels on YouTube:
For example:
To learn more about Git/GitHub, you can check the following.
An introduction to Git, structured specifically for scientists.
It uses tryGit, so you can follow along, even if you don’t have Git installed on your system.
A longer tutorial on Git, from Atlassian. The examples use Bitbucket rather than GitHub; however, the commands are the same.
A free course on Git, from Udacity.
The standard textbook for getting more familiar with Git and GitHub.
A complete (free) course that will provide you with a good working knowledge of the Linux system and basic command line operations.
Here are some resources on general-purpose containers (Docker) and on containers specialized for scientific computing (Singularity).
Docker is a platform based on container virtualization. To try Docker, you can either install it from the Docker Website or use the Play with Docker Website. Play with Docker provides access to a test virtual machine with Docker installed. The machine is destroyed after a few hours. You can use it to follow the workshops presented here.
A very good workshop from Jérôme Petazzoni, at PyCon 2016, showing the basic Docker functionality.
An advanced workshop from Jérôme Petazzoni, at PyCon 2017.
This is a single lecture that goes into more detail on container virtualization.
Singularity containers are containers specialized for HPC and scientific use cases. They are similar to Docker, but better optimized for these use cases.
An article explaining what Singularity containers are and their basic usage.
A YouTube presentation of Singularity containers.
The following two (free) books are considered to be among the best books for Machine Learning, even though “Machine Learning” is not in their titles.
Some general (non-technical) book recommendations:
Who Moved My Cheese, by Dr Spencer Johnson: A book that teaches you how to deal with constant change.
The Dip, by Seth Godin: Thoughts about when to quit and when to stay.
So Good They Can’t Ignore You, by Cal Newport: On why it’s more important to focus on building your skills, and why “Following your passion” is not always good advice.
Deep Work, by Cal Newport: By the same author, recommendations on how to work with more focus and why that is important.
Thinking, Fast and Slow: A lot of information on how we think and behave.
How to Read a Book: Self-explanatory title.
Incerto, by Nassim Nicholas Taleb: A philosophical and practical essay on uncertainty. It is not a “Data Science” book, but it might be the most critical book you read regarding Data Science.
Due to the issues with the Jupyter Notebooks (and the lack of time), we’ll focus more on the other parts of the workshop that are not so heavily dependent on Jupyter.
Discussion on the Forensic Exercise (Organization Exercise 01 html)
Project Organization
Organization Exercise 02 html
Publication & Sharing Exercise html
This is the introductory session on the concept of Reproducible Research.
Open 01_navigation.ipynb and read / evaluate all of its cells.
In this part we’ll talk about how to better organize a project. Your first (practical) task is to follow the instructions from Organization Exercise 01.
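As a rough illustration of what an organized project can look like, the sketch below creates an empty project skeleton with Python's standard library. The folder names and the `create_project` helper are assumptions made for this example; they are not the layout prescribed by the exercise, so adapt them to whatever structure the exercise asks for.

```python
import tempfile
from pathlib import Path

# Hypothetical layout: raw data is kept apart from processed data,
# and code is kept apart from results and documentation.
LAYOUT = ("data/raw", "data/processed", "src", "results", "docs")


def create_project(root, layout=LAYOUT):
    """Create the sub-folders in `layout` under `root` and return their paths."""
    root = Path(root)
    root.mkdir(parents=True, exist_ok=True)
    created = []
    for sub in layout:
        path = root / sub
        # exist_ok=True makes the function safe to run more than once.
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    readme = root / "README.md"
    if not readme.exists():
        # A short README is a common place to describe the project.
        readme.write_text(f"# {root.name}\n")
    return created


if __name__ == "__main__":
    # Demonstrate in a temporary directory so nothing is left behind.
    with tempfile.TemporaryDirectory() as tmp:
        for folder in create_project(Path(tmp) / "my_project"):
            print(folder.relative_to(tmp))
```

Keeping raw data in its own folder, and never editing it by hand, makes it easier to rerun an analysis from scratch later.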
Here, we’ll automate the analysis from the previous step by implementing functions.
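To give an idea of what "implementing functions" can mean here, the following minimal sketch splits a small analysis into a loading step, a summary step, and one function that chains them. The function names, the column name `lifeExp`, and the sample data are all hypothetical and are not taken from the workshop notebooks.

```python
import csv
import io
import statistics


def load_values(source, column):
    """Read one numeric column from an open CSV file (or file-like object)."""
    reader = csv.DictReader(source)
    return [float(row[column]) for row in reader]


def summarize(values):
    """Return a few basic summary statistics for a list of numbers."""
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "min": min(values),
        "max": max(values),
    }


def run_analysis(source, column):
    """Chain the steps so the whole analysis is a single call."""
    return summarize(load_values(source, column))


if __name__ == "__main__":
    # Made-up sample data standing in for a real CSV file.
    sample = io.StringIO("country,lifeExp\nA,70.0\nB,80.0\nC,75.0\n")
    print(run_analysis(sample, "lifeExp"))
```

Once each step is a function, rerunning the analysis on a new file only requires calling `run_analysis` with a different input, instead of re-executing notebook cells by hand.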
Here we’ll see how to export a notebook, share it, and publish our results.
The majority of the material is based on the Reproducible Science Curriculum, which was in turn based on Data Carpentry material for reproducible research.
The opinions expressed in this article are the committer’s own and do not necessarily reflect the views of GWDG.