Why care about open (data) science?
|
Make your data and code available to others
Make your analyses reproducible
Make a sharp distinction between exploratory and confirmatory research
|
Introducing R and RStudio IDE
|
R is a powerful, popular open-source scripting language
You can customize the layout of RStudio, and use the project feature to manage the files and packages used in your analysis
RStudio allows you to run R in an easy-to-use interface and makes it easy to find help
|
Collaborating with GitHub
|
GitHub allows you to synchronise work efforts and collaborate with other scientists on (R) code.
GitHub can be used to publish a custom website on the internet.
Merge conflicts can arise even when you work alone, for example when you edit the same file on different machines.
Merge conflicts arise when collaborators make competing changes to the same lines, and they are a safe way to resolve the discrepancy.
GitHub makes efficient collaboration on data analysis possible.
|
R Basics
|
Effectively using R is a journey of months or years. Still, you don’t have to be an expert to use R, and you can start analyzing your data with about a day’s worth of training.
It is important to understand how data are organized by R in a given object type and how the mode of that type (e.g. numeric, character, logical, etc.) will determine how R will operate on that data.
Working with vectors effectively prepares you for understanding how data are organized in R.
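
As a minimal sketch of the points above, the snippet below shows how the mode of a vector changes what R will do with it. The variable names and values are made up for illustration and are not part of the lesson dataset.

```r
# A numeric vector: arithmetic is applied element by element
read_lengths <- c(150, 151, 149)
mode(read_lengths)
doubled <- read_lengths * 2

# A character vector: the same digits stored as text, so arithmetic would fail
read_labels <- c("150", "151", "149")
mode(read_labels)

# Mixing types in c() coerces every element to a single mode
mixed <- c(1, "a", TRUE)
mode(mixed)

# Text that looks like numbers can be converted back explicitly
converted <- as.numeric(read_labels)
```

Calling mode() (or class()) on an object before operating on it is a cheap way to catch this kind of silent coercion.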
|
Introduction to the example dataset and file type
|
The dataset comes from a real-world experiment in E. coli.
Publicly available FASTQ files can be downloaded from NCBI SRA.
Several steps are taken outside of R/RStudio to create VCF files from FASTQ files.
VCF files store variant calls in a special format.
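
To give a rough idea of what that format looks like, here is a small sketch that parses a two-record VCF-style snippet with base R only. The contig name, positions and qualities are invented for illustration, and a real analysis would more likely use a dedicated VCF package than read.table().

```r
# Hypothetical VCF-style text: '##' meta lines, one '#CHROM' header line,
# then one tab-separated line per variant call
vcf_text <- c(
  "##fileformat=VCFv4.2",
  "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
  "contig_1\t9972\t.\tT\tG\t91\t.\tDP=4",
  "contig_1\t263235\t.\tG\tT\t85\t.\tDP=6"
)

# Separate the meta lines from the header and the records
is_meta <- grepl("^##", vcf_text)
header  <- sub("^#", "", vcf_text[!is_meta][1])
records <- vcf_text[!is_meta][-1]

# Read the records into a data frame, using the header for column names
calls <- read.table(
  text = records,
  sep = "\t",
  col.names = strsplit(header, "\t")[[1]],
  stringsAsFactors = FALSE
)
str(calls)
```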
|
R Basics continued - factors and data frames
|
It is easy to import data into R from tabular formats, including Excel. However, you still need to check that R has imported and interpreted your data correctly.
There are best practices for organizing your data (keeping it tidy), and R is great for this.
Base R has many useful functions for manipulating your data, but all of R’s capabilities are greatly enhanced by software packages developed by the community.
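
The checks mentioned above might look something like the following sketch. It reads a tiny, made-up table from text rather than from a file so that it runs anywhere; the column names and values are assumptions for illustration only.

```r
# Hypothetical tabular data, supplied inline instead of as a file on disk
csv_text <- "sample_id,REF,QUAL
sample_A,T,91
sample_A,G,85
sample_B,T,217"

variants <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Check what R actually imported: dimensions, column types, first values
str(variants)
summary(variants$QUAL)

# Convert a categorical column to a factor and inspect its levels
variants$REF <- factor(variants$REF)
ref_levels <- levels(variants$REF)
ref_levels
```

Running str() right after every import is a quick way to spot a numeric column that was read as text, or the reverse.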
|
Using packages from Bioconductor
|
Bioconductor is an alternative package repository for bioinformatics packages.
Installing packages from Bioconductor uses a different method: BiocManager::install() rather than the install.packages() function used for CRAN.
Check Bioconductor to see if there is a package relevant to your analysis before writing code yourself.
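
A typical installation looks roughly like the sketch below. The package name VariantAnnotation is only an example of a Bioconductor package chosen for illustration; substitute whichever package suits your analysis. Running it requires an internet connection.

```r
# BiocManager itself lives on CRAN, so it is installed the usual way
if (!requireNamespace("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}

# Bioconductor packages are then installed through BiocManager
# ("VariantAnnotation" is an illustrative choice, not a lesson requirement)
BiocManager::install("VariantAnnotation")
```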
|
Data Wrangling and Analyses with Tidyverse
|
Use the dplyr package to manipulate data frames.
Use glimpse() to quickly look at your data frame.
Use select() to choose variables from a data frame.
Use filter() to choose data based on values.
Use mutate() to create new variables.
Use group_by() and summarize() to work with subsets of data.
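
The verbs above can be sketched on a small data frame as follows. The data are invented for illustration (they are not the lesson's variant table), and the example assumes the dplyr package is installed.

```r
library(dplyr)

# Made-up variant calls for two hypothetical samples
variants <- data.frame(
  sample_id = c("sample_A", "sample_A", "sample_B", "sample_B"),
  POS       = c(9972, 263235, 281923, 433359),
  QUAL      = c(91, 85, 217, 64),
  DP        = c(4, 6, 10, 12)
)

# Quick look at column names, types and first values
glimpse(variants)

# Choose columns, then choose rows by value
positions    <- select(variants, sample_id, POS, QUAL)
good_quality <- filter(variants, QUAL >= 80)

# Create a new column derived from an existing one
variants <- mutate(variants, high_quality = QUAL >= 80)

# Summarize each sample separately
quality_by_sample <- variants %>%
  group_by(sample_id) %>%
  summarize(mean_qual = mean(QUAL), n_variants = n())
quality_by_sample
```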
|
Data Visualization with ggplot2
|
|
Producing Reports With knitr
|
Keep reporting and R software together in one document using R Markdown.
Control formatting using chunk options.
knitr can convert R Markdown documents to PDF and other formats.
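
As a rough sketch of these points, the snippet below writes a minimal R Markdown document to a temporary file and renders it. The document content and the chunk label are invented for illustration; the example assumes the rmarkdown package and pandoc are available and skips rendering otherwise.

```r
# A minimal R Markdown document: YAML header, prose, and one code chunk.
# The chunk option echo=FALSE hides the code and shows only its output.
rmd_lines <- c(
  "---",
  "title: \"Variant summary\"",
  "output: html_document",
  "---",
  "",
  "The chunk below runs when the document is knitted.",
  "",
  "```{r qual-summary, echo=FALSE}",
  "summary(c(91, 85, 217))",
  "```"
)
rmd_file <- tempfile(fileext = ".Rmd")
writeLines(rmd_lines, rmd_file)

# Render only if the required tools are present on this machine
can_render <- requireNamespace("rmarkdown", quietly = TRUE) &&
  rmarkdown::pandoc_available()
if (can_render) {
  report <- rmarkdown::render(rmd_file, output_format = "html_document", quiet = TRUE)
  print(report)  # path to the rendered HTML file
}
```

Passing a different output_format, such as "pdf_document", produces other formats, provided the corresponding tools (for example a LaTeX installation) are present.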
|
Getting help with R
|
R provides thousands of functions for analyzing data, and provides several ways to get help.
Using R will mean searching for online help, and there are tips and resources on how to search effectively.
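
A few of the built-in ways to get help are sketched below. The functions looked up are arbitrary examples.

```r
# Open the help page for a function you know by name
?mean
help("read.csv")

# Search installed help pages when you do not know the function name
help.search("standard deviation")

# Show a function's arguments without opening the full help page
args(read.csv)

# Record your R version and loaded packages; include this when asking
# for help online so that others can reproduce your setup
session <- sessionInfo()
session
```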
|