This lesson is in the early stages of development (Alpha version)

Intro to R and Open Science Practices for Biologists: Glossary

Key Points

Why care about open (data) science?
  • Make your data and code available to others

  • Make your analyses reproducible

  • Make a sharp distinction between exploratory and confirmatory research

Introducing R and RStudio IDE
  • R is a powerful, popular open-source scripting language

  • You can customize the layout of RStudio, and use the project feature to manage the files and packages used in your analysis

  • RStudio allows you to run R in an easy-to-use interface and makes it easy to find help

Collaborating with GitHub
  • GitHub allows you to synchronise work and collaborate with other scientists on (R) code.

  • GitHub can be used to publish a custom website on the internet.

  • Merge conflicts can arise even when you work alone, for example between copies of a project on different machines.

  • Merge conflicts arise when collaborators change the same lines, and resolving them is a safe way to handle the discordance.

  • GitHub makes collaboration on data analysis efficient.

R Basics
  • Using R effectively is a journey of months or years. Still, you don’t have to be an expert to use R, and you can start analyzing your data with about a day’s worth of training

  • It is important to understand how data are organized by R in a given object type and how the mode of that type (e.g. numeric, character, logical, etc.) will determine how R will operate on that data.

  • Working with vectors effectively prepares you for understanding how data are organized in R.
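
The points above about vectors and modes can be illustrated with a minimal sketch. The variable names and values are made up for illustration and do not come from the lesson dataset.

```r
# Hypothetical example values, not taken from the lesson dataset
read_depths <- c(4, 10, 7)            # numeric vector
sample_ids  <- c("s1", "s2", "s3")    # character vector
passed_qc   <- c(TRUE, FALSE, TRUE)   # logical vector

# mode() reports how R stores each vector
mode(read_depths)
mode(sample_ids)
mode(passed_qc)

# Arithmetic is applied element by element to numeric vectors
doubled <- read_depths * 2

# Mixing types in one vector coerces every element to a single mode,
# here character, so arithmetic on `mixed` would no longer work
mixed <- c(read_depths, "unknown")
mode(mixed)

# A logical vector can be used to pick elements from another vector
kept <- sample_ids[passed_qc]
```

Checking the mode before computing on an object is a cheap way to catch the silent coercion shown with `mixed`.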

Introduction to the example dataset and file type
  • The dataset comes from a real world experiment in E. coli.

  • Publicly available FASTQ files can be downloaded from NCBI SRA.

  • Several steps are taken outside of R/RStudio to create VCF files from FASTQ files.

  • VCF files store variant calls in a special format.
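
To give a feel for the VCF layout, here is a rough sketch that reads a tiny, made-up VCF held in memory using base R only. The records, chromosome name, and values are invented for illustration, and a dedicated parser is likely preferable for real files.

```r
# Made-up two-record VCF, held in memory for illustration only
vcf_lines <- c(
  "##fileformat=VCFv4.2",
  "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
  "chr1\t100\t.\tA\tG\t50\t.\tDP=10",
  "chr1\t250\t.\tC\tT\t42\t.\tDP=8"
)

# Drop the "##" metadata lines, then strip the leading "#" from the header row
records <- vcf_lines[!grepl("^##", vcf_lines)]
records[1] <- sub("^#", "", records[1])

# Read every column as character so that values such as "T" are never
# guessed to be logical, then convert the position column explicitly
calls <- read.delim(text = records, colClasses = "character")
calls$POS <- as.integer(calls$POS)

str(calls)
```

Each data row is one variant call. The `##` lines carry metadata, and the single `#CHROM` line names the tab-separated columns.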

R Basics continued - factors and data frames
  • It is easy to import data into R from tabular formats including Excel. However, you still need to check that R has imported and interpreted your data correctly

  • There are best practices for organizing your data (keeping it tidy) and R is great for this

  • Base R has many useful functions for manipulating your data, but all of R’s capabilities are greatly enhanced by software packages developed by the community
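
As a small sketch of importing a table and checking how R interpreted it, the example below writes a made-up CSV to a temporary file and reads it back. The file contents and column names are assumptions chosen for illustration.

```r
# Write a small made-up table to a temporary CSV so the example is self-contained
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "sample_id,treatment,depth",
  "s1,control,12",
  "s2,treated,NA",
  "s3,treated,9"
), csv_path)

samples <- read.csv(csv_path, stringsAsFactors = FALSE)

# Check how R interpreted each column before analysing anything
str(samples)
summary(samples$depth)

# Convert a categorical column to a factor explicitly
samples$treatment <- factor(samples$treatment)
levels(samples$treatment)
```

Inspecting the result with `str()` right after import shows column types and missing values before they can affect later steps.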

Using packages from Bioconductor
  • Bioconductor is an alternative package repository for bioinformatics packages.

  • Installing packages from Bioconductor requires a new method, since it is not compatible with the install.packages() function used for CRAN.

  • Check Bioconductor to see if there is a package relevant to your analysis before writing code yourself.
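
One common way to install from Bioconductor goes through the BiocManager package, which is itself installed from CRAN. The sketch below wraps that in a helper; `install_bioc_if_missing` is a hypothetical name introduced here, and the package named in the usage line is only an example.

```r
# Sketch: install a Bioconductor package only if it is not already available.
# `install_bioc_if_missing` is a hypothetical helper, not part of Bioconductor.
install_bioc_if_missing <- function(pkg) {
  if (requireNamespace(pkg, quietly = TRUE)) {
    return(invisible(FALSE))   # already installed, nothing to do
  }
  # BiocManager itself is distributed through CRAN
  if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
  }
  BiocManager::install(pkg)
  invisible(TRUE)
}

# Usage (needs internet access, so it is left commented out):
# install_bioc_if_missing("Biostrings")
```

The helper returns `FALSE` invisibly when the package is already present, so it can be called repeatedly at the top of a script.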

Data Wrangling and Analyses with Tidyverse
  • Use the dplyr package to manipulate data frames.

  • Use glimpse() to quickly look at your data frame.

  • Use select() to choose variables from a data frame.

  • Use filter() to choose data based on values.

  • Use mutate() to create new variables.

  • Use group_by() and summarize() to work with subsets of data.
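
The dplyr verbs listed above can be chained into one pipeline. This is a minimal sketch that assumes dplyr is installed and uses a made-up table with invented column names.

```r
library(dplyr)

# Made-up variant table for illustration; column names are assumptions
variants <- data.frame(
  sample_id = c("s1", "s1", "s2", "s2"),
  pos       = c(100, 250, 100, 300),
  depth     = c(4, 10, 7, 12),
  stringsAsFactors = FALSE
)

glimpse(variants)                         # compact overview of columns and types

per_sample <- variants %>%
  select(sample_id, depth) %>%            # choose variables
  filter(depth >= 5) %>%                  # choose rows based on values
  mutate(log_depth = log10(depth)) %>%    # create a new variable
  group_by(sample_id) %>%                 # define the subsets
  summarize(n_calls = n(), mean_depth = mean(depth))

per_sample
```

Each verb takes a data frame and returns a data frame, which is what lets the steps be piped together in reading order.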

Data Visualization with ggplot2
  • ggplot2 is a powerful tool for high-quality plots

  • ggplot2 provides a flexible and readable grammar to build plots
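
The "grammar" idea can be sketched as data, plus an aesthetic mapping, plus one or more layers. The example assumes ggplot2 is installed and uses made-up data.

```r
library(ggplot2)

# Made-up data for illustration
variants <- data.frame(
  pos       = c(100, 250, 100, 300),
  depth     = c(4, 10, 7, 12),
  sample_id = c("s1", "s1", "s2", "s2")
)

# Data + aesthetic mapping + geometry + labels, combined with `+`
p <- ggplot(variants, aes(x = pos, y = depth, colour = sample_id)) +
  geom_point() +
  labs(x = "Position", y = "Read depth", colour = "Sample")

# print(p) draws the plot; ggsave() could be used to write it to a file
```

Because the plot is an ordinary R object, further layers can be added to `p` later without rebuilding it from scratch.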

Producing Reports With knitr
  • Keep reporting and R software together in one document using R Markdown.

  • Control formatting using chunk options.

  • knitr can convert R Markdown documents to PDF and other formats.
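
The effect of a chunk option can be seen in a very small example. The sketch below assumes knitr is installed, writes a made-up R Markdown document to a temporary file, and knits it to plain Markdown. Conversion to PDF usually involves additional tools and is not shown here.

```r
library(knitr)

fence <- strrep("`", 3)   # three backticks, built here so the example nests cleanly

# A minimal R Markdown document; the chunk option echo=FALSE hides the code
rmd_path <- tempfile(fileext = ".Rmd")
writeLines(c(
  "# Example report",
  "",
  "The mean of the values is computed below.",
  "",
  paste0(fence, "{r mean-value, echo=FALSE}"),
  "mean(c(1, 2, 3))",
  fence
), rmd_path)

# knit() runs the chunks and writes the result as plain Markdown
md_path <- knit(rmd_path, output = tempfile(fileext = ".md"), quiet = TRUE)
report  <- readLines(md_path)
```

With `echo=FALSE`, the knitted report contains the computed value but not the R code that produced it.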

Getting help with R
  • R provides thousands of functions for analyzing data, and provides several ways to get help

  • Using R will mean searching for online help, and there are tips and resources on how to search effectively
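
A few of R's built-in help tools are sketched below. The calls that open a help page are commented out because they are meant for an interactive session. The search terms are examples only.

```r
# Look up documentation by name (opens the help page in an interactive session)
# ?read.csv
# help("read.csv")

# Search installed documentation by keyword when the function name is unknown
# help.search("linear model")

# List objects whose names match a pattern
read_functions <- apropos("^read\\.")
read_functions

# Show a function's arguments without opening the full help page
args(read.csv)
```

Using `apropos()` and `args()` first often narrows things down enough to make an online search more specific.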

Glossary

FIXME