Chapter 1 Getting Started in R

last updated: 2023-10-27

Installing Packages

First things first: click the “Visual” button in the top-left corner of the editor. This makes the document look more like a word processor. You can switch back to “Source” whenever you prefer.

The following code installs the R packages used in this document, if they are not already installed, and then loads them into R. Note that we use the US CRAN repository, but another repository may be more convenient depending on your geographic location.

if (!require("pacman")) install.packages("pacman"); library(pacman)

# the p_load function 
#    A) installs the package if not installed (like install.packages("package_name")),
#    B) loads the package (equivalent of library(package_name))

p_load("tidyverse", # An ecosystem of packages for making life in R easier
       "here", # For locating files easily
       "knitr", # For generating ("knitting") html or pdf files from .Rmd file
       "readr", # For faster and easier reading in files to R
       "pander", # For session info at the end of the document
       "BiocManager", # For installing Bioconductor R packages
       "dplyr" # A key part of the tidyverse ecosystem, has useful functions
       )
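
For readers curious about what p_load is doing, the install-if-missing-then-load logic can be sketched in base R. This is only an illustration, not how pacman is implemented: the helper name `load_or_install` is hypothetical, and the default mirror URL is an assumption that can be swapped for any CRAN mirror closer to you.

```r
# Hypothetical helper (not part of pacman): install a package only if it is
# missing, then attach it. The default repos value is an assumption; any CRAN
# mirror URL can be used instead.
load_or_install <- function(pkgs, repos = "https://cloud.r-project.org") {
  for (pkg in pkgs) {
    # requireNamespace() checks for the package without attaching it
    if (!requireNamespace(pkg, quietly = TRUE)) {
      install.packages(pkg, repos = repos)
    }
    # character.only = TRUE lets library() accept the name stored in a variable
    library(pkg, character.only = TRUE)
  }
  invisible(pkgs)
}

# Demonstrated on packages that ship with R, so nothing is downloaded here
loaded <- load_or_install(c("stats", "utils"))
```

In practice p_load is more convenient, since it accepts many packages in a single call.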

1.1 Exercise Description

This activity is intended to familiarize you with using RStudio and the R ecosystem to analyze genomic data.

1.2 Learning outcomes

At the end of this exercise, you should be able to:

  • open, modify, and knit an Rmd file to a pdf/html output
  • relate Rmarkdown to a traditional lab notebook
  • run commands in an Rmarkdown file

1.3 Using R and RStudio

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

When you click the Knit button, a document will be generated that includes both the content and the output of any embedded R code chunks. You can embed an R code chunk like this:

# print a statement
print("R code in a .Rmd chunk works just like a script")
## [1] "R code in a .Rmd chunk works just like a script"
# perform basic calculations
2+2
## [1] 4
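
In the .Rmd source, a chunk is marked by a line of three backticks followed by {r}, and closed by a line of three backticks. As a rough sketch, the code below writes a minimal .Rmd file to a temporary folder so you can open it and see that structure. The file name and title are made up for illustration, and the commented-out `rmarkdown::render()` call assumes the rmarkdown package and pandoc are installed.

```r
# Build the chunk fence programmatically (three backticks)
fence <- strrep("`", 3)

rmd_lines <- c(
  "---",
  'title: "Chunk demo"',
  "output: html_document",
  "---",
  "",
  "Text outside a chunk is ordinary Markdown.",
  "",
  paste0(fence, "{r}"),   # opening line of an R chunk
  "# R code between the fences is run when the file is knit",
  "2 + 2",
  fence                   # closing line of the chunk
)

# Hypothetical file name, written to R's temporary directory
rmd_path <- file.path(tempdir(), "chunk_demo.Rmd")
writeLines(rmd_lines, rmd_path)

# Knitting from the console instead of using the Knit button
# (assumes the rmarkdown package and pandoc are installed):
# rmarkdown::render(rmd_path)
```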

R is a useful tool for analyzing data. Let’s download a data file from GitHub to work with. First, we will download the file manually and open it. Later, we will load the same file directly from its URL.

  • Click here to open the file in GitHub and click the download icon to download it to your computer.

  • Use the “Import Dataset” button in the Environment panel of RStudio to open the file browser and select the downloaded file.

    • You’ll want to use the “From text (readr)…” option

    • Adjust settings to make sure the file loads in properly.

    • Copy the code that the Import Dataset feature provides for reading in the file and paste it in the code chunk below.

# insert here the code used to load the file in from your computer

1.4 Load data directly from the URL

Rather than downloading the file manually and then loading it from wherever we saved it, we can load it directly from the URL, as shown below. A word of caution: this won’t work with every URL, and there is no guarantee that a URL will keep working in the future.

# assign url to a variable
DE_data_url <- "https://raw.githubusercontent.com/clstacy/GenomicDataAnalysis_Fa23/main/data/ethanol_stress/msn2-4_mutants_EtOH.txt"

# download the data from the web
DE_results_msn24_EtOH <-
  read_tsv(file=DE_data_url)
## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
##   dat <- vroom(...)
##   problems(dat)
## Rows: 5756 Columns: 18
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr  (3): Gene ID, Common Name, Annotation
## dbl (15): logFC: YPS606 (WT) EtOH response, Pvalue: YPS606 (WT) EtOH respons...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
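
The warning above suggests calling problems() to see which values readr could not parse. The sketch below shows the idea on a tiny, made-up file rather than the real dataset; the file contents and column names are invented for illustration. It also declares the column types up front, which is one way to quiet the column specification message.

```r
library(readr)

# Write a tiny, made-up tab-separated file; the value "abc" is deliberately bad
demo_path <- tempfile(fileext = ".tsv")
writeLines(c("gene\tlogFC", "geneA\t1.5", "geneB\tabc", "geneC\t-0.3"), demo_path)

# Declaring column types also silences the column specification message.
# Expect a parsing warning here because "abc" is not a number.
demo <- read_tsv(
  demo_path,
  col_types = cols(gene = col_character(), logFC = col_double())
)

# One row per value that could not be parsed as the declared type
parse_issues <- problems(demo)
parse_issues

# The unparseable value is stored as NA
demo$logFC
```

For the real data, the equivalent call would be `problems(DE_results_msn24_EtOH)`.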

Remember that read_tsv() comes from the readr package (part of the tidyverse we loaded above). If that package is not (1) installed and (2) loaded into your R session, the call will fail. Thankfully, the p_load function takes care of both steps in one call.
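
Because a URL may stop working later, one common precaution is to keep a local copy and download only when that copy is missing. The following is a minimal sketch of that idea in base R. The helper name `fetch_once` and the local path in the usage comment are hypothetical, not part of the course materials.

```r
# Hypothetical helper: download a file once and reuse the local copy afterwards
fetch_once <- function(url, dest) {
  if (!file.exists(dest)) {
    # Create the destination folder if needed, then download
    dir.create(dirname(dest), recursive = TRUE, showWarnings = FALSE)
    download.file(url, destfile = dest, mode = "wb")
  }
  # Return the local path either way, so it can be passed to a reader function
  dest
}

# Usage (assumed local path; needs an internet connection on the first run):
# local_copy <- fetch_once(DE_data_url, file.path("data", "msn2-4_mutants_EtOH.txt"))
# DE_results_msn24_EtOH <- read_tsv(local_copy)
```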

1.5 Working with data in R

To get a quick summary of our data and how it is structured, we can use glimpse():

# take a quick look at how the data is structured
glimpse(DE_results_msn24_EtOH)
## Rows: 5,756
## Columns: 18
## $ `Gene ID`                                <chr> "YMR105C", "YML100W", "YER053…
## $ `Common Name`                            <chr> "PGM2", "TSL1", "PIC2", "NCE1…
## $ Annotation                               <chr> "Phosphoglucomutase", "Large …
## $ `logFC: YPS606 (WT) EtOH response`       <dbl> 7.5999973, 7.7618280, 6.69400…
## $ `Pvalue: YPS606 (WT) EtOH response`      <dbl> 9.40e-38, 1.04e-35, 3.03e-39,…
## $ `FDR: YPS606 (WT) EtOH response`         <dbl> 3.26e-35, 1.54e-33, 2.07e-36,…
## $ `logFC: YPS606 msn2/4ΔΔ EtOH response`   <dbl> 0.78481798, 0.60949852, 1.735…
## $ `Pvalue: YPS606 msn2/4ΔΔ  EtOH response` <dbl> 3.430000e-06, 8.401730e-04, 4…
## $ `FDR: YPS606 msn2/4ΔΔ  EtOH response`    <dbl> 7.420000e-06, 1.398507e-03, 2…
## $ `logFC: WT v msn2/4ΔΔ: EtOH response`    <dbl> -6.815179, -7.152329, -4.9580…
## $ `Pvalue: WT v msn2/4ΔΔ: EtOH response`   <dbl> 6.34e-32, 2.53e-30, 1.35e-27,…
## $ `FDR: WT v msn2/4ΔΔ: EtOH response`      <dbl> 3.65e-28, 7.28e-27, 2.59e-24,…
## $ `logFC: WT v msn2/4ΔΔ: unstressed`       <dbl> -0.144061475, -0.365016862, -…
## $ `Pvalue: WT v msn2/4ΔΔ: unstressed`      <dbl> 0.350436027, 0.041423492, 0.4…
## $ `FDR: WT v msn2/4ΔΔ:unstressed`          <dbl> 0.998531082, 0.998531082, 0.9…
## $ `logFC: WT v msn2/4ΔΔ: EtOH absolute`    <dbl> -6.959241, -7.517346, -5.0845…
## $ `Pvalue: WT v msn2/4ΔΔ: EtOH absolute`   <dbl> 8.55e-37, 2.04e-35, 3.06e-36,…
## $ `FDR: WT v msn2/4ΔΔ: EtOH absolute`      <dbl> 1.64e-33, 1.96e-32, 3.52e-33,…

The output shows that the data has 5756 rows and 18 columns. The same information is available in the Environment panel of RStudio.
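
The row and column counts can also be retrieved directly in code, which is handy when you want to use them in a later calculation. A small sketch on a made-up data frame (the values are invented for illustration):

```r
# A made-up data frame with 3 rows and 2 columns
toy <- data.frame(
  gene  = c("geneA", "geneB", "geneC"),
  logFC = c(1.5, -0.3, 2.1)
)

dim(toy)   # number of rows, then number of columns
nrow(toy)  # rows only
ncol(toy)  # columns only
```

The same functions work on the real data, for example `dim(DE_results_msn24_EtOH)`.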

1.6 Looking at Data in RStudio

If we want to take a closer look at the data, we have a few options. To see just the first few lines we can run the following command:

head(DE_results_msn24_EtOH)
## # A tibble: 6 × 18
##   `Gene ID` `Common Name` Annotation                      logFC: YPS606 (WT) E…¹
##   <chr>     <chr>         <chr>                                            <dbl>
## 1 YMR105C   PGM2          Phosphoglucomutase                               7.60 
## 2 YML100W   TSL1          Large subunit of trehalose 6-p…                  7.76 
## 3 YER053C   PIC2          Mitochondrial copper and phosp…                  6.69 
## 4 YPR149W   NCE102        Protein involved in regulation…                  0.714
## 5 YKL035W   UGP1          UDP-glucose pyrophosphorylase …                  4.42 
## 6 YLR258W   GSY2          Glycogen synthase                                7.52 
## # ℹ abbreviated name: ¹​`logFC: YPS606 (WT) EtOH response`
## # ℹ 14 more variables: `Pvalue: YPS606 (WT) EtOH response` <dbl>,
## #   `FDR: YPS606 (WT) EtOH response` <dbl>,
## #   `logFC: YPS606 msn2/4ΔΔ EtOH response` <dbl>,
## #   `Pvalue: YPS606 msn2/4ΔΔ  EtOH response` <dbl>,
## #   `FDR: YPS606 msn2/4ΔΔ  EtOH response` <dbl>,
## #   `logFC: WT v msn2/4ΔΔ: EtOH response` <dbl>, …

This can be difficult to read. To browse the data in a spreadsheet-like view, similar to Excel, click the name of the data frame in the Environment panel in the top-right corner of RStudio. We can also open the same view by running View() on the object, for example View(DE_results_msn24_EtOH). To open the data in a new window, click the “pop out” button next to “filter” just above the opened dataset.
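
Another way to make wide output easier to read in the console is to print only a few columns at a time. A minimal base R sketch on a made-up data frame follows; the values are invented, and `check.names = FALSE` is used so that column names containing spaces are kept as written.

```r
# Made-up data with column names that contain spaces, like the real dataset
toy <- data.frame(
  `Gene ID`     = c("geneA", "geneB", "geneC"),
  `Common Name` = c("nameA", "nameB", "nameC"),
  Annotation    = c("made-up note A", "made-up note B", "made-up note C"),
  logFC         = c(1.5, -0.3, 2.1),
  check.names   = FALSE
)

# Keep three columns by name, then show the first two rows
narrow <- head(toy[, c("Gene ID", "Common Name", "logFC")], n = 2)
narrow
```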

1.7 Exploring the data

This dataset includes the log fold changes of gene expression in an experiment testing the ethanol stress response for the YPS606 strain of S. cerevisiae and an msn2/4ΔΔ mutant. There are also additional columns of metadata about each gene. In later classes, we will cover these columns in detail, but we can already start answering questions.

Using RStudio, answer the following questions:

  1. How many genes are included in this study?

  2. Which gene has the highest log fold change in the msn2/4ΔΔ mutant EtOH response?

  3. How many HSP genes are differentially expressed (FDR < 0.01) in unstressed conditions for the mutant?

  4. Do the genes with the largest magnitude fold changes have the smallest p-values?

  5. Which isoform of phosphoglucomutase is upregulated in response to ethanol stress? Do you think msn2/4 is responsible for this difference?
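
Questions like these can be answered by sorting and filtering in the data viewer, or in code. As a hedged sketch of the kinds of dplyr operations that may help, the example below uses a small made-up table: the gene names and numbers are invented, and the short column names `logFC` and `FDR` stand in for the longer names in the real dataset.

```r
library(dplyr)

# Made-up data; names and values are for illustration only
toy <- data.frame(
  `Common Name` = c("HSPdemo1", "demoA", "HSPdemo2", "demoB"),
  logFC         = c(2.5, 7.6, -1.2, 0.1),
  FDR           = c(0.001, 0.0001, 0.2, 0.9),
  check.names   = FALSE
)

# How many genes (rows) are there?
n_genes <- nrow(toy)

# Which gene has the highest log fold change?
top_gene <- toy |>
  slice_max(logFC, n = 1) |>
  pull(`Common Name`)

# How many genes whose name starts with "HSP" pass an FDR cutoff?
n_hsp_hits <- toy |>
  filter(startsWith(`Common Name`, "HSP"), FDR < 0.01) |>
  nrow()

# Do larger fold changes go with smaller FDR values?
# A negative rank correlation would suggest so.
rank_cor <- cor(abs(toy$logFC), toy$FDR, method = "spearman")
```

When working with the real data, remember to wrap column names that contain spaces or punctuation in backticks.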

Be sure to knit this file into a pdf or html file once you’re finished.


System information for reproducibility:

pander::pander(sessionInfo())

R version 4.3.1 (2023-06-16)

Platform: aarch64-apple-darwin20 (64-bit)

locale: en_US.UTF-8||en_US.UTF-8||en_US.UTF-8||C||en_US.UTF-8||en_US.UTF-8

attached base packages: stats, graphics, grDevices, utils, datasets, methods and base

other attached packages: BiocManager(v.1.30.22), pander(v.0.6.5), knitr(v.1.44), here(v.1.0.1), lubridate(v.1.9.3), forcats(v.1.0.0), stringr(v.1.5.0), dplyr(v.1.1.3), purrr(v.1.0.2), readr(v.2.1.4), tidyr(v.1.3.0), tibble(v.3.2.1), ggplot2(v.3.4.4), tidyverse(v.2.0.0) and pacman(v.0.5.1)

loaded via a namespace (and not attached): sass(v.0.4.7), utf8(v.1.2.3), generics(v.0.1.3), stringi(v.1.7.12), hms(v.1.1.3), digest(v.0.6.33), magrittr(v.2.0.3), evaluate(v.0.22), grid(v.4.3.1), timechange(v.0.2.0), bookdown(v.0.36), fastmap(v.1.1.1), rprojroot(v.2.0.3), jsonlite(v.1.8.7), fansi(v.1.0.5), scales(v.1.2.1), codetools(v.0.2-19), jquerylib(v.0.1.4), cli(v.3.6.1), crayon(v.1.5.2), rlang(v.1.1.1), bit64(v.4.0.5), munsell(v.0.5.0), withr(v.2.5.1), cachem(v.1.0.8), yaml(v.2.3.7), parallel(v.4.3.1), tools(v.4.3.1), tzdb(v.0.4.0), colorspace(v.2.1-0), curl(v.5.1.0), vctrs(v.0.6.4), R6(v.2.5.1), lifecycle(v.1.0.3), bit(v.4.0.5), vroom(v.1.6.4), pkgconfig(v.2.0.3), pillar(v.1.9.0), bslib(v.0.5.1), gtable(v.0.3.4), glue(v.1.6.2), Rcpp(v.1.0.11), xfun(v.0.40), tidyselect(v.1.2.0), rstudioapi(v.0.15.0), htmltools(v.0.5.6.1), rmarkdown(v.2.25) and compiler(v.4.3.1)
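
The object returned by sessionInfo() can also be inspected piece by piece, for example to record only the R version or the attached packages. A small base R sketch:

```r
si <- sessionInfo()

# The R version as a single string
r_version <- si$R.version$version.string

# Names of attached non-base packages (NULL if none are attached)
attached <- names(si$otherPkgs)

# Names of attached base packages
base_pkgs <- si$basePkgs

r_version
attached
```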