The data- and design-matrices

Fri, Jul 31, 2020 7-minute read

Introduction

I plan to write a collection of posts describing different aspects of working with expression-based omic data. This will usually be either in the form of transcriptome data (for instance, RNA-seq or arrays) or proteomics (mass spectrometry data). In both cases, an important intermediary format is the expression matrix - a table showing all involved samples as columns and all features (genes, proteins, transcripts, etc.) as rows.

Setting things up

Retrieving the data

As an example, I used the bladderbatch dataset as obtained from Bioconductor - a repository for biology-related R packages.

In order to perform this, the BiocManager R package needs to be installed.

if (!require("bladderbatch")) {
    if (!require("BiocManager")) {
        install.packages("BiocManager")
    }
    BiocManager::install("bladderbatch")
} else {
    message("bladderbatch package already installed")
}
## Loading required package: bladderbatch
## Warning in library(package, lib.loc = lib.loc, character.only = TRUE,
## logical.return = TRUE, : there is no package called 'bladderbatch'
## Loading required package: BiocManager
## Bioconductor version 3.9 (BiocManager 1.30.10), ?BiocManager::install for help
## Bioconductor version '3.9' is out-of-date; the current release version '3.11'
##   is available with R version '4.0'; see https://bioconductor.org/install
## Bioconductor version 3.9 (BiocManager 1.30.10), R 3.6.1 (2019-07-05)
## Installing package(s) 'bladderbatch'
## Installation path not writeable, unable to update packages: boot, class,
##   KernSmooth, lattice, Matrix, mgcv, spatial
## Old packages: 'admisc', 'anytime', 'ape', 'arm', 'backports', 'bit', 'bit64',
##   'brglm2', 'broom', 'calibrate', 'callr', 'car', 'carData', 'caret', 'chron',
##   'classInt', 'corrr', 'data.table', 'dbplyr', 'devtools', 'diffobj', 'dplyr',
##   'DT', 'ellipse', 'ellipsis', 'europepmc', 'factoextra', 'ff', 'ffbase',
##   'fit.models', 'foreach', 'forecast', 'fpc', 'fs', 'future', 'future.apply',
##   'gbm', 'gdtools', 'GGally', 'gganimate', 'ggdendro', 'ggforce', 'ggfortify',
##   'ggplot2', 'ggpubr', 'ggraph', 'git2r', 'glmnet', 'glue', 'gower', 'gplots',
##   'graphlayouts', 'gridSVG', 'gtools', 'HardyWeinberg', 'haven', 'Hmisc',
##   'htmlTable', 'htmltools', 'httpuv', 'httr', 'igraph', 'IRkernel', 'isoband',
##   'janitor', 'jomo', 'jsonlite', 'knitr', 'Lahman', 'later', 'lavaan', 'lme4',
##   'locfit', 'lpSolveAPI', 'lubridate', 'magick', 'maptools', 'MASS', 'MBESS',
##   'mclust', 'mcmc', 'MCMCpack', 'metafor', 'metaviz', 'mice', 'missMDA',
##   'modelr', 'MSnbase', 'multcomp', 'mvtnorm', 'mzR', 'ncdf4', 'nlme', 'nloptr',
##   'nnet', 'OpenMx', 'openssl', 'openxlsx', 'patchwork', 'pillar', 'pkgbuild',
##   'pkgload', 'pkgmaker', 'plotrix', 'PopED', 'pROC', 'processx', 'promises',
##   'ps', 'purrr', 'quantmod', 'quantreg', 'raster', 'RcmdrMisc', 'Rcpp',
##   'RcppArmadillo', 'RCurl', 'recipes', 'remotes', 'reshape2', 'reticulate',
##   'rex', 'rgeos', 'rJava', 'rlang', 'robustbase', 'rootSolve', 'roxygen2',
##   'rpf', 'rrcov', 'rstatix', 'rversions', 'rvest', 'scales', 'segmented',
##   'sem', 'semTools', 'servr', 'sf', 'shiny', 'shinyalert', 'shinycssloaders',
##   'SnowballC', 'sp', 'SQUAREM', 'StanHeaders', 'stopwords', 'stringdist',
##   'survival', 'svglite', 'sys', 'systemfonts', 'tibble', 'tidygraph', 'tidyr',
##   'tidyselect', 'tidytext', 'tinytex', 'tsibble', 'TSP', 'TTR', 'units',
##   'usethis', 'vctrs', 'vdiffr', 'VGAM', 'withr', 'writexl', 'xfun', 'xml2',
##   'yesno', 'zip', 'zoo'

Loading the tidyverse package

if (!require(tidyverse)) {
    install.packages("tidyverse")
} else {
    message("The tidyverse package is already installed")
}
## Loading required package: tidyverse
## ── Attaching packages ───────────────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.0     ✓ purrr   0.3.3
## ✓ tibble  2.1.3     ✓ dplyr   0.8.5
## ✓ tidyr   1.0.2     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.5.0
## ── Conflicts ──────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
## The tidyverse package is already installed
library(tidyverse)

Load the dataset

Here I follow the instructions as given by the package vignette. The data is loaded into the namespace, and subsequently, the expression data is retrieved into the data_df variable and the sample conditions into the design_df variable. Finally, the received matrices (in the R matrix format) are converted to the data.frame format to make them easier to work with below, and the row names of the matrix are extracted into a column gene_id using the command rownames_to_column.

suppressPackageStartupMessages(library(bladderbatch))
data(bladderdata)
data_df <- exprs(bladderEset) %>% data.frame() %>% rownames_to_column("gene_id")
design_df <- pData(bladderEset) %>% data.frame() %>% rownames_to_column("sample_name")

Inspecting the matrices

The data matrix

The data matrix now contains one column containing the gene IDs, and the remaining columns contain gene expression measurements (most likely from the Affymetrix chip platform).

There are two types of columns present. The sample columns (starting with GSM...) contains the actual measurements. Each column contains the measurements for all genes in that particular sample.

The second type is annotation information for the genes, here only present as the gene_id columns specifying the ID of each gene. Additional columns here could include information such as further annotation for the genes, or technical details (transcript length, etc.).

data_df %>% head() %>% select(1:7)
##     gene_id GSM71019.CEL GSM71020.CEL GSM71021.CEL GSM71022.CEL GSM71023.CEL
## 1 1007_s_at    10.115170     8.628044     8.779235     9.248569    10.256841
## 2   1053_at     5.345168     5.063598     5.113116     5.179410     5.181383
## 3    117_at     6.348024     6.663625     6.465892     6.116422     5.980457
## 4    121_at     8.901739     9.439977     9.540738     9.254368     8.798086
## 5 1255_g_at     3.967672     4.466027     4.144885     4.189338     4.078509
## 6   1294_at     7.775183     7.110154     7.248430     7.017220     7.896419
##   GSM71024.CEL
## 1    10.023133
## 2     5.248418
## 3     5.796155
## 4     8.002870
## 5     3.919740
## 6     7.944676

The design matrix

The design matrix contains five columns:

  • sample_name
  • sample
  • outcome
  • batch
  • cancer

This matrix can contain an arbitrary number of columns with information regarding the different samples. Here, we’ll focus on the sample_name column. The values here should exactly match columns present in the data matrix.

design_df %>% head(10)
##     sample_name sample  outcome batch cancer
## 1  GSM71019.CEL      1   Normal     3 Normal
## 2  GSM71020.CEL      2   Normal     2 Normal
## 3  GSM71021.CEL      3   Normal     2 Normal
## 4  GSM71022.CEL      4   Normal     3 Normal
## 5  GSM71023.CEL      5   Normal     3 Normal
## 6  GSM71024.CEL      6   Normal     3 Normal
## 7  GSM71025.CEL      7   Normal     2 Normal
## 8  GSM71026.CEL      8   Normal     2 Normal
## 9  GSM71028.CEL      9 sTCC+CIS     5 Cancer
## 10 GSM71029.CEL     10 sTCC-CIS     2 Cancer

Working with the data and the design matrices

Retrieving all sample data from the data matrix

One of the most common tasks when working with this kind of data is to retrieve the data columns out from the full data matrix. Remember, there are two types of columns present there. In many cases, we want to separately consider the annotation, and do different calculations of visualizations specifically for the data part.

To retrieve the data part, we use the select statement to select all columns in the data matrix matching the sample_name column in the design table.

data_df %>% select(all_of(design_df$sample_name)) %>% head(10) %>% select(1:7)
##    GSM71019.CEL GSM71020.CEL GSM71021.CEL GSM71022.CEL GSM71023.CEL
## 1     10.115170     8.628044     8.779235     9.248569    10.256841
## 2      5.345168     5.063598     5.113116     5.179410     5.181383
## 3      6.348024     6.663625     6.465892     6.116422     5.980457
## 4      8.901739     9.439977     9.540738     9.254368     8.798086
## 5      3.967672     4.466027     4.144885     4.189338     4.078509
## 6      7.775183     7.110154     7.248430     7.017220     7.896419
## 7      5.655863     6.068074     5.822590     5.659677     5.480801
## 8      4.855069     4.971663     4.955943     5.028884     4.838721
## 9      4.942097     4.892139     4.991248     5.163996     5.666031
## 10     3.638684     3.896923     4.057937     3.833727     3.675467
##    GSM71024.CEL GSM71025.CEL
## 1     10.023133     9.108034
## 2      5.248418     5.252312
## 3      5.796155     6.414849
## 4      8.002870     9.093704
## 5      3.919740     4.402590
## 6      7.944676     7.469767
## 7      5.141930     5.757393
## 8      4.688610     5.063306
## 9      5.635379     5.151760
## 10     3.650139     4.044774

By using the all_of statement, the select statement will through a helpful error if we miss certain columns. This is a common error, so doing this using all_of might save you a lot of headaches ahead.

data_df[, 1:10] %>% select(all_of(design_df$sample_name))
## Error: Can't subset columns that don't exist.
## x The columns `GSM71029.CEL`, `GSM71030.CEL`, `GSM71031.CEL`, `GSM71032.CEL`, `GSM71033.CEL`, etc. don't exist.

Retrieving the annotation data

To retrieve the annotation part, we use the select statement again, this time inverting it using the - sign. This will select all columns not present in the design matrix. In this case, this will correspond to only the gene column.

data_df %>% select(-all_of(design_df$sample_name)) %>% head(10)
##      gene_id
## 1  1007_s_at
## 2    1053_at
## 3     117_at
## 4     121_at
## 5  1255_g_at
## 6    1294_at
## 7    1316_at
## 8    1320_at
## 9  1405_i_at
## 10   1431_at

Summary

We learned a standard way of organizing expression data into two matrix-files. One containing is the sample measurements and extra gene information in additional columns. The second matrix contains sample conditions, and a direct link between the two data matrices by having one column matching the sample headers.

This is the format I will rely on ahead when writing future posts.