Tuesday, August 28, 2012

More on Exploring Correlations in R

About a year ago I wrote a post about producing scatterplot matrices in R. These are handy for quickly getting a sense of the correlations that exist in your data. Recently someone asked me to pull out some relevant statistics (correlation coefficient and p-value) into tabular format to publish beside a scatterplot matrix. The built-in cor() function will produce a correlation matrix, but what if you want p-values for those correlation coefficients? Also, instead of a matrix, how might you get these statistics in tabular format (variable i, variable j, r, and p, for each i-j combination)? Here's the code (you'll need the PerformanceAnalytics package to produce the plot).

## Correlation matrix with p-values. See http://goo.gl/nahmV for documentation of this function
cor.prob <- function (X, dfr = nrow(X) - 2) {
R <- cor(X, use="pairwise.complete.obs")
above <- row(R) < col(R)
r2 <- R[above]^2
Fstat <- r2 * dfr/(1 - r2)
R[above] <- 1 - pf(Fstat, 1, dfr)
R[row(R) == col(R)] <- NA
R
}
## Use this to dump the cor.prob output to a 4 column matrix
## with row/column indices, correlation, and p-value.
## See StackOverflow question: http://goo.gl/fCUcQ
flattenSquareMatrix <- function(m) {
if( (class(m) != "matrix") | (nrow(m) != ncol(m))) stop("Must be a square matrix.")
if(!identical(rownames(m), colnames(m))) stop("Row and column names must be equal.")
ut <- upper.tri(m)
data.frame(i = rownames(m)[row(m)[ut]],
j = rownames(m)[col(m)[ut]],
cor=t(m)[ut],
p=m[ut])
}
# get some data from the mtcars built-in dataset
mydata <- mtcars[, c(1,3,4,5,6)]
# correlation matrix
cor(mydata)
# correlation matrix with p-values
cor.prob(mydata)
# "flatten" that table
flattenSquareMatrix(cor.prob(mydata))
# plot the data
library(PerformanceAnalytics)
chart.Correlation(mydata)

The cor() function will produce a basic correlation matrix.  12 years ago Bill Venables provided a function on the R help mailing list for replacing the upper triangle of the correlation matrix with the p-values for those correlations (based on the known relationship between t and r). The cor.prob() function will produce this matrix.

Finally, the flattenSquareMatrix() function will "flatten" this matrix to four columns: one column for variable i, one for variable j, one for their correlation, and another for their p-value (thanks to Chris Wallace on StackOverflow for helping out with this one).

> cor(mydata)
mpg disp hp drat wt
mpg 1.0000000 -0.8475514 -0.7761684 0.6811719 -0.8676594
disp -0.8475514 1.0000000 0.7909486 -0.7102139 0.8879799
hp -0.7761684 0.7909486 1.0000000 -0.4487591 0.6587479
drat 0.6811719 -0.7102139 -0.4487591 1.0000000 -0.7124406
wt -0.8676594 0.8879799 0.6587479 -0.7124406 1.0000000
> cor.prob(mydata)
mpg disp hp drat wt
mpg NA 9.380327e-10 1.787835e-07 1.776240e-05 1.293958e-10
disp -0.8475514 NA 7.142679e-08 5.282022e-06 1.222322e-11
hp -0.7761684 7.909486e-01 NA 9.988772e-03 4.145827e-05
drat 0.6811719 -7.102139e-01 -4.487591e-01 NA 4.784260e-06
wt -0.8676594 8.879799e-01 6.587479e-01 -7.124406e-01 NA
> flattenSquareMatrix(cor.prob(mydata))
i j cor p
1 mpg disp -0.8475514 9.380327e-10
2 mpg hp -0.7761684 1.787835e-07
3 disp hp 0.7909486 7.142679e-08
4 mpg drat 0.6811719 1.776240e-05
5 disp drat -0.7102139 5.282022e-06
6 hp drat -0.4487591 9.988772e-03
7 mpg wt -0.8676594 1.293958e-10
8 disp wt 0.8879799 1.222322e-11
9 hp wt 0.6587479 4.145827e-05
10 drat wt -0.7124406 4.784260e-06


Finally, the chart.Correlation() function from the PerformanceAnalytics package produces a very nice scatterplot matrix, with histograms, kernel density overlays, absolute correlations, and significance asterisks (0.05, 0.01, 0.001):

Wednesday, August 1, 2012

Cscan: Finding Gene Expression Regulators with ENCODE ChIP-Seq Data

Recently published in Nucleic Acids Research:

F. Zambelli, G. M. Prazzoli, G. Pesole, G. Pavesi, Cscan: finding common regulators of a set of genes by using a collection of genome-wide ChIP-seq datasets., Nucleic acids research 40, W510–5 (2012).

Cscan web interface screenshot

This paper presents a methodology and software implementation that allows users to discover a set of transcription factors or epigenetic modifications that regulate a set of genes of interest. A wealth of data about transcription factor binding exists in the public domain, and this is a good example of a group utilizing those resources to develop tools that are of use to the broader computational biology community. 

High-throughput gene expression experiments like microarrays and RNA-seq experiments often result in a list of differentially regulated or co-expressed genes. A common follow-up question asks which transcription factors may regulate those genes of interest. The ENCODE project has completed ChIP-seq experiments for many transcription factors and epigenetic modifications for a number of different cell lines in both human and model organisms. These researchers crossed this publicly available data on enriched regions from ChIP-seq experiments with genomic coordinates of gene annotations to create a table of gene annotations (rows) by ChIP-peak signals, with a presence/absence peak in each cell. Given a set of genes of interest (e.g. differentially regulated genes from an RNA-seq experiment), the method evaluates the over-/under-representation of target sites for the DNA binding protein in each ChIP experiment using a Fisher's exact test. Other methods based on motif-enrichment (using position weight matrices derived from databases like TRANSFAC or JASPAR) would miss DNA-binding factors like the Retinoblastoma protein (RB), which lacks a DNA-binding domain and is recruited to promoters by other transcription factors. In addition to overcoming this limitation, the method presented here also has the advantage of considering tissue-specificity and chromatin accessibility.

The web interface is free and doesn't require registration: http://www.beaconlab.it/cscan

Creative Commons License
Getting Genetics Done by Stephen Turner is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported License.