Monday, July 25, 2011

Scatterplot matrices in R

I just discovered a handy function in R to produce a scatterplot matrix of selected variables in a dataset. The base graphics function is pairs(). Producing these plots can be helpful in exploring your data, especially using the second method below.

Try it out on the built-in iris dataset. (The data set gives the measurements in cm of sepal length and width and petal length and width for 50 flowers from each of 3 species of iris: Iris setosa, versicolor, and virginica.)

# Load the iris dataset.
data(iris)

# Plot #1: Basic scatterplot matrix of the four measurements
pairs(~Sepal.Length+Sepal.Width+Petal.Length+Petal.Width, data=iris)
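
Since iris also records the species of each flower, one natural variation is to color the points by species. This is only a minimal sketch: the color choices, the plot title, and the species_cols name are illustrative and not part of the original example.

```r
# Hypothetical variation: map each species to a color (colors chosen arbitrarily)
species_cols <- c("red", "green3", "blue")[unclass(iris$Species)]

# Scatterplot matrix of the four measurement columns, colored by species
pairs(iris[, 1:4], col = species_cols, pch = 20,
      main = "Iris measurements by species")
```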

Looking at the pairs help page, I found that there's another built-in function, panel.smooth(), that draws a lowess smoother through each plot in a scatterplot matrix. Pass this function to the lower.panel argument of the pairs function. The panel.cor() function below (adapted from the examples on the same help page) computes the absolute correlation between each pair of variables and displays it in the upper panels, with the font size proportional to the absolute value of the correlation.

# panel.smooth function is built in.
# panel.cor puts correlation in upper panels, size proportional to correlation
panel.cor <- function(x, y, digits=2, prefix="", cex.cor, ...)
{
    # Switch the panel to unit coordinates; restore the originals on exit
    usr <- par("usr"); on.exit(par(usr = usr))
    par(usr = c(0, 1, 0, 1))
    r <- abs(cor(x, y))
    txt <- format(c(r, 0.123456789), digits=digits)[1]
    txt <- paste(prefix, txt, sep="")
    if(missing(cex.cor)) cex.cor <- 0.8/strwidth(txt)
    text(0.5, 0.5, txt, cex = cex.cor * r)
}

# Plot #2: same as above, but add lowess smoother in lower and correlation in upper
pairs(~Sepal.Length+Sepal.Width+Petal.Length+Petal.Width, data=iris,
      lower.panel=panel.smooth, upper.panel=panel.cor,
      pch=20, main="Iris Scatterplot Matrix")
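
To check the numbers shown in the upper panels, the same absolute correlations can be computed directly. A small sketch, where abs_cor is just an illustrative name:

```r
# Absolute Pearson correlations among the four measurement columns,
# rounded to two decimals for easy comparison with the upper panels
abs_cor <- round(abs(cor(iris[, 1:4])), 2)
print(abs_cor)
```

Larger values in this matrix correspond to the larger text in Plot #2.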

Finally, you can produce a similar plot using ggplot2, with the diagonal showing the kernel density.

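# Plot #3: similar plot using ggplot2
# install.packages("ggplot2") ## uncomment to install ggplot2
library(ggplot2)
# Note: plotmatrix() was later deprecated and removed from ggplot2;
# with current versions, use GGally::ggpairs() instead.
plotmatrix(with(iris, data.frame(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width)))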

See the pairs help page (?pairs) for more on the pairs function.


Update: A tip of the hat to Hadley Wickham (@hadleywickham) for pointing out two packages useful for scatterplot matrices. The gpairs package has some useful functionality for showing the relationship between both continuous and categorical variables in a dataset, and the GGally package extends ggplot2 for plot matrices.
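
As a rough sketch of the GGally route, assuming the GGally package is installed (it is not part of base R or ggplot2), ggpairs() draws a plot matrix for the four measurements. Coloring by Species is an illustrative choice here, not something from the original tip.

```r
# install.packages("GGally") ## uncomment to install GGally
library(GGally)

# Plot matrix of the four measurement columns, colored by species
p <- ggpairs(iris, columns = 1:4, mapping = ggplot2::aes(colour = Species))
print(p)
```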

Tuesday, July 12, 2011

Download 69 Complete Human Genomes

Sequencing company Complete Genomics recently made available 69 ethnically diverse complete human genome sequences: a Yoruba trio; a Puerto Rican trio; a 17-member, 3-generation pedigree; and a diversity panel representing 9 different populations. Some of the samples partially overlap with HapMap and the 1000 Genomes Project. The data can be downloaded directly from the FTP site. See the link below for more details on the directory contents, and have a look at the quick start guide to working with Complete Genomics data.

Complete Genomics - Sample Human Genome Sequence Data
Getting Genetics Done by Stephen Turner is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported License.