Here's the R code to generate the data and all the figures in this post.
# Generate some data
library(MASS)
set.seed(101)
n <- 50000
X <- mvrnorm(n, mu=c(.5,2.5), Sigma=matrix(c(1,.6,.6,1), ncol=2))
# A color palette from blue to yellow to red
library(RColorBrewer)
k <- 11
my.cols <- rev(brewer.pal(k, "RdYlBu"))
## compute 2D kernel density, see MASS book, pp. 130-131
z <- kde2d(X[,1], X[,2], n=50)
# Make the base plot
plot(X, xlab="X label", ylab="Y label", pch=19, cex=.4)
# Draw the colored contour lines
contour(z, drawlabels=FALSE, nlevels=k, col=my.cols, add=TRUE, lwd=2)
# Add lines for the mean of X and Y
abline(h=mean(X[,2]), v=mean(X[,1]), col="gray", lwd=1.5)
# Add the correlation coefficient to the top left corner
legend("topleft", paste("R=", round(cor(X)[1,2],3)), bty="n")
## Other methods to fix overplotting
# Make points smaller - use a single pixel as the plotting character
plot(X, pch=".")
# Hexbinning
library(hexbin)
plot(hexbin(X[,1], X[,2]))
# Make points semi-transparent
library(ggplot2)
qplot(X[,1], X[,2], alpha=I(.1))
# The smoothScatter function (graphics package)
smoothScatter(X)
Here's the problem: there are 50,000 points in this plot, causing extreme overplotting. (This is a simple multivariate normal distribution, but if the distribution were more complex, overplotting might obscure a relationship in the data that you didn't know about.)
I liked the solution they used in the paper referenced above. Contour lines were placed throughout the data, indicating the density of the data in each region. Further, the contour lines were "heat" colored from blue to red, indicating increasing data density. Optionally, you can add vertical and horizontal lines that intersect at the means, and a legend that reports the correlation coefficient between the two variables.
There are many other ways to solve an overplotting problem: reducing the size of the points, making the points semi-transparent, or using hexagonal binning.
Using a single pixel for each data point:
Using hexagonal binning to display density (hexbin package):
Finally, using semi-transparency (10% opacity; easiest using the ggplot2 package):
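If you would rather stay in base graphics, the same 10% opacity can be approximated by passing a semi-transparent color to plot(). This is only a sketch: it uses a smaller simulated sample than the post's 50,000 points so it runs quickly, and the name pt.col is just an illustrative choice. Note that some graphics devices do not support semi-transparency.

```r
library(MASS)  # for mvrnorm()
set.seed(101)

# Assumption: 5,000 points instead of 50,000, only to keep the sketch quick
X <- mvrnorm(5000, mu = c(.5, 2.5), Sigma = matrix(c(1, .6, .6, 1), ncol = 2))

# Black at 10% opacity; adjustcolor() lives in grDevices, which is loaded by default
pt.col <- adjustcolor("black", alpha.f = 0.1)

# Dense regions appear darker because overlapping points accumulate opacity
plot(X, pch = 19, cex = .4, col = pt.col, xlab = "X label", ylab = "Y label")
```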
Edit July 7, 2012 - From Pete's comment below, the smoothScatter() function in the built-in graphics package produces a smoothed color density representation of the scatterplot, obtained through a kernel density estimate. You can change the colors using the colramp option, and change how many outliers are plotted with the nrpoints option. Here, 100 outliers are plotted as single black pixels - outliers here being points in the areas of lowest regional density.
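To make the colramp and nrpoints options concrete, here is a minimal sketch. The palette (my.ramp) and the reduced sample size are my own illustrative assumptions, not what was used for the figure in this post.

```r
library(MASS)  # for mvrnorm()
set.seed(101)

# Assumption: 5,000 points instead of 50,000, only to keep the sketch quick
X <- mvrnorm(5000, mu = c(.5, 2.5), Sigma = matrix(c(1, .6, .6, 1), ncol = 2))

# Hypothetical palette: white (low density) through to dark red (high density)
my.ramp <- colorRampPalette(c("white", "yellow", "orange", "darkred"))

# colramp sets the density colors; nrpoints = 100 overlays the 100 points
# from the sparsest regions (nrpoints = 0 would suppress them entirely)
smoothScatter(X, colramp = my.ramp, nrpoints = 100,
              xlab = "X label", ylab = "Y label")
```

colramp expects a function that takes a number of colors and returns that many colors, which is exactly what colorRampPalette() produces.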
How do you deal with overplotting when you have many points?
I'm a big fan of pch = "." for quick and dirty, but there are some really nice ideas here -- thanks! I recently stumbled across color density plots, but I cannot remember the package for the life of me.
Very useful examples, Stephen. I like the smoothScatter function in the graphics library. For your example:
smoothScatter(X)
You can tweak a number of parameters including how and when outliers are displayed (via "nrpoints") and the colour gradient used (via "colramp").
Very useful examples, thanks so much
Pete - thanks very much. I haven't seen the smoothScatter() function. I updated the post to include that one!
Terrific side by side comparison of various over plotting techniques. Thanks for sharing.
Another useful function I've been happy with is 'densCols'
http://stat.ethz.ch/R-manual/R-patched/library/grDevices/html/densCols.html
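For anyone who wants to try the densCols suggestion, a rough sketch follows. densCols() returns one color per point based on the local kernel density, so you keep the individual points but color them by how crowded their neighborhood is. The blue-yellow-red ramp and the reduced sample size are assumptions made for illustration.

```r
library(MASS)  # for mvrnorm()
set.seed(101)

# Assumption: 5,000 points instead of 50,000, only to keep the sketch quick
X <- mvrnorm(5000, mu = c(.5, 2.5), Sigma = matrix(c(1, .6, .6, 1), ncol = 2))

# One color per point, chosen from the ramp according to local density
dens.cols <- densCols(X[, 1], X[, 2],
                      colramp = colorRampPalette(c("blue", "yellow", "red")))

# Plot the raw points, colored from blue (sparse) to red (dense)
plot(X, col = dens.cols, pch = 20, cex = .4,
     xlab = "X label", ylab = "Y label")
```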
I like smoothScatter() the best.
Love this! Is there something similar in Python or are we left to our own devices?