I saw this plot in the supplement of a recent paper comparing microarray results to RNA-seq results. Nothing earth-shattering in the paper - you've probably seen a similar comparison many times before - but I liked how they solved the overplotting problem using heat-colored contour lines to indicate density. I asked on Stack Exchange how to reproduce this figure in R, and my question was quickly answered by Christophe Lalanne.
Here's the R code to generate the data and all the figures in this post.
Here's the problem: the 50,000 points in this plot cause extreme overplotting. (This is a simple multivariate normal distribution, but if the distribution were more complex, overplotting might obscure a relationship in the data that you didn't know about.)
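A minimal sketch of how data like this could be simulated and plotted naively. The seed, the means, and the 0.7 correlation are assumptions for illustration, not the values behind the original figure; `X` is the same name used in the comments below.

```r
# Simulate 50,000 points from a correlated bivariate normal distribution.
# Seed, means, and covariance are illustrative assumptions.
library(MASS)  # provides mvrnorm()

set.seed(42)
n <- 50000
Sigma <- matrix(c(1, 0.7, 0.7, 1), nrow = 2)  # assumed correlation of 0.7
X <- mvrnorm(n, mu = c(0, 0), Sigma = Sigma)
colnames(X) <- c("x", "y")

# Default scatterplot: the open circles pile up into a solid blob
plot(X, main = "50,000 points, default plot")
```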
I liked the solution they used in the paper referenced above. Contour lines were placed throughout the data indicating the density of the data in that region. Further, the contour lines were "heat" colored from blue to red, indicating increasing data density. Optionally, you can add vertical and horizontal lines that intersect the means, and a legend that includes the absolute correlation coefficient between the two variables.
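One way to get this kind of figure is sketched below. It assumes a 2-D kernel density estimate from `MASS::kde2d()` drawn with base `contour()`; the grid size, the number of levels, and the simulated data are my assumptions, and this is not necessarily the code from the paper or from the Stack Exchange answer.

```r
library(MASS)  # mvrnorm() and kde2d()

set.seed(42)
X <- mvrnorm(50000, mu = c(0, 0), Sigma = matrix(c(1, 0.7, 0.7, 1), nrow = 2))

# 2-D kernel density estimate on a 50 x 50 grid (grid size is an assumption)
dens <- kde2d(X[, 1], X[, 2], n = 50)

# Choose the contour levels explicitly so there is exactly one color per level
levs <- pretty(range(dens$z), 10)
cols <- colorRampPalette(c("blue", "red"))(length(levs))  # low to high density

plot(X, pch = ".", col = "grey70", xlab = "x", ylab = "y")
contour(dens, levels = levs, col = cols, lwd = 2, drawlabels = FALSE, add = TRUE)

# Optional extras: lines through the means and a legend with |r|
abline(v = mean(X[, 1]), h = mean(X[, 2]), lty = 2)
legend("topleft", bty = "n",
       legend = sprintf("|r| = %.2f", abs(cor(X[, 1], X[, 2]))))
```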
There are many other ways to solve an overplotting problem: reducing the size of the points, making the points transparent, or using hex-binning.
Using a single pixel for each data point:
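In base graphics this is just `pch = "."`. A quick sketch on simulated data (the seed and the 0.7 correlation are assumptions):

```r
library(MASS)  # mvrnorm()

set.seed(42)
X <- mvrnorm(50000, mu = c(0, 0), Sigma = matrix(c(1, 0.7, 0.7, 1), nrow = 2))

# pch = "." draws each observation as a very small dot
plot(X, pch = ".", xlab = "x", ylab = "y", main = "One pixel per point")
```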
Using hexagonal binning to display density (hexbin package):
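A sketch with the hexbin package on simulated data. The number of bins (`xbins = 40`) and the simulated data are assumptions; the `requireNamespace()` guard is only there so the snippet does not fail if hexbin is not installed.

```r
library(MASS)  # mvrnorm()

set.seed(42)
X <- mvrnorm(50000, mu = c(0, 0), Sigma = matrix(c(1, 0.7, 0.7, 1), nrow = 2))

if (requireNamespace("hexbin", quietly = TRUE)) {
  library(hexbin)
  # Count the points falling into each hexagonal cell
  hb <- hexbin(X[, 1], X[, 2], xbins = 40)
  # Shade each hexagon by its count
  plot(hb, xlab = "x", ylab = "y")
} else {
  message("Install hexbin first: install.packages(\"hexbin\")")
}
```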
Finally, using semi-transparency (10% opacity; easiest using the ggplot2 package):
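A sketch of the ggplot2 version, with `alpha = 0.1` giving the 10% opacity. The simulated data and the column names `x` and `y` are assumptions.

```r
library(MASS)  # mvrnorm()

set.seed(42)
X <- mvrnorm(50000, mu = c(0, 0), Sigma = matrix(c(1, 0.7, 0.7, 1), nrow = 2))
df <- data.frame(x = X[, 1], y = X[, 2])

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  # Each point is drawn at 10% opacity, so dense regions appear darker
  p <- ggplot(df, aes(x = x, y = y)) +
    geom_point(alpha = 0.1)
  print(p)
} else {
  message("Install ggplot2 first: install.packages(\"ggplot2\")")
}
```

Base graphics can do something similar on devices that support transparency, for example `plot(X, pch = 16, col = rgb(0, 0, 0, 0.1))`.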
Edit July 7, 2012 - From Pete's comment below, the smoothScatter() function in the built-in graphics package produces a smoothed color density representation of the scatterplot, obtained through a kernel density estimate. You can change the colors using the colramp option, and change how many outliers are plotted with the nrpoints option. Here, 100 outliers are plotted as single black pixels - outliers here being points in the areas of lowest regional density.
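A sketch of that call on simulated data. The color ramp and the simulated data are assumptions; `nrpoints = 100` matches the description above and is also the function's default.

```r
library(MASS)  # mvrnorm()

set.seed(42)
X <- mvrnorm(50000, mu = c(0, 0), Sigma = matrix(c(1, 0.7, 0.7, 1), nrow = 2))

# Assumed white-to-red ramp; colramp takes a function that returns n colors
heat <- colorRampPalette(c("white", "blue", "yellow", "red"))

# Smoothed density image, with the 100 lowest-density points drawn as black dots
smoothScatter(X, colramp = heat, nrpoints = 100, pch = ".", col = "black",
              xlab = "x", ylab = "y")
```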
How do you deal with overplotting when you have many points?
I'm a big fan of pch = "." for quick and dirty, but there are some really nice ideas here -- thanks! I recently stumbled across color density plots, but I cannot remember the package for the life of me.
Very useful examples, Stephen. I like the smoothScatter function in the graphics library. For your example:
smoothScatter(X)
You can tweak a number of parameters including how and when outliers are displayed (via "nrpoints") and the colour gradient used (via "colramp").
Very useful examples, thanks so much
Pete - thanks very much. I haven't seen the smoothScatter() function. I updated the post to include that one!
Terrific side by side comparison of various over plotting techniques. Thanks for sharing.
Another useful function I've been happy with is 'densCols'
http://stat.ethz.ch/R-manual/R-patched/library/grDevices/html/densCols.html
I like smoothScatter() the best.
Love this! Is there something similar in Python or are we left to our own devices?