Getting Genetics Done: Visualization

Showing posts with label Visualization. Show all posts

Friday, January 8, 2016

Repel overlapping text labels in ggplot2

A while back I showed you how to make volcano plots in base R for visualizing gene expression results. This is just one of many genome-scale plots where you might want to show all individual results but highlight or call out important results by labeling them, for example, with a gene name.

But if you want to annotate lots of points, the annotations usually get so crowded that they overlap one another and become illegible. There are ways around this - reducing the font size, or adjusting the position or angle of the text, but these usually don’t completely solve the problem, and can even make the visualization worse. Here’s the plot again, reading the results directly from GitHub, and drawing the plot with ggplot2 and geom_text out of the box.

What a mess. It’s difficult to see what any of those downregulated genes are on the left. Enter the ggrepel package, a new extension of ggplot2 that repels text labels away from one another. Just sub in geom_text_repel() in place of geom_text() and the extension is smart enough to try to figure out how to label the points such that the labels don’t interfere with each other. Here it is in action.

And the result (much better!):

See the ggrepel package vignette for more.

Tuesday, April 21, 2015

R: single plot with two different y-axes

I forgot where I originally found the code to do this, but I recently had to dig it out again to remind myself how to draw two different y axes on the same plot to show the values of two different features of the data. This is somewhat distinct from the typical use case of aesthetic mappings in ggplot2 where I want to have different lines/points/colors/etc. for the same feature across multiple subsets of data.

For example, I was recently poking around with some data examining enrichment of a particular set of genes using a hypergeometric test as I was fiddling around with other parameters that included more genes in the selection (i.e., in the classic example, the number of balls drawn from some hypothetical urn). I wanted to show the -log10(p-value) on one axis and some other value (e.g., “n”) on the same plot, using a different axis on the right side of the plot.

Here’s how to do it. First, generate some data:

set.seed(2015-04-13)

d = data.frame(x =seq(1,10),
           n = c(0,0,1,2,3,4,4,5,6,6),
           logp = signif(-log10(runif(10)), 2))

x	n	logp
1	0	1.400
2	0	0.590
3	1	1.200
4	2	1.500
5	3	0.028
6	4	0.380
7	4	2.500
8	5	0.067
9	6	0.041
10	6	0.360

The strategy here is to first draw one of the plots, then draw another plot on top of the first one, and manually add in an axis. So let’s draw the first plot, but leave some room on the right hand side to draw an axis later on. I’m drawing a red line plot showing the p-value as it changes over values of x.

par(mar = c(5,5,2,5))
with(d, plot(x, logp, type="l", col="red3", 
             ylab=expression(-log[10](italic(p))),
             ylim=c(0,3)))

Now, draw the second plot on top of the first using the par(new=T) call. Draw the plot, but don’t include an axis yet. Put the axis on the right side (axis(...)), and add text to the margin (mtext...). Finally, add a legend.

par(new = T)
with(d, plot(x, n, pch=16, axes=F, xlab=NA, ylab=NA, cex=1.2))
axis(side = 4)
mtext(side = 4, line = 3, 'Number genes selected')
legend("topleft",
       legend=c(expression(-log[10](italic(p))), "N genes"),
       lty=c(1,0), pch=c(NA, 16), col=c("red3", "black"))

Friday, April 10, 2015

R User Group Recap: Heatmaps and Using the caret Package

At our most recent R user group meeting we were delighted to have presentations from Mark Lawson and Steve Hoang, both bioinformaticians at Hemoshear. All of the code used in both demos is in our Meetup’s GitHub repo.

Making heatmaps in R

Steve started with an overview of making heatmaps in R. Using the iris dataset, Steve demonstrated making heatmaps of the continuous iris data using the heatmap.2 function from the gplots package, the aheatmap function from NMF, and the hard way using ggplot2. The “best in class” method used aheatmap to draw an annotated heatmap plotting z-scores of columns and annotated rows instead of raw values, using the Pearson correlation instead of Euclidean distance as the distance metric.

library(dplyr)
library(NMF)
library(RColorBrewer)
iris2 = iris # prep iris data for plotting
rownames(iris2) = make.names(iris2$Species, unique = T)
iris2 = iris2 %>% select(-Species) %>% as.matrix()
aheatmap(iris2, color = "-RdBu:50", scale = "col", breaks = 0,
         annRow = iris["Species"], annColors = "Set2", 
         distfun = "pearson", treeheight=c(200, 50), 
         fontsize=13, cexCol=.7, 
         filename="heatmap.png", width=8, height=16)

Classification and regression using caret

Mark wrapped up with a gentle introduction to the caret package for classification and regression training. This demonstration used the caret package to split data into training and testing sets, and run repeated cross-validation to train random forest and penalized logistic regression models for classifying Fisher’s iris data.

First, get a look at the data with the featurePlot function in the caret package:

library(caret)
set.seed(42)
data(iris)
featurePlot(x = iris[, 1:4],
            y = iris$Species,
            plot = "pairs",
            auto.key = list(columns = 3))

Next, after splitting the data into training and testing sets and using the caret package to automate training and testing both random forest and partial least squares models using repeated 10-fold cross-validation (see the code), it turns out random forest outperforms PLS in this case, and performs fairly well overall:

	setosa	versicolor	virginica
Sensitivity	1.00	1.00	0.00
Specificity	1.00	0.50	1.00
Pos Pred Value	1.00	0.50	NaN
Neg Pred Value	1.00	1.00	0.67
Prevalence	0.33	0.33	0.33
Detection Rate	0.33	0.33	0.00
Detection Prevalence	0.33	0.67	0.00
Balanced Accuracy	1.00	0.75	0.50

A big thanks to Mark and Steve at Hemoshear for putting this together!

Wednesday, May 28, 2014

Using Volcano Plots in R to Visualize Microarray and RNA-seq Results

I've been asked a few times how to make a so-called volcano plot from gene expression results. A volcano plot typically plots some measure of effect on the x-axis (typically the fold change) and the statistical significance on the y-axis (typically the -log10 of the p-value). Genes that are highly dysregulated are farther to the left and right sides, while highly significant changes appear higher on the plot.

I've analyzed some data from GEO (GSE52202) using RNA-seq to study gene expression in motor neurons differentiated from induced pluripotent stem cells (iPSCs) derived from ALS patients carrying the C9ORF72 repeat expansion. I aligned the data, counted with featureCounts, and analyzed with DESeq2. I uploaded the results to this GitHub Gist.

Here's how you can use R to create a simple volcano plot. First, download the results file here and save it as a text file called results.txt.

After reading in the data from GitHub the next section creates a basic volcano plot. A few more lines color the points based on their fold change and statistical significance. Finally, if you have the calibrate package installed, the last line labels a few genes of interest.

Thursday, May 15, 2014

qqman: an R package for creating Q-Q and manhattan plots from GWAS results

Three years ago I wrote a blog post on how to create manhattan plots in R. After hundreds of comments pointing out bugs and other issues, I've finally cleaned up this code and turned it into an R package.

The qqman R package is on CRAN: http://cran.r-project.org/web/packages/qqman/

The source code is on GitHub: https://github.com/stephenturner/qqman

If you'd like to cite the qqman package (appreciated but not required), please cite this pre-print: Turner, S.D. qqman: an R package for visualizing GWAS results using Q-Q and manhattan plots. biorXiv DOI: 10.1101/005165 (2014).

Something wrong? Please file bug reports, feature requests, or anything else related to the code as an issue on GitHub rather than commenting here. Also, please post the code you're using and some example data causing a failure in a publicly accessible place, such as a GitHub gist (no registration required). It's difficult to troubleshoot if I can't see the data where the code is failing. Want to contribute? Awesome! Send me a pull request.

Note: This release is substantially simplified for the sake of maintainability and creating an R package. The old code that allows confidence intervals on the Q-Q plot and allows more flexible annotation and highlighting is still available at the version 0.0.0 release on GitHub.

Here's a shout-out to all the blog commenters on the previous post for pointing out bugs and other issues, and a special thanks to Dan Capurso and Tim Knutsen for useful contributions and bugfixes.

qqman package tutorial

First things first. Install the package (do this only once), then load the package (every time you start a new R session)

# only once:
install.packages("qqman")

# each time:
library(qqman)

You can access this help any time from within R by accessing the vignette:

vignette("qqman")

The qqman package includes functions for creating manhattan plots and q-q plots from GWAS results. The gwasResults data.frame included with the package has simulated results for 16,470 SNPs on 22 chromosomes. Take a look at the data:

str(gwasResults)

'data.frame':   16470 obs. of  4 variables:
 $ SNP: chr  "rs1" "rs2" "rs3" "rs4" ...
 $ CHR: int  1 1 1 1 1 1 1 1 1 1 ...
 $ BP : int  1 2 3 4 5 6 7 8 9 10 ...
 $ P  : num  0.915 0.937 0.286 0.83 0.642 ...

head(gwasResults)

  SNP CHR BP      P
1 rs1   1  1 0.9148
2 rs2   1  2 0.9371
3 rs3   1  3 0.2861
4 rs4   1  4 0.8304
5 rs5   1  5 0.6417
6 rs6   1  6 0.5191

tail(gwasResults)

          SNP CHR  BP      P
16465 rs16465  22 530 0.5644
16466 rs16466  22 531 0.1383
16467 rs16467  22 532 0.3937
16468 rs16468  22 533 0.1779
16469 rs16469  22 534 0.2393
16470 rs16470  22 535 0.2630

How many SNPs on each chromosome?

as.data.frame(table(gwasResults$CHR))

   Var1 Freq
1     1 1500
2     2 1191
3     3 1040
4     4  945
5     5  877
6     6  825
7     7  784
8     8  750
9     9  721
10   10  696
11   11  674
12   12  655
13   13  638
14   14  622
15   15  608
16   16  595
17   17  583
18   18  572
19   19  562
20   20  553
21   21  544
22   22  535

Creating manhattan plots

Now, let's make a basic manhattan plot.

manhattan(gwasResults)

We can also pass in other graphical parameters. Let's add a title (main=), reduce the point size to 50%(cex=), and reduce the font size of the axis labels to 80% (cex.axis=):

manhattan(gwasResults, main = "Manhattan Plot", cex = 0.5, cex.axis = 0.8)

Let's change the colors and increase the maximum y-axis:

manhattan(gwasResults, col = c("blue4", "orange3"), ymax = 12)

Let's remove the suggestive and genome-wide significance lines:

manhattan(gwasResults, suggestiveline = F, genomewideline = F)

Let's look at a single chromosome:

manhattan(subset(gwasResults, CHR == 1))

Let's highlight some SNPs of interest on chromosome 3. The 100 SNPs we're highlighting here are in a character vector called snpsOfInterest. You'll get a warning if you try to highlight SNPs that don't exist.

str(snpsOfInterest)

 chr [1:100] "rs3001" "rs3002" "rs3003" "rs3004" "rs3005" ...

manhattan(gwasResults, highlight = snpsOfInterest)

We can combine highlighting and limiting to a single chromosome:

manhattan(subset(gwasResults, CHR == 3), highlight = snpsOfInterest, main = "Chr 3")

A few notes on creating manhattan plots:

Run str(gwasResults). Notice that the gwasResults data.frame has SNP, chromosome, position, and p-value columns named SNP, CHR, BP, and P. If you're creating a manhattan plot and your column names are different, you'll have to pass the column names to the chr=, bp=, p=, and snp= arguments. See help(manhattan) for details.
The chromosome column must be numeric. If you have “X,” “Y,” or “MT” chromosomes, you'll need to rename these 23, 24, 25, etc.
If you'd like to change the color of the highlight or the suggestive/genomewide lines, you'll need to modify the source code. Search for col="blue", col="red", or col="green3" to modify the suggestive line, genomewide line, and highlight colors, respectively.

Creating Q-Q plots

Creating Q-Q plots is straightforward - simply supply a vector of p-values to the qq() function. You can optionally provide a title.

qq(gwasResults$P, main = "Q-Q plot of GWAS p-values")

Thursday, March 20, 2014

Visualize coverage for targeted NGS (exome) experiments

I'm calling variants from exome sequencing data and I need to evaluate the efficiency of the capture and the coverage along the target regions.

This sounds like a great use case for bedtools, your swiss-army knife for genomic arithmetic and interval manipulation. I'm lucky enough to be able to walk down the hall and bug Aaron Quinlan, creator of bedtools, whenever I have a "how do I do X with bedtools?" question (which happens frequently).

As open-source bioinformatics software documentation goes, bedtools' documentation is top-notch. In addition, Aaron recently pointed out a work-in-progress bedtools cookbook that he's putting together, giving code examples for both typical and clever uses of bedtools.

Getting back to my exome data, one way to visualize this is to plot the cumulative distribution describing the fraction of targeted bases that were covered by >10 reads, >20 reads, >80 reads, etc. For example, covering 90% of the target region at 20X coverage may be one metric to assess your ability to reliably detect heterozygotes. Luckily for me, there's a bedtools protocol for that.

The basic idea is that for each sample, you're using bedtools coverage to read in both a bam file containing your read alignments and a bed file containing your target capture regions (for example, you can download NimbleGen's V3 exome capture regions here). The -hist option outputs a histogram of coverage for each feature in the BED file as well as a summary histogram for all of the features in the BED file. Below I'm using GNU parallel to run this on all 6 of my samples, piping the output of bedtools to grep out only the lines starting with "all."

Now that I have text files with coverage histograms for all the regions in the capture target, I can now plot this using R.

You can see that sample #2 had some problems. Relative to the rest of the samples you see it take a quick nose-dive on the left side of the plot relative to the others. Running this through Picard showed that 86% of the reads from this sample were duplicates.

Thanks, Aaron, for the tips.

Bedtools protocols: https://github.com/arq5x/bedtools-protocols

Wednesday, November 6, 2013

A Mitochondrial Manhattan Plot

Manhattan plots have become the standard way to visualize results for genetic association studies, allowing the viewer to instantly see significant results in the rough context of their genomic position. Manhattan plots are typically shown on a linear X-axis (although the circos package can be used for radial plots), and this is consistent with the linear representation of the genome in online genome browsers. Many genetic studies often overlook the other genome within all our cells, the mitochondrial genome. This circular molecule has been shown to be associated (albeit inconsistently) with many disease traits, and functional variants from this genome are now included in most genotyping platforms.

Thanks to the clever work of several graduate students in the Vanderbilt Center for Human Genetics Research (most notably Ben Grady), mitochondrial genetic associations can be visualized in the context of key regions of the mitochondrial genome using a radial plot in the R package ggplot2.

To make this plot, download the code and run the script (alternatively open the script in R and run interactively):

On the command line:

git clone git@github.com:stephenturner/solarplot.git
cd solarplot
R CMD BATCH solarplot.R

GitHub: Mitochondrial solar plot code and data

Monday, November 4, 2013

Archival and analysis of #GI2013 Tweets

I archived and analyzed all Tweets containing #GI2013 from the recent Cold Spring Harbor Genome Informatics meeting, using my previously described code.

Friday was the most Tweeted day. Perhaps this was due to Lior Pachter's excellent keynote, "Stories from the Supplement."

There was clearly a bimodal distribution in the Tweets by hour, corresponding to the morning and evening sessions:

Top hashtags used other than #GI2013:

Most prolific users:

And of course, the much-maligned word cloud:

Archive of all #GI2013 Tweets

R code for analysis

This blog has moved!