Wednesday, December 30, 2009

Use plyr instead of _apply() in R

I've covered plyr once before, showing you how to get means and variances for two quantitative traits across multilocus genotypes. JD Long over at Cerebral Mastication recently posted a nice screencast illustrating how plyr "just works" as an alternative to R's family of apply commands.  There's a set of R functions (apply, sapply, lapply, tapply, eapply, and rapply) that can apply a command or function to your data and return a hopefully useful result.  However, for the non-programmers among us, choosing which apply function to use and how to use it can be mind-bogglingly confusing.  I've never gotten one of these functions to work as I wanted it to the first time around, and I often end up writing loops where the vectorized operation would be much faster.

Enter plyr.

As I mentioned previously, the plyr functions (ddply, in particular), are intuitive, usually returning the result that you wanted. The ddply function splits up your dataset based on one or more grouping variables, applies some function or statistic, and summarizes the results returning a dataframe.

Here's JD Long's screencast showing how plyr makes a task like this easy where the apply function fails.

Cerebral Mastication - Struggling with apply() in R

Monday, December 28, 2009

Capture system commands as R objects with system(..., intern=T)

Just discovered this very handy R command to capture the output from a system command as an R object.  I wanted to use R to read in the output from another program (PLINK) and do some processing on each output file. Of course if the files are named sequentially (plink1.out, plink2.out, plink3.out, etc.) this would be simple with a for loop.  This wasn't the case for me, but I could still list all the files I wanted to process with some pattern matching in Unix using wildcards.  All the files had "_hdl" in the name, and ended with ".qassoc".  Here's where the system() function is helpful.

This command issues "ls *_hdl*.qassoc" to the system, just as if you were typing it in at the terminal.  The intern=TRUE tells the system command to treat the output as an R object.  I can store this in myfiles, then do some processing on myfiles, which is a vector containing all the filenames of files I want to process.

myfiles = system("ls *_hdl*.qassoc", intern=TRUE)

 [1] "0-all.01_hdl_modeled.qassoc"
 [2] "0-all.02_hdl_modeled_polyresid.qassoc"
 [3] "0-all.03_hdl_med.qassoc"
 [4] "0-all.04_hdl_med_polyresid.qassoc"
 [5] "0-all.05_hdl_med_smokage_polyresid.qassoc"
 [6] "1-male.01_hdl_modeled.qassoc"
 [7] "1-male.02_hdl_modeled_polyresid.qassoc"
 [8] "1-male.03_hdl_med.qassoc"
 [9] "1-male.04_hdl_med_polyresid.qassoc"
[10] "1-male.05_hdl_med_smokage_polyresid.qassoc"

Tuesday, December 22, 2009

Sync files across multiple computers with Dropbox (PC, Mac & Linux too!)

Do you ever find yourself switching back and forth between your work computer, your laptop, and your home computer?  This happens to me all the time when I'm writing.  Rather than carry all your files on a USB stick and risk losing it or corrupting your data, give Dropbox a try.  It's dead simple, and works for PC, Mac, and Linux too.

Once you sign up and install on all your computers, you'll have a special folder, where if you save something there on one computer, it is automatically created and stays synchronized in the same folder on all your other computers.  What's more, if you use someone else's computer, you can access all your files through a web interface because they're all securely backed up online.  I've been using this for a while now to sync all the papers I'm working on, RefMan/EndNote databases, config files, and R functions I reuse all the time.  You can also create "public" folders.  Put something here, and you can get a direct link to the file online to share with other folks. For example, here's a link to some R code I wrote to use ggplot2 to make manhattan plots and QQ-plots for every PLINK output file in the current directory (I'm hoping to clean this code up and include this with some other functions I've written into a package on CRAN soon).

I can't recommend this little app enough. If you're still not convinced, check out this short video that explains what Dropbox is all about and shows of just how simple it is to use.

You get a whopping 2GB for free, but if you use the registration link provided below, you'll get an extra free 1/4GB.  Happy holidays from GGD, and I'll catch up with you all next week!

Dropbox - Secure online backup and synchronization

Thursday, December 17, 2009

Review: The challenges of sequencing by synthesis

A tip of the hat to a commenter on my previous coverage of a next-gen sequencing paper for pointing out this detailed and perhaps more technically-oriented review on sequencing by synthesis recently published in Nature Biotechnology.  Thanks, Clive.

Review: The challenges of sequencing by synthesis (Nature Biotechnology)

Wednesday, December 16, 2009

Recent improvements to Pubget

If you've never heard of it before, check out my previous coverage on Pubget. It's like PubMed, but you get the PDFs right away.  Pubget has recently implemented a number of improvements.

1. Citation matching.  Pubget's citation matcher seems to work better than Pubmed most of the time.  Try going to Pubget and pasting any of these random citations into the search bar:

J Biol Chem 277: 30738-30745
Nucleic Acids Res 2004;32:4812-20.
Evol. Biol. 7, 214 (2007).

2. The PaperPlane bookmarklet.  Go here and drag the link to your bookmark toolbar.  Now, if you're searching from pubmed, click the bookmarklet for one-click access to the PDF.

3. If you have a long list of PMIDs, separate them with commas and you can paste them directly into the search bar.

Pubget (Vanderbilt institutional link)

Pubget (If you're anywhere else)

Tuesday, December 15, 2009

Seminar announcement: A Multivariate Methodology for Analyzing Genome-wide Association Studies

This looks interesting.

Department of Biostatistics Seminar/Workshop Series: A Multivariate Methodology for Analyzing Genome-wide Association Studies, by Janice Brodsky, PhD, UCLA.

Wednesday, December 16, 1:30-2:30pm, MRB III Conference Room 1220

Intended Audience: Persons interested in applied statistics, statistical theory, epidemiology, health services research, clinical trials methodology, statistical computing, statistical graphics, R users or potential users

In the last few years, high-dimensional genome-wide association (GWA) studies have become a common tool in genetics for investigating which genes are associated with physical traits. However, the results of many GWA studies have fewer genes than expected or even no genes at all. This does not necessarily indicate that there are no genetic associations in the data: genes with weaker associations or which only work in groups will be missed with the standard GWA statistical analysis. We present a multivariate methodology for analyzing GWA data which is designed to handle weaker signals, dependent data, and multicollinearity. We applied this method to a large GWA study, and the results were consistent with previously performed studies. We also discuss extensions of the methodology.

Follow GGD on Twitter @genetics_blog

GGD is now on Twitter! I'll be linking to all of our posts on the Twitter page, and occasionally post something there that may not make its way into a full length post here on the blog. You can follow us on Twitter here @genetics_blog.

Browse R Graphics with the R Graph Gallery and the R Graphical Manual

One of R's biggest strengths is its unparalleled graphing capabilities.  Just see any of our previous posts on ggplot2, visualization, or other posts tagged with R. R has several fundamentally different systems for plotting, including base graphics, lattice, and ggplot2.  Furthermore, many add-on packages come with their own functions for producing problem-domain specific graphics. For example, see GenABEL, a very nice R package for GWAS analysis, which has functions for producing manhattan plots, LD plots, etc.

Now let's say you've seen a certain graphic before, and you want to find the package you need to download and which function you should use to make the plot.  That's where the R Graph Gallery and the R Graphical Manual can become very useful.  Both sites give you thumbnail previews of graphics produced by functions bundled with certain R packages, code for producing the graphic, and which R packages you need to download for the functions used to create the graphic.  The R Graphical Manual is much more comprehensive, and is categorized based on CRAN Task Views (CTV) categories (check out all 29 pages of graphics in the Genetics task view).

R Graphical Manual

R Graph Gallery

Monday, December 14, 2009

Sequencing technologies — the next generation

Following up on last week's coverage of the Genotyping Portal, check out this new review article on next-generation sequencing in Nature Reviews Genetics.  One major focus of this paper is that the next generation of sequencing platforms each use fundamentally different technologies.  Because of this, it's likely that multiple platforms will coexist in the marketplace, and different platforms will have clear advantages over others for particular biological applications.  The paper has some nice figures illustrating how the technology works in sequencing by reversible terminators used by Illumina/Solexa and Helicos BioSciences, emulsion PCR used by Life/APG's SOLiD ligation platform and the Roche/454 Pyrosequencing system, and the highly-anticipated real-time single-molecule sequencing from Pacific Biosciences.  Finally, there's a table giving the pros, cons, biological applications, cost, read length, run time, and references for each of the next-gen sequencing applications.  Finally, a revealing piece of information I found in the last table showing sequencing statistics on personal genomes shows that the sequencing of Stephen Quake's genome with Helicos a few months ago cost only $48,000, a decrease of several orders of magnitude compared to the sequencing of J. Craig Venter's genome (Sanger), which cost an estimated $70,000,000 just a few years ago.

The author of the paper, Michael L. Metzker, is an associate professor of genetics at Baylor College of Medicine, a senior manager at the Human Genome Sequencing Center at Baylor, and President & CEO of LaserGen, Inc., Houston, TX.

Sequencing Technologies - The Next Generation (NRG AOP)

Tuesday, December 8, 2009

Genotyping Portal: A comprehensive (and freely available) online resource about methods for DNA genotyping, screening and sequencing

Diego Forero has compiled a comprehensive list of primary publications on commonly used SNP genotyping and DNA sequencing technologies (including SNP arrays, Sequenom, TaqMan, Pyrosequencing, Molecular Beacons, FP-TDI, Invader, xMAP, SNaPshot, SNPlex, Sanger, 454, Illumina, Helicos, SOLiD, Complete Genomics, Bisulfite sequencing, and others).  Also included here are links to review articles, protocols, and links to manufacturers of reagents and equipment.  Where available, links are included to open access versions of the papers on PubMed Central.

This is an excellent resource for anyone who is generally interested in how these technologies work.  For 2nd year grad students at Vanderbilt, you will be asked about some of these technologies on your qualifying exam!

Genotyping Portal: A comprehensive (and freely available) online resource about methods for DNA genotyping, screening and sequencing

Monday, December 7, 2009

Use PuTTY and XMing to see Linux graphics via SSH on your Windows computer

Do you use SSH to connect to a remote Linux machine from your local Windows computer?  Ever needed to run a program on that Linux machine that displays graphical output, or uses a GUI? I was in this position last week trying to make figures using ggplot2 in R of results from an analysis of GWAS data which required using a 64-bit Linux machine with more RAM than my 32-bit windows machine can see.

You try plotting something in R on a Linux machine in an SSH session you'll get this nasty error message:

Error in function (display = "", width, height, pointsize, gamma, bg,: 
X11 I/O error while opening X11 connection to 'localhost:10.0'

Turns out there's a very easy way to see graphical output over your SSH terminal.  First, if you're not already using PuTTY for SSH, download putty.exe from here.  Next, download, install, and run Xming.  While Xming is running in your system tray, log into the Linux server as you normally would using PuTTY.  Then type this command at the terminal to log into the linux server of your choice (here, pepperjack), with the -X (uppercase) to enable X11 forwarding.

ssh -X

If all goes well you should now be able to use programs that utilize graphical output or interfaces, which are running on the remote Linux machine rather than your local windows computer.

Xming - PC X Server

Xming download link on SourceForge

Tuesday, December 1, 2009

Get Started with Machine Learning in R

A Beautiful WWW put together a great set of resources for getting started with machine learning in R.  First, they recommend the previously mentioned free book, The Elements of Statistical Learning.  Then there's a link to a list of dozens of machine learning and statistical learning packages for R.  Next, you'll need data.  Hundreds of free real datasets are available at the UCI machine learning repository.  Each dataset, such as this breast cancer dataset from Wisconsin, has its own page giving a summary, links to publications of major findings, and detailed descriptions of the variables in the data.  If you want to simulate genetic data, check out our software, genomeSIMLA, capable of simulating gene-gene interactions in case-control and family-based GWAS-sized datasets with realistic patterns of linkage disequilibrium.  If you're interested, check out the genomeSIMLA paper.  Finally, if time is not an issue, consider taking MIT's OpenCourseWare Machine Learning course.  Alternatively, check out Stanford Engineering professor Andrew Ng - all his lectures are available on youtube.  Here's the first lecture.

For more, check out the link below.

A beautiful WWW: Guide to Getting Started in Machine Learning

Monday, November 23, 2009

NYT: SAS threatened by R

The New York Times had an interesting piece yesterday about how SAS is facing several business threats from companies like the recently IBM-acquired SPSS, and from burgeoning interest in open-source software like R.  The NYT ran an entire article about R earlier this year, and this article discusses how SAS has been revamping their technology to work seamlessly with R code in response to R's growing popularity in academia and other research labs.

NYT: At a Software Powerhouse, the Good Life Is Under Siege

NYT Slideshow: At SAS, Taking Care of Employees Is Good Business

Wednesday, November 18, 2009

Cancer Epidemiology, Biostatistics, and Bioinformatics Retreat

The 2009 Cancer Epidemiology, Biostatistics, and Bioinformatics Retreat will be held on Friday, December 4th, 2009, from 1:30 pm to 5:00 pm, on the eighth floor of the VICC building (898B PRB). The purpose of the retreat is to promote interactions among biostatisticians, bioinformaticians, epidemiologists, clinical investigators, and other translational researchers. The retreat will include several slide presentations that provide overviews of research direction and summaries of ongoing epidemiologic, biostatistics, and bioinformatics research at Vanderbilt, as well as over 30 poster presentations. These presentations will cover recent findings from epidemiologic studies; resources for epidemiologic and translational research; and statistical challenges and state-of-the-art methodologies in clinical trials, observational studies, and basic science, as well as bioinformatics and systems biology approaches to studying complex diseases.

Looks like the retreat is free to anyone who registers using the form on the website below.  I'm going, hope to see a few of you there.

Cancer Epidemiology, Biostatistics, and Bioinformatics Retreat

Monday, November 16, 2009

Seminar: Reproducible Research with R, LaTeX, & Sweave

Theresa Scott, instructor of the previously mentioned R workshop and weekly R clinic, is giving a lecture entitled "Reproducible Research with R, LaTeX, & Sweave" in MRB III, room 1220, this Wednesday 11/18 at 1:30.  You can see more details about the lecture here.

Looks like her slides as well as much more introductory material on R, Latex, and Sweave are on her website.

Reproducible Research with R, LaTeX, & Sweave

Monday, November 9, 2009

QQ plots of p-values in R using ggplot2

Way back will wrote on this topic.  See his previous post for Stata code for doing this.  Unfortunately the R package that was used to create QQ-plots here has been removed from CRAN, so I wrote my own using ggplot2 and some code I received from Daniel Shriner at NHGRI.

Of course you can use R's built-in qqplot() function, but I could never figure out a way to add the diagonal using base graphics.  To get the function I wrote to make qqplots, copy and paste this into your R session:

qq = function(pvector, title="Quantile-quantile plot of p-values", spartan=F) {
    # Thanks to Daniel Shriner at NHGRI for providing this code for creating expected and observed values
    o = -log10(sort(pvector,decreasing=F))
    e = -log10( 1:length(o)/length(o) )
    # you could use base graphics
    #plot(e,o,pch=19,cex=0.25, xlab=expression(Expected~~-log[10](italic(p))), ylab=expression(Observed~~-log[10](italic(p))),xlim=c(0,max(e)),ylim=c(0,max(e)))
    #You'll need ggplot2 installed to do the rest
    plot=qplot(e,o, xlim=c(0,max(e)), ylim=c(0,max(o))) + stat_abline(intercept=0,slope=1, col="red")
    if (spartan) plot=plot+opts(panel.background=theme_rect(col="grey50"), panel.grid.minor=theme_blank())

Also, make sure you have ggplot2 installed:


If you already have it installed, load it:


The function takes a vector of p-values, optionally a title, and a third option to make the plot black and white.  You can check the default arguments to the function by typing args(qq).

qq(data$Pvals, title="My Quantile-Quantile Plot")

You can also make the plot more "Spartan" by removing the title (setting it to NULL) and by making the color scheme black on white:

qq(data$Pvals, title=NULL, spartan=TRUE)

Stay tuned - I'm also putting together a way to import your PLINK output files and make better Manhattan plots using R and ggplot2 than you ever could with Haploview.

Update Tuesday, November 10, 2009: A big tip of the hat to the anonymous commenter who pointed out that this can easily be done in the base graphics package, as well as adding confidence intervals.  As always, we welcome your comments if you know of a better or easier way to do anything mentioned here!

Wednesday, November 4, 2009

Split, apply, and combine in R using PLYR

While flirting around with previously mentioned ggplot2 I came across an incredibly useful set of functions in the plyr package, made by Hadley Wickham, the same guy behind ggplot2.  If you've ever used MySQL before, think of "GROUP BY", but here you can arbitrarily apply any R function to splits of the data, or write one yourself.

Imagine you have two SNPs, and 2 different variables you want info about across all three genotypes at each locus.  You can generate some fake data using this command:

mydata=data.frame(X1=rnorm(30), X2=rnorm(30,5,2), SNP1=c(rep("AA",10), rep("Aa",10), rep("aa",10)), SNP2=c(rep("BB",10), rep("Bb",10), rep("bb",10)))

The data should look like this:

            X1          X2 SNP1 SNP2
1  -0.73671602  2.76733658   AA   BB
2  -1.83947190  5.70194282   AA   BB
3  -0.46963370  4.89393264   AA   BB
4   0.85105460  6.37610477   AA   BB
5  -1.69043368  8.11399122   AA   BB
6  -1.47995127  4.32866212   AA   BB
7   0.21026063  4.32073489   AA   BB
8   0.19038088  4.26631298   AA   BB
9  -0.29759084  4.07349852   AA   BB
10 -0.46843672  7.01048905   AA   BB
11  0.93804369  3.78855823   Aa   Bb
12 -1.46223800  5.08756968   Aa   Bb
13 -0.01338604 -0.01494702   Aa   Bb
14  0.13704332  4.53625220   Aa   Bb
15  0.59214151  3.75293124   Aa   Bb
16 -0.27658821  5.78701933   Aa   Bb
17  0.05527139  5.99460464   Aa   Bb
18  1.00077756  5.26838121   Aa   Bb
19  0.62871877  6.98314827   Aa   Bb
20  1.18925876  8.61457227   Aa   Bb
21 -0.40389888  3.36446128   aa   bb
22 -1.43420595  6.50801614   aa   bb
23  1.79086285  5.39616828   aa   bb
24  0.15886243  3.98010396   aa   bb
25  0.57746024  4.93605364   aa   bb
26 -1.11158987  6.40760662   aa   bb
27  1.93484117  3.33698750   aa   bb
28  0.10803302  4.56442931   aa   bb
29  0.56648591  4.48502267   aa   bb
30  0.03616814  5.77160936   aa   bb

Now let's say you want the mean and standard deviation of X1 and X2 across each of the multilocus genotypes at SNP1 and SNP2.  First, install and load the plyr package:


Now run the following command:

ddply(mydata, c("SNP1","SNP2"), function(df) data.frame(meanX1=mean(df$X1), sdX1=sd(df$X1), meanX2=mean(df$X2), sdX2=sd(df$X2)))

And you get exactly what you asked for:

  SNP1 SNP2     meanX1      sdX1   meanX2     sdX2
1   aa   bb  0.2223019 1.0855063 4.875046 1.144623
2   Aa   Bb  0.2789043 0.7817727 4.979809 2.286914
3   AA   BB -0.5730538 0.8834038 5.185301 1.601626

That command above may appear complicated at first, but it isn't really.  Cerebral mastication gives a very easy to follow, concise explanation of what that command is doing.  Also, check out the slides he presented at a recent R meetup discussion:

There's more to plyr than ddply, so check it out for yourself!

Monday, November 2, 2009

Common disorders are quantitative traits

There are no common disorders - only extremes of quantitative traits.


That's the argument made by Plomin, Haworth, and Davis in a great review paper just published online in Nature Reviews Genetics.  One of the main themes presented here is that as a disorder becomes more common, a case-control design becomes less and less powerful to identify associated genetic variants because cases become less extreme and the control group becomes increasingly contaminated by "near cases".

Some qualitative disorders have obvious relevant quantitative traits - BMI for obesity, blood pressure for hypertension,  mood for depression.  I recently authored a review with Dana and Marylyn making a similar argument in the context of pharmacogenomics research.  The authors admit, however, that quantitative measurements are not at all apparent for some disorders, such as alcoholism, arthritis, autism, dementia, diabetes, or heart disease.

The review also has a glossary for the uninitiated, encouraging the use of quantitative vocabulary, like linear regression or ANOVA instead of logistic regression, variance and mean differences rather than risk, and covariance rather than comorbidity.

The conclude with statement that the extremes of a distribution are important medically, but there is no scientific or statistical advantage in analyzing artificially constructed disease labels that evolved historically on the basis of symptoms rather than etiology.

Common disorders are quantitative traits (NRG AOP).

Monday, October 19, 2009

Annotated LD-Plots

Often when presenting statistics from a candidate gene study, or a region of interest from a genome-wide association study, it is useful to see various SNP-wise values in the context of linkage disequilibrium patterns. To aid the interpretation of regional association results, we developed a tool called LD-Plus that presents a familiar Haplo-view style correlation plot that is annotated with haplotypes, haplotype blocks, and haplotype frequencies, along with continuous and categorical SNP statistics. An example plot is shown below (click for the larger version).

This plot design displays p-values from association or other types of statistics in the context of quality control measures such as genotyping efficiency, HWE p-values, and minor allele frequency. It also displays association signals relative to LD structure, so the user can clearly see if multiple SNPs in the same LD block show evidence for association. The haplotypes and their relative frequencies are shown along the top of the figure, and gray shading carries the haplotype structure up through the other statistics in the figure. Categorical stats, such as inclusion on genotyping platforms, whether a SNP was imputed or typed, or the consequence of the SNP (coding/non-coding) can be displayed also.

LD-Plus is available via a web interface, and is an evolving project. To give it a try, visit Details on how to properly format input files is available under the "Documentation" link on the site. We'd love your feedback on the design, and what other features we might include in the design!

Thursday, October 15, 2009

Check out the GGD poster at ASHG!

Aloha! If you're going to the American Society of Human Genetics meeting next week, come by our poster (poster #116, abstract #1567/T) and say hello! We would love to meet any of you who read GGD and hear what you think and what you'd like to see in the future.

Wednesday, October 14, 2009

Free Book: Elements of Statistical Learning

The Elements of Statistical Learning: Data Mining, Inference, and Prediction by Trevor Hastie, Robert Tibshirani, and Jerome Friedman, one of the best books on data mining and machine learning, is now available free in PDF format.

Download it here or view it online here.

Friday, October 9, 2009

Visualizing sample relatedness in a GWAS using PLINK and R

Strict quality control procedures are extremely important for any genome-wide association study.  One of the first steps you should take when running QC on your GWAS is to look for related samples in your dataset.  This does two things for you.  First, you can get an idea of how many related samples you have in your dataset, and second, if you have access to self-report relationship information, you can identify potential sample mix-ups based on discrepancies between genetic information and self-report.

PLINK allows you to estimate genomewide IBD-sharing coefficients between seemingly unrelated individuals from whole-genome data.  You can read lots more about the particular method they use to infer IBD given IBS and allele frequencies in the PLINK paper.  It's relatively straightforward to compute these estimates.  If you already have your data in pedfile format, use the following command (assuming mydata.ped and are in the current directory where you run PLINK).

plink --file mydata --genome --min 0.05

This will usually take a few days if you use genome-wide data.  You could speed this up by first thinning your dataset to around 100,000 markers using the --thin option to create a smaller dataset.

After you've run this you'll have a file in that directory called plink.genome.  You can read about the output here, but we're really interested in Z1 and Z0, the proportion of markers identical by descent 1 and 0 respectively, for every pair of individuals in the dataset.  You can plot Z1 by Z0 and color code them by the relationship type as coded in the pedfile using this R code.

Here's what the columns of interest will look like:

  RT     Z1     Z0
1 UN 0.1697 0.8294
2 OT 0.5034 0.4879
3 OT 0.4403 0.5439
4 OT 0.5200 0.4674
5 UN 0.1646 0.8311
6 OT 0.5519 0.4481

First, fire up R, go to the directory where your plink results are located and load in the data:

d = read.table("plink.genome", header=T)

Next, set up the plot area. The par option makes the points solid instead of an open circle, and the plot command sets up your plot and axes without adding any points.

with(d,plot(Z0,Z1, xlim=c(0,1), ylim=c(0,1), type="n"))

Your plot should be empty, like this:

Now use this command to plot points where the relationship type is FS, or full sibling.  Make the points green (col=3).

with(subset(d,RT=="FS") , points(Z0,Z1,col=3))

Do the same thing with half sibs, "other related" (have the same family ID), parent offspring pairs, and unrelated, making each of them a different color.

with(subset(d,RT=="HS") , points(Z0,Z1,col="darkorange"))
with(subset(d,RT=="OT") , points(Z0,Z1,col=4))
with(subset(d,RT=="PO") , points(Z0,Z1,col=2))
with(subset(d,RT=="UN") , points(Z0,Z1,col=1))

You might want to rerun the commands for "other related" and especially half sibs, because they were covered up when you plotted unrelateds.

with(subset(d,RT=="OT") , points(Z0,Z1,col=4))
with(subset(d,RT=="HS") , points(Z0,Z1,col="darkorange"))

Finally, add a legend

legend(1,1, xjust=1, yjust=1, legend=levels(d$RT), pch=16, col=c(3,"darkorange",4,2,1))

For a recap, your code should look like this:

d = read.table("plink.genome", header=T)

with(d,plot(Z0,Z1, xlim=c(0,1), ylim=c(0,1), type="n"))

with(subset(d,RT=="FS") , points(Z0,Z1,col=3))
with(subset(d,RT=="HS") , points(Z0,Z1,col="darkorange"))
with(subset(d,RT=="OT") , points(Z0,Z1,col=4))
with(subset(d,RT=="PO") , points(Z0,Z1,col=2))
with(subset(d,RT=="UN") , points(Z0,Z1,col=1))
with(subset(d,RT=="OT") , points(Z0,Z1,col=4))
with(subset(d,RT=="HS") , points(Z0,Z1,col="darkorange"))

legend(1,1, xjust=1, yjust=1, legend=levels(d$RT), pch=16, col=c(3,"darkorange",4,2,1))

What you can see from the above plot is that self-reported parent-offspring pairs share 100% of their alleles IBD=1.  There is a single sib pair at the bottom left where Pr(IBD=1)=0 and Pr(IBD=0)=0.  This means that this sib pair shares 2 alleles identical by descent at every locus across the genome.  This is either a duplicated sample or a set of identical twins.  The full sibs cluster right around the middle of the plot, and you have lots of unrelated or "other related" pairs stretching from completely unrelated (bottom right quadrant) to relatively closely related (2nd or 3rd degree relatives, right around the middle of the plot).  If you see self-reported sibs showing up where the unrelated individuals cluster on this plot, or vice versa, this should clue you in on potential sample identity problems.


Just for fun, let's make the same plot using ggplot2.  Make sure you have the data loaded as above, then install ggplot2 if you haven't already


You only have to install the package once.  Next load the package:


Use this incredibly simple command to make the plot:


I'm not a huge fan of these colors, so lets make them the same as above:

qplot(Z0,Z1,data=d,colour=RT) + scale_colour_manual(value=c(3,"darkorange",4,2,1))

That's a little better.  Obviously the code to do this is MUCH simpler in ggplot2.  The only problem I have with this is that it automatically plots the points in the order of the levels of RT, ie. FS, then HS, then OT, PO, then UN, so you end up covering up all your half-sibs with "other related" and unrelated pairs.  Post a comment if you know how to go back and replot the HS points over the UN points.  For more on ggplot2, see my previous post comparing histograms made using Stata, R base, R lattice and R ggplot2.

Tuesday, October 6, 2009

R Commander: A Basic Statistics GUI for R

R is a great tool with lots of resources for genetics, genome-wide association studies, and many other biological applications.  We've covered several places to find help in R in the past, but if you're still apprehensive about diving into R's command-line interface, fear not.  The R commander is a graphical user interface (GUI) for R that works under Windows, Linux, and Mac.  It features a data editor that looks a lot like an Excel spreadsheet or Stata's data editor.  It also has menus and clicky-boxes to do lots of basic analyses and for producing graphics without ever typing a single command.

R commander depends on lots of other packages to function properly, so to install it along with all it's dependencies, type this in at the R command line:

install.packages("Rcmdr", dependencies=TRUE)

Note that this may take 5-10 minutes depending on your connection since it's downloading lots of extra packages, but you only need to do that once.  Now that it's installed, any time you want to load it, open R and type:


If you're familiar with Stata or SPSS you should find the menus and dialog boxes intuitive and self-explanatory. I've put in a couple screenshots below showing R commander's data editor, GLM tool, and graphics menu.

There are a few other GUIs for R that I've never used, including Rattle (the R Analytical Tool To Learn Easily), and JGR (Java Gui for R).  If you prefer one of these to Rcmdr, let us here why in the comments!

The R Commander: A Basic-Statistics GUI for R

Friday, October 2, 2009

Genomics to Confirm Nationality?

Thanks to Kylee for pointing out this frightening trend from the U.K. Border Agency.

Essentially, the United Kingdom is experimenting with using mitochondrial and Y-chromosome markers to determine if asylum-seekers that flee persecution are actually from the nation in question.

Advocates of this policy don't seem to understand that genetic variation is not nation-specific. Just as you cannot "look" at a person and determine their nationality, you cannot look at their DNA to make that determination either.

Friday, September 25, 2009

What happens when a consumer genetics company goes bankrupt?

Dan Vorhaus and Lawrence Moore recently put together this excellent three part series on Genomics Law Report.  Headlines about deCODE Genetics on the brink of insolvency and major shifts in the upper management of 23andMe inspired this series of posts on what would happen when a direct-to-consumer (DTC) genomics company goes declares bankruptcy.

Bankruptcy law authorizes the sale of the assets of a business in bankruptcy, and genomic data is likely the most valuable asset of any DTC genomics company.  First the authors dissect the privacy policy and terms of service for three major DTC companies: 23andMe, deCODE Genetics, and TruGenetics.  Next there's a discussion of how the legal system would treat a DTC genomics company's bankruptcy.  The series wraps up with a brief discussion of how this ultimately affects the average DTC genomics cutomer.

Genomics Law Report: What happens if a DTC Genomics Company Goes Belly-Up?

Wednesday, September 23, 2009

JBrowse: a JavaScript Based Genome Browser

Genome Browsers are nothing new, but JBrowse is a new JavaScript based genome browser that uses information from the UCSC genome browser and has the look and feel of Google Maps.  It's extremely easy to zoom in and out and scroll around because all the "work" is being done by your computer rather than some server farm thousands of miles away.  OpenHelix is calling it a gamechanger, and they have a nice video demonstration showing off some of JBrowse's features.  Click the Drosophila or Homo sapiens genome and give JBrowse a spin for yourself!

The JBrowse genome browser

Monday, September 21, 2009

Comparison of plots using Stata, R base, R lattice, and R ggplot2, Part I: Histograms

One of the nicer things about many statistics packages is the extremely granular control you get over your graphical output.  But I lack the patience to set dozens of command line flags in R, and I'd rather not power the computer by pumping the mouse trying to set all the clicky-box options in Stata's graphics editor.  I want something that just looks nice, using the out-of-the-box defaults.  Here's a little comparison of 4 different graphing systems (three using R, and one using Stata) and their default output for plotting a histogram of a continuous variable split over three levels of a categorical variable.

First I'll start with the three graphing systems in R: base, lattice, and ggplot2.  If you don't have the last two packages installed, go ahead and download them:


Now load these two packages, and download this fake dataset I made up containing 100 samples each from three different genotypes ("geno") and a continuous outcome ("trait")


Now let's get started...

R: base graphics


R: lattice

histogram(~trait | factor(geno), data=mydat, layout=c(1,3))

R: ggplot2


# Update Tuesday, September 22, 2009
# A commenter mentioned that this code did not work.
# If the above code does not work, try explicitly
# stating that you want a histogram:


insheet using "", comma clear
histogram trait, by(geno, col(1))


In my opinion ggplot2 is the clear winner.  Again I'll concede that all of the above graphing systems give you an incredible amount of control of every aspect of the graph, but I'm only looking for what gives me the best out-of-the-box default plot using the shortest command possible. R's base graphics give you a rather spartan plot, with very wide bins.  It also requires 4 lines of code.  (If you can shorten this, please comment).  By default, the base graphics system gives you counts (frequency) on the vertical axis.  The lattice package in R does a little better perhaps, but the default color scheme is visually less than stellar.  Also, I'm not sure why the axis labels switch sides every other plot, and the ticks on top of the plot are probably unnecessary.  I still think the bins are too wide.  You lose some information especially on the bottom plot towards the right tail.  The vertical axis is proportion of total.  Stata's default plot looks very similar to lattice, but again uses a very unattractive color scheme.  It uses density for the vertical axis, which may not mean much to non-statisticians.  The default plot made by ggplot2 is just hands-down good-looking.  There are no unnecessary lines delimiting the bins, and the binwidth is appropriate.  The vertical axis represents counts.  The black bars on the light-gray background have a good data-ink ratio.  And it required the 2nd shortest command, only 3 characters longer than the Stata equivalent.

I'm ordering the ggplot2 book (Amazon, ~$50), so as I figure out how to do more with ggplot2 I'll post more comparisons like this.  If you use SPSS, SAS, MATLAB, or something else, post the code in a comment here and send me a picture or link to the plot and I'll post it here.

Wednesday, September 16, 2009

PCG Journal Club Articles, 9/11

There were only a couple of citations for articles discussed at this week's PCG meeting (September 11). Our next meeting is scheduled for September 26.


Kim S, Xing EP. Statistical estimation of correlated genome associations to a quantitative trait network. PLoS Genet. 2009 Aug; 5(8):e1000587.

Zamar D, Tripp B, Ellis G, Daley D. Path: a tool to failitate pathway-based genetic association analysis. Bioinformatics. 2009 Sep 15; 25(18):2444-6

R clinic this week: Regression Modeling Strategies in R

At this week's R clinic Frank Harrell will unveil the new rms (Regression Modeling Strategies) package that is a replacement for the R Design package.  He will demonstrate the differences with Design, especially related to enhanced graphics for displaying effects in regression models.  Frank will also discuss the implementation of quantile regression in rms.  The rms package website has links to the manual, examples of graphical output, and printable reference cards for many of the package's commands.  It also makes a point that many of rms's graphics capabilities are modular and will play nicely with previously mentioned ggplot2.

To install the rms package, start R and type:

install.packages("rms", dependencies=TRUE)

Then to load it any time thereafter,


The R clinic is held by the Vanderbilt biostatistics department every Thursday 2-3pm and free to anyone who wants to attend.  More information here.

Monday, September 14, 2009

Find the function you're looking for in R

Any R user no matter what level of experience has had trouble finding the package or the function to do what you want to do and then figuring out how to use it.  The sos package in R just made that a lot easier.

First, fire up R, then install the sos package (don't omit the quotes):


It'll ask you to choose a mirror.  Choose the closest one.  After it installs, load the package (omit the quotes this time):


This loaded all the functions that come with the sos package, including a particularly useful one called findFn.  It scans the "function" entries in Jonathan Baron's "R site search" database.  Give it a try, using "epistasis" with the quotes as the keyword.


This should open up a web browser that displays relevant functions, the package you need to download (using the above procedure) to use the function, and a link to the help page for that function.

You can also use ??? as an alias for findFn.  Try it like this (use the quotes):

???"genome wide"

Once you have the sos package installed, type vignette("sos") for more information on how to use various functions in this package.

If you still can't find what you're looking for, check out my previous post on finding help on R, and if all else fails, don't forget about Theresa Scott's free weekly R clinic / Q&A sessions.

Thursday, September 10, 2009

Machine Learning in R

Revolutions blog recently posted a link to R code by Joshua Reich with self-contained examples of using machine learning techniques in R, including various clustering methods (k-means, nearest neighbor, and kernel), recursive partitioning (CART), principle components analysis, linear discriminant analysis, and support vector machines.  This post also links to some slides that go over the basics of machine learning.  Looks like a good place to start learning about ML before handrolling your own code.

Be sure to check out one of Will's previous post on hierarchical clustering in R.

Revolutions: Machine learning in R, in a nutshell

Wednesday, September 9, 2009

Sync your home directories on ACCRE and the local Linux servers (a.k.a. "the cheeses")

Vanderbilt ACCRE users with PCs only...

If you use ACCRE to run multi-processor jobs you'll be glad to know that they now allow you to map your home directory to your local desktop using Samba (so you can access your files through My Computer as you normally would with local files).  Just submit a help request on their website and they'll get you set up.

Now if you have both your ACCRE home and your home on the cheeses mapped, you can easily sync the files between the two.  Download Microsoft's free SyncToy to do the job.  It's pretty dead simple to set up, and one click will synchronize files between the two servers.

I didn't want to synchronize everything, so I set it up to only sync directories that contain perl scripts and other programs that I commonly use on both machines.  SyncToy also seems pretty useful for backing up your files too.

Microsoft SyncToy

Ask ACCRE to let you map your home

Tuesday, September 8, 2009

Get the full path to a file in Linux / Unix

In the last post I showed you how to point to a file in windows and get the full path copied to your clipboard.  I wanted to come up with something similar for a Linux environment.  This is helpful on Vampire/ACCRE because you have to fully qualify the path to every file you use when you submit jobs with PBS scripts. So I wrote a little perl script:

print "$pwd\n" if @ARGV==0;
foreach (@ARGV) {print "$pwd/$_\n";}

You can copy this from me, just put it in your bin directory, like this:

cp /home/turnersd/bin/path ~/bin

Make it executable, like this:

chmod +x ~/bin/path

Here it is in action. Let's say I wanted to print out the full path to all the .txt files in the current directory.  Call the program with arguments as the files you want to print the path to:

[turnersd@vmps21 replicated]$ ls
[turnersd@vmps21 replicated]$ path *.txt

Sure, it's only a little bit quicker than typing pwd, copying that, then spelling out the filenames.  But if you have long filenames or lots of filenames you want to copy, this should get things done faster.  Enjoy.

Friday, September 4, 2009

ClipPath copies filename and path from windows for loading into R

I wish I would have discovered this long ago.  Loading data into R or MySQL requires you to specify the full path to the file.  If you do this on a Windows machine there are two annoyances.  First, if you save something to your desktop the path to your desktop is really long.  Second, windows by default uses backslashes "\" in the file path, while R or other software requires forward slashes "/".  ClipPath is a tiny program that adds an entry to your right-click menu to copy the full file path with a forward slash, then you can paste the filename into whatever program you're using.

Download the zipfile from the website below (here's a direct link to the zip file).  Extract it's contents, right click on ClipPath.inf, and choose install.  You can always uninstall later through the control panel.

ClipPath Shell Extension

Thursday, September 3, 2009

GGD posts now printer-friendly

A quick announcement about a formatting fix here - you can now print posts from GGD a little more cleanly.  When you print it should no longer include the title and sidebar, so most posts should now only use a page or two.

Wednesday, September 2, 2009

PCG Journal Club Articles, 8/28

Here are citations for the articles discussed at our most recent meeting (August 28). I have also appended a link to Nature Reviews: Genetics new series, Fundamental concepts in genetics, at the end. Our next meeting is scheduled for September 11.


Gurwitz D, Fortier I, Lunshof JE, Knoppers BM. Research ethics: Children and population biobanks. Science. 2009 Aug 14; 325(5942):818-9

Hastings PJ, Lupski JR, Rosenberg SM, Ira G. Mechanisms of change in gene copy number. Nat Rev Genet. 2009 Aug; 10(8):551-64.

Holsinger KE, Weir BS. Genetics in geographically structured populations: defining, estimating and interpreting F(ST). Nat Rev Genet. 2009 Sep; 10(9):639-50.

Koga Y, Pelizzola M, Cheng E, Krauthammer M, Sznol M, Ariyan S, Narayan D, Molinaro AM, Halaban R, Weissman SM. Genome-wide screen of promoter methylation identifies novel markers in melanoma. Genome Res. 2009 Aug; 19(8):1462-70.

Monier A, Pagarete A, de Vargas C, Allen MJ, Read B, Claverie JM, Ogata H. Horizontal gene transfer of an entire metabolic pathway between a eukaryotic alga and its DNA virus. Genome Res. 2009 Aug; 19(8):1441-9.

Raveh-Sadka T, Levo M, Segal E. Incorporating nucleosomes into thermodynamic models of transcription regulation. Genome Res. 2009 Aug; 19(8):1480-96.

Rosenberg NA, Vanliere JM. Replication of genetic associations as pseudoreplication due to shared genealogy. Genet Epidemiol. 2009 Sep; 33(6):479-87.

Schmitz D, Netzer C, Henn W. An offer you can't refuse? Ethical implications of non-invasive prenatal diagnosis. Nat Rev Genet. 2009 Aug; 10(8):515.

Fundamental concepts in genetics:

Friday, August 28, 2009

Convert PLINK output to CSV

I tip my hat to Will for showing me this little command line trick. PLINK's output looks nice when you print it to the screen, but it can be a pain to load the output into excel or a MySQL database because all the fields are separated by a variable number of spaces. This little command line trick will convert a variable-space delimited PLINK output file to a comma delimited file.

You need to be on a Linux/Unix machine to do this. Here's the command. I'm looking at results from Mendelian errors here. Replace "mendel" with the results file you want to reformat, and put this all on one line.

cat mendel.txt | sed -r 's/^\s+//g' | sed -r 's/\s+/,/g' > mendel.csv

You'll have created a new file called results.hwe.csv that you can now open directly in Excel or load into a database more easily than you could with the default output.


turnersd@provolone~: cat mendel.txt
1089 16223073 16223062 1 149
1116 16233564 16233589 1 114
123 16230767 16230725 2 189
12 16221778 16221803 1 116
12 16221805 16221822 1 98
12 16230486 16230496 1 76
12 16231205 16232111 2 180
134 16222939 16222945 2 140
1758 16230755 16231121 2 206


turnersd@provolone~: cat mendel.csv

If you're interested in the details of what this is doing here you go:

First, you cat the contents of the file and pipe it to a command called sed. The thing between the single quotes in the sed command is called a regular expression, which is similar to doing a find-and-replace in MS Word. What this does is searches for the thing between the first pair of slashes and replaces it with the thing between the next two slashes. You need the -r option, and the "s" before the first and the "g" after the last slash to make it work right.

/^\s+// is the first regular expression. \s is special and it means means search for whitespace. \s+ means search for any amount of whitespace. The ^ means only look for it at the beginning of the line. Notice there is nothing between the second and third slashes, so it will replace any whitespace with nothing. This part will trim any whitespace from the beginning of the line, which is important because in the next part we're turning any remaining whitespace into a comma, so we don't want the line to start with a comma.

/\s+/,/ is the second regular expression. Again we're searching for a variable amount of whitespace but this time replacing it with a comma.

Tuesday, August 25, 2009

Great Quote from Our Own Doug Mortlock

'It’s stunning to see a genetic modification like this,' developmental geneticist Douglas Mortlock of Vanderbilt University in Nashville, Tenn., says of the new study, published online July 16 in Science. 'This is the gene that makes wiener dogs short-legged.'

Thursday, August 13, 2009

Beautiful Info-graphic

While not directly related to genetics, this is an excellent example of well-designed data representation. The New York Times reports the results of a survey of average time spent on various activities through the day by different groups of people.

The graphic is essentially a stacked density plot with time (24 hours) on the X-axis. Clicking on a different group of individuals provides a very smooth transition to the new density distribution, allowing an animated visual comparison. In some ways, this animated version provides an easier comparison than showing multiple versions of the same figure. Furthermore, there is just something compelling about this figure that begs you to examine it more closely...

Next-Gen Sequencing

Logan recently emailed me an article in the New York Times about single-molecule DNA sequencing and I realized I knew next to nothing about the new and emerging technology that will change the way we do association studies (that is, if we're still even trying to find genetic associations in the first place). The Wellcome Trust posted a news feature a few weeks back giving brief explanations and short videos on DNA sequencing, starting with the old Sanger method, then the second generation 454 and Illumina (Solexa) technologies. They also give a quick overview and and link to some of the 3rd generation technologies in the pipeline, including Pac Bio, Oxford Nanopore, and Complete Genomics.

Wellcome Trust feature on next-gen sequencing

Wednesday, August 12, 2009

Systems Biology Graphical Notation

The Systems Biology Graphical Notation (SBGN) project is an effort to standardize the graphical notation used in diagrams of pathways, biochemical processes, and cellular processes studied in systems biology.

SBGN defines a comprehensive set of symbols with precise semantics, together with detailed syntactic rules defining their use and how diagrams are to be interpreted. By standardizing the visual notation, SBGN can serve as a bridge between different communities in research, education, publishing, and more. The real payoff will come when researchers are as familiar with the notation as electronics engineers are familiar with the notation of circuit schematics. If researchers are saved the time and effort required to familiarize themselves with different notations, they can spend more time thinking about the biology being depicted.

You can view the project info here.

Tuesday, August 11, 2009

ggplot2 workshop at Vanderbilt

Hadley Wickham, creator of the previously mentioned R plotting system ggplot2 and author of a forthcoming book from Springer, is teaching a workshop in data visualization using R, ggplot2, and GGobi. Unfortunately this workshop conflicts with IGES and ASHG this year, but he mentioned the possibility of holding a workshop here at Vanderbilt if there is enough interest. Leave a comment or email me if you'd be interested in attending this workshop if it is held at Vanderbilt.

Saturday, August 8, 2009

PCG Journal Club Articles, 7/31

Here are citations for the articles discussed at our most recent meeting (July 31). Our next meeting is scheduled for August 14.

Bogdanowicz W, Allen M, Branicki W, Lembring M, Gajewska M, Kupiec T. Genetic identification of putative remains of the famous astronomer Nicolaus Copernicus. Proc Natl Acad Sci U S A, 2009 Jul 7 [Epub ahead of print].

Green RC, Roberts JS, Cupples LA, Relkin NR, Whitehouse PJ, Brown T, Eckert SL, Butson M, Sadovnick AD, Quaid KA, Chen C, Cook-Deegan R, Farrer LA; REVEAL Study Group. Disclosure of APOE genotype for risk of Alzheimer's disease. N Engl J Med, 2009 Jul 16; 361(3):245-54.

International Schizophrenia Consortium, Purcell SM, Wray NR, Stone JL, Visscher PM, O'Donovan MC, Sullivan PF, Sklar P. Common polygenic variation contributes to risk of schizophrenia and bipolar disorder. Nature, 2009 Aug 6; 460(7256):748-52.

Pevzner P, Shamir R. Computing has changed biology--biology education must catch up. Science, 2009 Jul 31; 325(5940):541-2. No abstract available.

Pomerantz MM, Ahmadiyeh N, Jia L, Herman P, Verzi MP, Doddapaneni H, Beckwith CA, Chan JA, Hills A, Davis M, Yao K, Kehoe SM, Lenz HJ, Haiman CA, Yan C, Henderson BE, Frenkel B, Barretina J, Bass A, Tabernero J, Baselga J, Regan MM, Manak JR, Shivdasani R, Coetzee GA, Freedman ML. The 8q24 cancer risk variant rs6983267 shows long-range interaction with MYC in colorectal cancer. Nat Genet, 2009 Aug; 41(8):882-4.

Rosser ZH, Balaresque P, Jobling MA. Gene conversion between the X chromosome and the male-specific region of the Y chromosome at a translocation hotspot. Am J Hum Genet, 2009 Jul; 85(1):130-4.

Tuupanen S, Turunen M, Lehtonen R, Hallikas O, Vanharanta S, Kivioja T, Björklund M, Wei G, Yan J, Niittymäki I, Mecklin JP, Järvinen H, Ristimäki A, Di-Bernardo M, East P, Carvajal-Carmona L, Houlston RS, Tomlinson I, Palin K, Ukkonen E, Karhu A, Taipale J, Aaltonen LA. The common colorectal cancer predisposition SNP rs6983267 at chromosome 8q24 confers potential to enhanced Wnt signaling. Nat Genet, 2009 Aug; 41(8):885-90. Epub 2009 Jun 28.

Wain LV, Armour JA, Tobin MD. Genomic copy number variation, human health, and disease. Lancet, 2009 Jul 25; 374(9686):340-50. Review.

Thursday, August 6, 2009

For Today’s Graduate, Just One Word: Statistics

That's the title of a good article published yesterday in the New York Times about the emergence of statistics being in huge demand in the career market, becoming "the sexy job in the next 10 years" as Google's chief economist puts it. Now I just need to find one of these don't drink and derive t-shirts...

For Today’s Graduate, Just One Word: Statistics
Creative Commons License
Getting Genetics Done by Stephen Turner is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported License.