Tuesday, November 30, 2010

Abstract Art with PubMed2Wordle

While preparing for my upcoming defense, I found a cool little web app called pubmed2wordle that turns a PubMed query into a word cloud using text from the titles and abstracts returned by the query. Here are the results for a PubMed query for me ("turner sd AND vanderbilt"):

[Word cloud for the query "turner sd AND vanderbilt"]

And quite different results for where I'm planning to do my postdoc:

[Word cloud for the lab where I'm planning to do my postdoc]

Looks useful for quickly getting a sense of what other people work on.

http://www.pubmed2wordle.appspot.com/

Monday, November 29, 2010

Defense!

On Friday, December 3rd, at 8:00 AM, after copious amounts of coffee, my friend, colleague, and perpetual workout buddy Stephen Turner will defend his thesis.
Knowledge-Driven Genome-wide Analysis of Multigenic Interactions Impacting HDL Cholesterol Level

Join us in room 206 of the Preston Research Building at Vanderbilt Medical Center for the auspicious occasion!

Wednesday, November 24, 2010

How to Not Suck at PowerPoint, and the Secrets of Malcolm Gladwell

Designer Jesse Dee has an entertaining presentation on Slideshare about how to use PowerPoint effectively (although Edward Tufte might assert that such a thing is impossible). These are all things we probably know but don't consider often enough when giving a presentation.

According to Dee, the most common mistake is lack of preparation. In a survey by a company that specializes in presentation skills coaching, 86% of executives said that presentation skills affect their career and income, yet only 25% spend more than two hours preparing for "high-stakes" presentations.

Malcolm Gladwell, author of Blink, The Tipping Point, Outliers, and an excellent New Yorker article on drug development ("The Treatment: Why is it so Difficult to Develop Drugs for Cancer?"), is known for delivering a seemingly effortless presentation, ending at exactly the right time, without ever looking at his watch (see his TED talk on spaghetti sauce). When Financial Times writer Gideon Rachman asked how he does it, Gladwell responded, "I know it may not look like this. But it’s all scripted. I write down every word and then I learn it off by heart. I do that with all my talks and I’ve got lots of them."

I've had lots of folks tell me the best way to give a talk is to throw up a few main points and wing it through an hour, but perhaps rote memorization is a more attractive alternative. Then again, in our line of work, where we show lots of data, tables, figures, and statistics, it's already easy enough to bore your audience, and delivering a memorized speech might make this worse. I tend to prefer something in between complete improv and autopilot. What are your favorite tips for presenting scientific, quantitative, or statistical data and results?



You Suck at PowerPoint (Slideshare, via @lifehacker)

Financial Times - Secrets of Malcolm Gladwell

Malcolm Gladwell - The Treatment: Why is it so Difficult to Develop Drugs for Cancer?

Tuesday, November 23, 2010

Randomly Select Subsets of Individuals from a Binary Pedigree .fam File

I'm working on imputing GWAS data to the 1000 Genomes Project data using MaCH. For the model estimation phase you only need ~200 individuals. Here's a one-line Unix command that will pull out 200 samples at random from a binary pedigree .fam file called myfamfile.fam:

for i in `cut -d ' ' -f 1-2  myfamfile.fam | sed s/\ /,/g`; do echo "$RANDOM $i"; done | sort |  cut -d' ' -f 2| sed s/,/\ /g | head -n 200

Redirect this output to a file, and then run PLINK using the --keep option with this new file.
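
For what it's worth, the one-liner works by tagging each FID/IID pair with a random number, sorting on that tag, and keeping the first 200. If your system has GNU coreutils' shuf, a shorter equivalent is below; myfamfile.fam is as above, while mydata and keep.txt are hypothetical names:

# Randomly sample 200 lines and keep the FID and IID columns
shuf -n 200 myfamfile.fam | cut -d ' ' -f 1-2 > keep.txt
# Extract just those samples into a new binary pedfile
plink --bfile mydata --keep keep.txt --make-bed --out mydata.sub200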

Wednesday, November 17, 2010

Syntax Highlighting R Code, Revisited

A few months ago I showed you how to syntax-highlight R code using GitHub Gists for displaying R code on your blog or other online medium. The idea's really simple if you use Blogger: head over to gist.github.com, paste in your R code, create a public "gist", hit "embed", then copy the JavaScript onto your blog. However, if you use a hosted wordpress.com blog or another platform that doesn't allow JavaScript tags within a post, you can't use this method.

While I still prefer GitHub Gists for archiving and version control, there's an alternative that works where JavaScript doesn't. Inside-R has a nice tool, the Pretty R Syntax Highlighter: you simply paste in your R code, and it generates HTML that syntax-highlights it. What's more, functions in your R code link back to their documentation on inside-R.org. Here's an example of some code I posted a while back on making QQ plots of p-values using R base graphics.

Without any highlighting, it's hard to read, and spacing isn't faithfully preserved:

# Define the function
ggd.qqplot = function(pvector, main=NULL, ...) {
    o = -log10(sort(pvector,decreasing=F))
    e = -log10( 1:length(o)/length(o) )
    plot(e,o,pch=19,cex=1, main=main, ...,
        xlab=expression(Expected~~-log[10](italic(p))),
        ylab=expression(Observed~~-log[10](italic(p))),
        xlim=c(0,max(e)), ylim=c(0,max(o)))
    lines(e,e,col="red")
}

#Generate some fake data that deviates from the null
set.seed(42)
pvalues=runif(10000)
pvalues[sample(10000,10)]=pvalues[sample(10000,10)]/5000

# pvalues is a numeric vector
pvalues[1:10]

# Using the ggd.qqplot() function
ggd.qqplot(pvalues)

# Add a title
ggd.qqplot(pvalues, "QQ-plot of p-values using ggd.qqplot")


Here's the same code embedded using a GitHub Gist:

[Embedded GitHub Gist]

And here's the same code using Inside-R.org's Pretty R Syntax Highlighter. Note that function calls are hyperlinks to each function's documentation on inside-R.org.

# Originally posted at http://gettinggeneticsdone.blogspot.com/2010/07/qq-plots-of-p-values-in-r-using-base.html
 
# Define the function
ggd.qqplot = function(pvector, main=NULL, ...) {
    o = -log10(sort(pvector,decreasing=F))
    e = -log10( 1:length(o)/length(o) )
    plot(e,o,pch=19,cex=1, main=main, ...,
        xlab=expression(Expected~~-log[10](italic(p))),
        ylab=expression(Observed~~-log[10](italic(p))),
        xlim=c(0,max(e)), ylim=c(0,max(o)))
    lines(e,e,col="red")
}
 
#Generate some fake data that deviates from the null
set.seed(42)
pvalues=runif(10000)
pvalues[sample(10000,10)]=pvalues[sample(10000,10)]/5000
 
# pvalues is a numeric vector
pvalues[1:10]
 
# Using the ggd.qqplot() function
ggd.qqplot(pvalues)
 
# Add a title
ggd.qqplot(pvalues, "QQ-plot of p-values using ggd.qqplot")


Have any other solutions for syntax highlighting R code? Please share in the comments!

GitHub Gists
Inside-R Pretty R Syntax Highlighter

Tuesday, November 16, 2010

Parallelize IBD estimation with PLINK

Obtaining the probability that zero, one, or two alleles are shared identical by descent (IBD) is useful for many reasons in a GWAS analysis. A while back I showed you how to visualize sample relatedness using R and ggplot2, which requires IBD estimates. PLINK's --genome option uses IBS and allele frequencies to infer IBD. While a recent article in Nature Reviews Genetics on IBD and IBS analysis demonstrates potentially superior approaches, PLINK's approach is definitely the easiest because of its superior data management capabilities. The problem with IBD inference is that while computation time is linear in the number of SNPs, it's quadratic (read: SLOW) in the number of samples. With GWAS data on 10,000 samples, there are (10,000 choose 2) = 49,995,000 pairwise IBD estimates. This can take quite some time to calculate on a single processor.

A developer in Will's lab, Justin Giles, wrote a Perl script that uses one of PLINK's advanced features, --genome-lists, which takes two files as arguments. You can read about this feature under the advanced hints section of the PLINK documentation on IBS clustering. Each of these files contains a list of family IDs and individual IDs of samples for whom you'd like to calculate IBD. In other words, you can break up the IBD calculations by groups of samples instead of requiring a single process to do it all. The Perl script takes the base filename of your binary pedfile and parses the .fam file to split the list of individuals into small chunks. The size of these chunks is specified by the user. Assuming you have access to a high-performance computing cluster that uses Torque/PBS for scheduling and job submission, the script also writes out PBS files that can be used to submit each segment to a node on the cluster (this can easily be modified to fit other parallelization frameworks). The script also needs all the minor allele frequencies, which can easily be obtained with PLINK's --freq option.
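
Generating the frequency file is a one-liner; a minimal sketch, assuming your binary pedfile's base name is data:

plink --bfile data --freq --out data

This writes the allele frequencies to data.frq.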

One of the first things the script does is parse and split your .fam file into chunks of N individuals (where N is set by the user; I used 100, and estimation took only ~15 minutes). This is accomplished with a simple gawk command, which writes chunks named tmp.list000, tmp.list001, and so on:

gawk '{print $1,$2}' data.fam | split -d -a 3 -l 100 - tmp.list

Then the script sets up PBS scripts (essentially shell scripts with scheduler directives) that run PLINK on each pair of chunks:
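
The PBS files themselves were shown in an embedded Gist. As a rough sketch under stated assumptions (hypothetical resource requests; base name data, frequencies in data.frq, and the tmp.list chunks from the split command above), each one looks something like:

#!/bin/bash
#PBS -l nodes=1:ppn=1
#PBS -l walltime=01:00:00
cd /scratch/username/ibd
# Estimate IBD only between samples in the first two chunks,
# reusing the precomputed allele frequencies
plink --bfile data --read-freq data.frq \
      --genome --genome-lists tmp.list000 tmp.list001 \
      --out data.sub.1.2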



At which point you'll have the output files...

data.sub.1.1.genome
data.sub.1.2.genome
data.sub.1.3.genome
data.sub.1.4.genome
data.sub.2.2.genome
data.sub.2.3.genome
data.sub.2.4.genome
data.sub.3.3.genome
data.sub.3.4.genome
data.sub.4.4.genome


...that you can easily concatenate.

Here's the Perl script below. To run it, give it the full path to your binary pedfile, the number of individuals in each "chunk" to infer IBD between, and the fully qualified path to the .frq file you get from running plink --freq. If you're not using PBS to submit jobs, you'll have to modify the code a little in the main print statement in the middle. If you're not running this in your /scratch/username/ibd directory, you'll want to change that on line 57 of the script. You'll also want to change the email address on line 38 if you want to receive emails from your scheduler.

[Perl script embedded via GitHub Gist]

After you submit all these jobs, you can very easily run these commands to concatenate the results and clean up the temporary files:

cat data.sub.*genome > results.genome
rm tmp.list*
rm data.sub.*
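
One caveat: each .genome file begins with PLINK's header row, so the plain cat above will leave duplicate header lines scattered through results.genome. A minimal fix that keeps only the first header:

head -n 1 data.sub.1.1.genome > results.genome
awk 'FNR > 1' data.sub.*.genome >> results.genome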

Monday, November 15, 2010

Seminar: A New Measure of Coefficient of Determination for Regression Models

Human Genetics / Biostatistics Associate Professor (and my first statistics teacher) Dr. Chun Li will be giving a talk Wednesday on a new measure of R² for continuous, binary, ordinal, and survival outcomes. Here are the details:

Department of Biostatistics Seminar/Workshop Series

A New Measure of Coefficient of Determination for Regression Models

Chun Li, PhD
Associate Professor, Department of Biostatistics, Vanderbilt University School of Medicine

Wednesday, November 17, 1:30-2:30pm, MRBIII Conference Room 1220
http://biostat.mc.vanderbilt.edu/CLiNov17
Summary: Coefficient of determination is a measure of the goodness of fit of a model. It is best known as R² in ordinary least squares (OLS) for continuous outcomes. However, as a ratio of values on the squared outcome scale, R² is often not intuitive to think about. In addition, extensions of the definition to other outcome types often have unsatisfactory properties. One approach is to define a ratio of two quantities, but often such a definition does not have the underlying decomposition property enjoyed by R². Another approach is to employ the connection of R² with the likelihood ratio statistic in linear regression, where the residuals follow normal distributions; but for discrete outcomes, this results in a value less than one even for a perfect model fit. For regression models, we propose a new measure of coefficient of determination as the correlation coefficient between the observed values and the fitted distributions. As a correlation coefficient, this new measure is intuitive and easy to interpret. It takes into account the variation in fitted distributions and can readily be extended to other outcome types. For OLS, it is numerically the same as R²! We present the new measure for continuous, binary, ordinal, and time-to-event outcomes.
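
For reference, the classical OLS quantity being generalized here is R² = 1 − Σᵢ(yᵢ − ŷᵢ)² / Σᵢ(yᵢ − ȳ)², the proportion of outcome variance explained by the model; the proposed measure instead takes a correlation between observed values and fitted distributions.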

Thursday, November 11, 2010

Split up a GWAS dataset (PED/BED) by Chromosome

As I mentioned in my recap of the ASHG 1000 Genomes tutorial, I'm going to be imputing some of my own data to 1000 Genomes, and I'll try to post lessons learned along the way here under the 1000 genomes and imputation tags.

I'm starting from a binary pedigree format file (PLINK's bed/bim/fam format), and the first step in the 1000 Genomes imputation cookbook is to store your data in Merlin format, one file set per chromosome. Surprisingly, there is no option in PLINK to split a dataset into separate files by chromosome, so I wrote a Perl script to do it myself. The script takes two arguments: (1) the base filename of the binary pedfile (if your files are data.bed, data.bim, and data.fam, the base filename is "data" without the quotes); and (2) a base filename for the output files split by chromosome. You'll need PLINK installed for this to work, and I've only tested it on a Unix machine. You can copy the source code below:

[Perl script embedded via GitHub Gist]
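
For readers who can't see the embed, here's a rough shell sketch of the same idea (an assumption-laden illustration, not the original script: PLINK 1.07 on your path, input base name data, output base name out, autosomes 1-22 plus X coded as 23):

# Split a binary pedfile into one ped/map pair per chromosome
for chr in $(seq 1 23); do
    plink --bfile data --chr $chr --recode --out out.chr$chr
    # Merlin-style .dat file: one "M <snpname>" line per marker
    awk '{print "M", $2}' out.chr$chr.map > out.chr$chr.dat
done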

Tuesday, November 9, 2010

Video and slides from ASHG 1000 Genomes tutorials

If you missed the tutorial on the 1000 Genomes Project data last week at ASHG, you can now watch the tutorials on YouTube and download the slides at http://genome.gov/27542240. Here's a recap of the speakers and topics:

Introduction
Gil McVean, Ph.D.
Professor of Statistical Genetics
University of Oxford

Description of the 1000 Genomes Data
Gabor Marth, D.Sc.
Associate Professor of Biology
Boston College

How to Access the Data
Steve Sherry, Ph.D.
National Center for Biotechnology Information
National Library of Medicine
National Institutes of Health, Bethesda, MD

How to Use the Browser
Paul Flicek, Ph.D.
European Molecular Biology Laboratory
Vertebrate Genomics Team
European Bioinformatics Institute (EBI)

Structural Variants
Jan Korbel, Ph.D.
Group Leader, Genome Biology Research Unit
Joint Appointment with EMBL-EBI
European Molecular Biology Laboratory (Heidelberg, Germany)

How to Use the Data in Disease Studies
Jeffrey Barrett, Ph.D.
Team Leader, Statistical and Computational Genetics
Wellcome Trust Sanger Institute
Hinxton, United Kingdom

Visit http://genome.gov/27542240 for links to all the videos and slides. I found Jeff Barrett's overview of using the 1000 Genomes data for imputation particularly helpful. Also, don't forget Goncalo Abecasis's 1000 Genomes imputation cookbook, which gives more detailed information about formatting, parallelizing code, etc. I'm going to try this myself soon, and I'll post tips along the way.



Friday, November 5, 2010

Keep up with what's happening at ASHG 2010

As this year's ASHG meeting starts to wind down, be sure to check out Variable Genome, where Larry Parnell is summarizing the talks he's attended. Also see the commentary on Genetic Inference by Luke Jostins. The 1000 Genomes tutorial from Wednesday night will be made available on genome.gov soon, and the presidential address, plenary sessions, and distinguished speaker symposium talks were recorded and will also be online soon. You can keep up with what's going on in real time by following the #ASHG2010 tag on Twitter.