Getting Genetics Done

Thursday, January 13, 2011

So long Vanderbilt, and thanks for all the fish!

After finishing the final revisions on my dissertation I was reminded of this spot-on graphical guide to what a Ph.D. is really all about.

Now that I'm finished, I'm leaving Vanderbilt to start a postdoc in genetic epidemiology with Dr. Loic Le Marchand at the University of Hawaii Cancer Center. Posts may be sparse over the next few weeks, but I plan on blogging as usual once I'm set up at my postdoc. Because I won't have the same level of statistical and bioinformatics support in Hawaii that I have now, I'll have much to figure out on my own, so I'll have even more to write about here. But for now, enjoy this illustrated guide to a Ph.D., reproduced with permission from Matt Might, and follow me on Twitter (@genetics_blog).

...

Imagine a circle that contains all of human knowledge:

By the time you finish elementary school, you know a little:

By the time you finish high school, you know a bit more:

With a bachelor's degree, you gain a specialty:

A master's degree deepens that specialty:

Reading research papers takes you to the edge of human knowledge:

Once you're at the boundary, you focus:

You push at the boundary for a few years:

Until one day, the boundary gives way:

And, that dent you've made is called a Ph.D.:

Of course, the world looks different to you now:

So, don't forget the bigger picture:

Keep pushing!

Monday, January 10, 2011

R function for extracting F-test P-value from linear model object

I thought it would be trivial to extract the p-value on the F-test of a linear regression model (testing the null hypothesis R²=0). If I fit the linear model fit<-lm(y~x1+x2), the p-value is printed by summary(fit), but I can't seem to find it stored in names(fit) or names(summary(fit)). However, summary(fit)$fstatistic does give you the F statistic and both degrees of freedom, so I wrote a function to quickly pull out the p-value from this F-test on an lm object, and added it to my R profile. If there's a built-in R function to do this, please comment!
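A minimal sketch of such a function, using only base R, might look like the following. The name lm_f_pvalue is a placeholder chosen for this example, not a built-in; the only pieces it relies on are summary(fit)$fstatistic and pf().

```r
# Hypothetical helper (the name is illustrative, not part of base R):
# return the p-value of the overall F-test for a fitted lm object.
lm_f_pvalue <- function(fit) {
  if (!inherits(fit, "lm")) stop("'fit' must be an object of class 'lm'")
  # Named numeric vector: value (the F statistic), numdf, dendf.
  f <- summary(fit)$fstatistic
  # Models without an F statistic (e.g. intercept-only) return NA.
  if (is.null(f)) return(NA_real_)
  # The upper tail of the F distribution is the p-value.
  pf(f[["value"]], f[["numdf"]], f[["dendf"]], lower.tail = FALSE)
}

# Usage with a built-in dataset
fit <- lm(dist ~ speed, data = cars)
lm_f_pvalue(fit)
```

With a single predictor, as in the usage line, the result should match the p-value of the slope's t-test in summary(fit), which makes for a quick sanity check.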

Thursday, December 16, 2010

Epistasis in New Places

Coming from the lineage of Jason Moore, I am obliged to occasionally remind everyone that biological systems are inherently complex, and to some degree, we should therefore expect statistical models involving those systems to be complex as well.

With the development of GWAS, many approaches to examine epistasis are weighed down by the computational burden of exhaustively conducting billions of statistical tests. With this in mind, several bioinformatics approaches (such as Biofilter and INTERSNP) have focused on looking for gene-gene interactions within biological pathways, ontologies, or protein-protein interaction networks. The assumption underlying these methods is that interactions occur between variants of two different genes – what you could call trans-epistasis.

Considering the epic complexity of the transcription process, the genetics of gene expression seems just as likely to harbor epistasis as biological pathways. Following the excellent work of Barbara Stranger, Jonathan Pritchard, and various other luminaries in this area, Stephen Turner and I examined HapMap genotypes and gene expression levels from corresponding cell lines to look for cis-epistasis.

We found 79 genes where SNP pairs in the gene's regulatory region can interact to influence the gene's expression. What is perhaps most interesting is that there are often large distances between the two interacting SNPs (with minimal LD between them), meaning that most haplotype and sliding window approaches would miss these effects. The full text is available online: "Multivariate analysis of regulatory SNPs: empowering personal genomics by considering cis-epistasis and heterogeneity."
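To make the idea of a SNP-SNP interaction on expression concrete, here is a rough sketch on simulated data. It is not the analysis from the paper: the variable names, sample size, allele frequencies, and effect size are all assumptions made up for illustration, and it simply compares an additive linear model with one that adds an interaction term.

```r
# Simulated data only: this is NOT the HapMap analysis from the paper.
set.seed(1)
n <- 500
# Two hypothetical regulatory SNPs coded as minor-allele counts (0, 1, 2).
snp1 <- rbinom(n, size = 2, prob = 0.3)
snp2 <- rbinom(n, size = 2, prob = 0.3)
# Simulated expression driven by the product of the genotypes
# (a pure interaction effect) plus standard normal noise.
expr <- 1.0 * snp1 * snp2 + rnorm(n)

fit_add <- lm(expr ~ snp1 + snp2)   # main effects only
fit_int <- lm(expr ~ snp1 * snp2)   # main effects plus snp1:snp2

# Nested-model F-test: does adding the interaction term improve the fit?
cmp <- anova(fit_add, fit_int)
p_interaction <- cmp[["Pr(>F)"]][2]
p_interaction
```

Because the two simulated SNPs are drawn independently, there is no LD between them, which mirrors the situation described above where haplotype-based approaches would have little to work with.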

Wednesday, December 15, 2010

Which Reference Management Software do you use? (Reader Poll)

When I started grad school I started using Reference Manager (RefMan), similar to EndNote, to manage my references and bibliographies. It's a real pain, and I often feel like I'm powering my computer with the endless pumping and clicking of the mouse that it takes to import a reference into my library.

Recently I've started using Zotero because of how easy it is to import references, store PDFs, and sync between computers. It also integrates with MS Word and allows you to insert citations and format a bibliography using any of EndNote's styles. And it's free.

Before I make the switch and leave RefMan for good, I would love to see what everyone else here uses to manage references. I know many of you use social bookmarking sites like CiteULike, del.icio.us, FriendFeed and others to save and share literature, but I'm really interested to see what software you use while writing to manage references and format bibliographies, and how satisfied you are with what you use.

Thanks for responding! Check back in a few days and I'll summarize what you all said.

Tuesday, December 14, 2010

Sync your Zotero Library with Dropbox using WebDAV

About a year ago I wrote a post about Dropbox - a free, awesome, cross-platform utility that syncs files across multiple computers and securely backs up your files online. Dropbox is indispensable in my own workflow. I store all my R code, Perl scripts, and working manuscripts in my Dropbox. You can also share folders on your computer with other Dropbox users, which makes coauthoring a paper and sharing manuscript files a trivial task. If you're not using it yet, start now.

I've also been using Zotero for some time now to manage my references. What's nice about Zotero over RefMan, EndNote and others is that it runs inside Firefox, and when you're on a PubMed or journal website, you can save a reference and the PDF to your Zotero library with a single click. Zotero also interfaces with both MS Word and OO.o, and uses all the standard EndNote styles for formatting bibliographies.

You can also sync your Zotero library, including all your references, snapshots of the HTML version of all your articles, and all the PDFs, using the Zotero servers. This syncs your library to every other computer you're using. This is nice when you're away from the office and need to look at a paper, but you're not on your institution's LAN and journal articles are paywalled. The problem with Zotero is a low storage limit - you only get a tiny 100MB of storage space for free. Have any more papers or references you want to sync and you have to pay for it.

That's if you use Zotero's servers. You can also sync your library using your own WebDAV server. Go into Zotero's preferences and you'll see this under the sync pane.

Here's where Dropbox comes in handy. You get 2GB for free when you sign up for Dropbox, and you can add tons more space by referring others, filling out surveys, viewing the help pages, etc. I've bumped my free account up to 19GB. Dropbox doesn't support WebDAV by itself, but a 3rd party service, DropDAV, allows you to do this. Just give DropDAV your Dropbox credentials, and you now have your own WebDAV server at https://dav.dropdav.com. Now simply point Zotero to sync with your own DropDAV server rather than Zotero's servers, and you can sync gigabytes of references and PDFs using your Dropbox.

Why not simply move the location of your Zotero library to a folder in your Dropbox and forget syncing altogether? I did that for a while, but as long as Firefox is open, Zotero holds your library files open, which means they're not syncing properly. If you have instances of Firefox open on more than one machine you're going to run into trouble. Syncing to Dropbox with DropDAV only touches your Dropbox during a Zotero sync operation.

What you'll need:

1. Dropbox. Sign up for a free 2GB Dropbox account. If you use this special referral link, you'll get an extra 250MB for free. Create a folder in your Dropbox called "zotero."

2. DropDAV. Log in here with your Dropbox credentials and you'll have DropDAV up and running.

3. Firefox + Zotero. First, start using Firefox if you haven't already, then install the Zotero extension.

4. Connect Zotero to DropDAV. Go into Zotero's preferences, sync panel. See the screenshot above to set your Zotero library to sync to your Dropbox via WebDAV using DropDAV.

You're done! Now, go out and start saving/syncing gigabytes of papers!

Tuesday, December 7, 2010

Webinar on Revolution R Enterprise

R evangelist David Smith, marketing VP at Revolution R, will be giving a webinar showing off some of the finer features of Revolution R Enterprise - an integrated development environment (IDE) for R that has an enhanced script editor with syntax highlighting, function completion, syntax checking, mouseover help, R code snippets for common tasks, an object browser, a real debugger, and more. Revolution R Enterprise is free for academics. The webinar is tomorrow (Wednesday December 8) at 9am Pacific time (11am CST), and you can register here.

I've been happy using NppToR - a utility that adds syntax highlighting, code folding, and a hotkey to send lines of R code from Notepad++ (hands down the best text editor for Windows) to the R console. You can read more about NppToR on page 62 of the June issue of the R Journal. But it looks like the Revolution R Enterprise IDE has much more to offer. Here's an example of the debugger with breakpoints set.

Webinar - Revolution R Enterprise - 100% R and More

Monday, December 6, 2010

Using the "Divide by 4 Rule" to Interpret Logistic Regression Coefficients

I was recently reading a bit about logistic regression in Gelman and Hill's book on hierarchical/multilevel modeling when I first learned about the "divide by 4 rule" for quickly interpreting coefficients in a logistic regression model in terms of the predicted probabilities of the outcome. The idea is pretty simple. The logistic curve (predicted probabilities) is steepest at the center, where a+βx=0 and logit⁻¹(a+βx)=0.5. See the plot below (or use the R code to plot it yourself).

The slope of this curve (1st derivative of the logistic curve, βe^(a+βx)/(1+e^(a+βx))²) is maximized at a+βx=0, where it takes on the value:

βe⁰/(1+e⁰)²

=β(1)/(1+1)²

=β/4

So you can take the logistic regression coefficients (not including the intercept) and divide them by 4 to get an upper bound of the predictive difference in probability of the outcome y=1 per unit increase in x. This approximation is best at the midpoint of x where predicted probabilities are close to 0.5, which is where most of the data will lie anyhow.

So if your regression coefficient is 0.8, a rough approximation using the β/4 rule is that a 1 unit increase in x results in at most about a 0.8/4=0.2 increase, or 20 percentage points, in the probability of y=1.
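A quick numerical check of the rule can be sketched in a few lines of base R. The intercept of 0 is an assumption made here just to center the curve at x=0, the coefficient 0.8 is the one from the example above, and logistic_slope is a helper name invented for this sketch.

```r
# plogis() in base R is the inverse logit.
a <- 0      # intercept (assumed, so the curve is centered at x = 0)
b <- 0.8    # logistic regression coefficient from the example above

# Slope of the logistic curve at x: b * exp(a + b*x) / (1 + exp(a + b*x))^2
logistic_slope <- function(x, a, b) {
  eta <- a + b * x
  b * exp(eta) / (1 + exp(eta))^2
}

x0 <- -a / b                              # midpoint, where a + b*x = 0
max_slope <- logistic_slope(x0, a, b)     # steepest point, equals b / 4

# Exact change in probability for a 1-unit increase centered on the midpoint.
exact_diff <- plogis(a + b * (x0 + 0.5)) - plogis(a + b * (x0 - 0.5))

c(divide_by_4 = b / 4, max_slope = max_slope, exact_diff = exact_diff)

# Uncomment to draw the curve:
# curve(plogis(a + b * x), from = -10, to = 10, xlab = "x", ylab = "P(y = 1)")
```

The exact difference comes out at about 0.197, just under the 0.2 given by dividing by 4, which is consistent with the rule being an upper bound that is tightest near the midpoint.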

Tuesday, November 30, 2010

Abstract Art with PubMed2Wordle

While preparing for my upcoming defense, I found a cool little web app called pubmed2wordle that turns a PubMed query into a word cloud using text from the titles and abstracts returned by the query. Here are the results for a PubMed query for me ("turner sd AND vanderbilt"):

And quite different results for where I'm planning to do my postdoc:

Looks useful to quickly get a sense of what other people work on.

http://www.pubmed2wordle.appspot.com/

This blog has moved!