Tuesday, May 25, 2010

Use SQL queries to manipulate data frames in R with the sqldf package

I've covered a few topics in the past including the plyr package, which is kind of like "GROUP BY" for R, and the merge function for merging datasets. I only recently found the sqldf package for R, and it's already one of the most useful packages I've ever installed. The main function in the package is sqldf(), which takes a quoted string containing an SQL statement as its argument. It lets you treat data frames as if they were tables in a relational database. You can use some of the finer aspects of SQL like the INNER JOIN or the subquery, which can be cumbersome to mimic using standard R programming. While this isn't an SQL tutorial, try out some of these commands to see what sqldf can do for you. Read more about the sqldf package here.


> # install the package
> install.packages("sqldf")
> 
> #load it
> library(sqldf)
> 
> # set the random seed
> set.seed(42)
> 
> #generate some data
> df1 = data.frame(id=1:10,class=rep(c("case","ctrl"),5))
> df2 = data.frame(id=1:10,cov=round(runif(10)*10,1))
> 
> #look at the data
> df1
   id class
1   1  case
2   2  ctrl
3   3  case
4   4  ctrl
5   5  case
6   6  ctrl
7   7  case
8   8  ctrl
9   9  case
10 10  ctrl
> df2
   id cov
1   1 9.1
2   2 9.4
3   3 2.9
4   4 8.3
5   5 6.4
6   6 5.2
7   7 7.4
8   8 1.3
9   9 6.6
10 10 7.1
> 
> # do an inner join
> sqldf("select * from df1 join df2 on df1.id=df2.id")
   id class id cov
1   1  case  1 9.1
2   2  ctrl  2 9.4
3   3  case  3 2.9
4   4  ctrl  4 8.3
5   5  case  5 6.4
6   6  ctrl  6 5.2
7   7  case  7 7.4
8   8  ctrl  8 1.3
9   9  case  9 6.6
10 10  ctrl 10 7.1
> 
> # where clauses
> sqldf("select * from df1 join df2 on df1.id=df2.id where class='case'")
  id class id cov
1  1  case  1 9.1
2  3  case  3 2.9
3  5  case  5 6.4
4  7  case  7 7.4
5  9  case  9 6.6
> 
> # lots of sql fun
> sqldf("select df1.id, df2.cov as covariate from df1 join df2 on df1.id=df2.id where class='case' and cov>3 order by cov")
  id covariate
1  5       6.4
2  9       6.6
3  7       7.4
4  1       9.1
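For comparison, here is a rough sketch of how the last query could be written in base R with merge() and ordinary subsetting. It assumes the same df1 and df2 as above, with the cov values typed in directly rather than regenerated from the random seed. It is only meant to show what the single SQL statement replaces, not to suggest one approach is better than the other.

```r
# Same data as in the session above; cov values copied from the printed df2
df1 <- data.frame(id = 1:10, class = rep(c("case", "ctrl"), 5))
df2 <- data.frame(id = 1:10,
                  cov = c(9.1, 9.4, 2.9, 8.3, 6.4, 5.2, 7.4, 1.3, 6.6, 7.1))

# Inner join on id (unlike "select *", merge keeps a single id column)
joined <- merge(df1, df2, by = "id")

# Equivalent of: where class='case' and cov>3
cases <- joined[joined$class == "case" & joined$cov > 3, c("id", "cov")]

# Equivalent of: order by cov, with cov renamed to covariate
cases <- cases[order(cases$cov), ]
names(cases)[names(cases) == "cov"] <- "covariate"
cases
```

The join, filter, sort, and rename each take their own step here, whereas the sqldf version states them all in one query string.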

Thursday, May 20, 2010

Tutorial: Principal Components Analysis (PCA) in R

Found this tutorial by Emily Mankin on how to do principal components analysis (PCA) using R. It has a nice example with R code and several good references. The example starts by doing the PCA manually, then uses R's built-in prcomp() function to do the same PCA.

Principal Components Analysis: A How-To Manual for R
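To give a feel for the manual-then-prcomp() approach, here is a minimal sketch. It is not taken from the tutorial: the built-in iris data and the variable names are my own choices for illustration. The manual route centers the columns and eigendecomposes the covariance matrix, and the result is then compared against prcomp().

```r
# Built-in dataset used purely for illustration (not from the tutorial)
X <- as.matrix(iris[, 1:4])

# Manual PCA: center the columns, then eigendecompose the covariance matrix
Xc <- scale(X, center = TRUE, scale = FALSE)
eig <- eigen(cov(Xc))
scores_manual <- Xc %*% eig$vectors

# The same analysis with prcomp()
pca <- prcomp(X, center = TRUE, scale. = FALSE)

# Component variances should agree with the eigenvalues
all.equal(eig$values, pca$sdev^2)

# Scores should agree up to the sign of each component
all.equal(abs(unname(scores_manual)), abs(unname(pca$x)))
```

The sign of each component is arbitrary, which is why the scores are compared on absolute values.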

Friday, May 14, 2010

LocusZoom v1.1 - Create Regional Plots of GWAS Results

Previously mentioned LocusZoom has undergone some major updates over the last few months. Many of the bugs mentioned in my previous post are now fixed, and now there's a good bit of documentation available. There are also a few new features, including the ability to add an extra column to your results file to change the plotting symbol to reflect your own custom annotation (e.g. whether the SNP was imputed or genotyped, or the SNP's function).


This software is seriously useful for plotting regional association results with a level of detail and annotation that you can't achieve using regular Manhattan plots. Go give it a try, now. Also keep an eye out for a downloadable version that should be available in the next week or so.

LocusZoom: Create Regional Plots of GWAS Results

And as a suggestion to the developers: how about a radio button on the web app that would allow you to accept files in PLINK's .assoc/.qassoc format, so PLINK users wouldn't have to go through the awkward text-wrangling to get their results files into an acceptable format (instructions specific for LocusZoom), while also making "P" and "SNP" the default column names for these values?

Thursday, May 13, 2010

Using R, LaTeX, and Sweave for Reproducible Research: Handouts, Templates, & Other Resources

Several readers emailed me or left a comment on my previous announcement of Frank Harrell's workshop on using Sweave for reproducible research asking if we could record the seminar. Unfortunately, we couldn't record audio or video, but take a look at the Sweave/LaTeX page on the Biostatistics Dept Wiki. There you can find Frank's slideshow and the handout from today (a PDF statistical report with all the LaTeX/R code necessary to produce it). While this was more of an advanced Sweave/LaTeX workshop, you can also find an introduction to LaTeX and an introduction to reproducible research using R, LaTeX, and Sweave, both by Theresa Scott.

In addition to lots of other helpful tips, you'll also find the following resources to help you learn to use both Sweave and LaTeX:

Wednesday, May 12, 2010

We're Hiring a Postdoc

Our lab is looking for a postdoc. See the ad here, reproduced below.

POST-DOCTORAL POSITION
VANDERBILT CENTER FOR HUMAN GENETICS RESEARCH
PROGRAM IN COMPUTATIONAL GENOMICS 


The Program in Computational Genomics in the CHGR at Vanderbilt University has an immediate opening for a post-doctoral fellow to pursue new and exciting research in human genetics. The successful candidate will have a Ph.D. degree (or equivalent) in genetics, human genetics, epidemiology, computational biology, bioinformatics, biostatistics, or related field. The successful candidate will work as part of an established research team and will have access to several large genome-wide association study (GWAS) datasets and numerous follow-up studies for association and copy number variation. Both established and evolving methods to detect and characterize single and multi-locus effects will be applied, and rich phenotypic data will permit analysis of discrete and quantitative traits. The candidate will integrate data from linkage, association, CNV, and re-sequencing studies along with knowledge of gene networks to identify susceptibility genes. He/She will also have the opportunity to conduct research in methods development in the study of gene-gene and gene-environment interactions for complex disease.  In addition, the candidate will have the opportunity to interact with numerous senior investigators in multiple fields.

The CHGR is an interdisciplinary center with over 40 faculty representing numerous clinical and basic science departments.  It has a highly interactive research program organized into three thematic programs:  Disease Gene Discovery, Computational Genomics, and Translational Genetics.  The CHGR has substantial core facilities for family and patient ascertainment; DNA banking, genotyping, and sequencing; and computational genomics, data management, and data analysis.  It occupies over 14,000 sf of newly appointed wet and dry lab space.  The CHGR faculty and staff enjoy the substantial benefits of the collaborative Vanderbilt atmosphere.  More information about the specific CHGR post-doctoral positions can be found at:  http://chgr.mc.vanderbilt.edu/chgr-careers/postdoc.

Interested candidates should forward their C.V., a description of their research interests (preferably by email), and three letters of reference by June 30th, 2010 to:

Dr. Marylyn Ritchie, PhD
c/o Maria Ritchie
Center for Human Genetics Research
Vanderbilt University
519 Light Hall
Nashville, TN  37232-0700
Email:  mari...@chgr.mc.vanderbilt.edu
Tel:  615-322-7909
Fax: 615-343-8619

Tuesday, May 11, 2010

Sweave for Reproducible Research and Beautiful Statistical Reports

Frank Harrell, chair of the Biostatistics department here at Vanderbilt, is giving a seminar entitled "Sweave for Reproducible Research and Beautiful Statistical Reports" tomorrow, Wednesday, May 12, 1:30-2:30pm, in the MRBIII Conference Room 1220. This tutorial covers the basics of Sweave and shows how to enhance the default output in various ways by using: latex methods for converting R objects to LaTeX markup, your own floating figure environments, the LaTeX listings package to pretty-print R code and its output.

I'll post any handouts after the workshop.

Here's the description from the seminar website:

Much of research that uses data analysis is not easily reproducible. This can be for a variety of reasons related to tweaking of instrumentation, the use of poorly studied high-dimensional feature selection algorithms, programming errors, lack of adequate documentation of what was done, too much copy and paste of results into manuscripts, and the use of spreadsheets and other interactive data manipulation and analysis tools that do not provide a usable audit trail of how results were obtained. Even when a research journal allows the authors the "luxury" of having space to describe their methods, such text can never be specific enough for readers to exactly reproduce what was done. All too often, the authors themselves are not able to reproduce their own results. Being able to reproduce an entire report or manuscript by issuing a single operating system command when any element of the data change, the statistical computing system is updated, graphics engines are improved, or the approach to analysis is improved, is also a major time saver.

It has been said that the analysis code provides the ultimate documentation of the "what, when, and how" for data analyses. Eminent computer scientist Donald Knuth invented literate programming in 1984 to provide programmers with the ability to mix code with documentation in the same file, with "pretty printing" customized to each. Lamport's LaTeX, an offshoot of Knuth's TeX typesetting system, became a prime tool for printing beautiful program documentation and manuals. When Friedrich Leisch developed Sweave in 2002, Knuth's literate programming model exploded onto the statistical computing scene with a highly functional and easy to use coding standard using R and LaTeX and for which the Emacs text editor has special dual editing modes using ESS. This approach has now been extended to other computing systems and to word processors. Using R with LaTeX to construct reproducible statistical reports remains the most flexible approach and yields the most beautiful reports, while using only free software. One of the advantages of this platform is that there are many high-level R functions for producing LaTeX markup code directly, and the output of these functions is easily directed to the LaTeX output stream created by Sweave.

This tutorial covers the basics of Sweave and shows how to enhance the default output in various ways by using: latex methods for converting R objects to LaTeX markup, your own floating figure environments, the LaTeX listings package to pretty-print R code and its output.

These methods apply to everyday statistical reports and to the production of 'live' journal articles and books.
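If you haven't seen Sweave before, here is a minimal sketch of the basic workflow, not anything from Frank's seminar materials. The file name demo.Rnw and the contents of the chunk are made up for illustration. The R code writes a tiny Sweave source file to a temporary directory and weaves it into a .tex file, which you would then compile with LaTeX.

```r
# A tiny Sweave source: LaTeX text with one R code chunk and one \Sexpr{}
# (file name and chunk contents are hypothetical, for illustration only)
rnw <- c(
  "\\documentclass{article}",
  "\\begin{document}",
  "<<summary, echo=TRUE>>=",
  "x <- c(2, 4, 6)",
  "mean(x)",
  "@",
  "The mean is \\Sexpr{mean(x)}.",
  "\\end{document}"
)

# Write the source to a temporary directory and weave it there
rnw_file <- file.path(tempdir(), "demo.Rnw")
writeLines(rnw, rnw_file)
old_wd <- setwd(tempdir())
Sweave(rnw_file, quiet = TRUE)
setwd(old_wd)

# Sweave() writes demo.tex next to the source; the chunk's code and output
# are now LaTeX markup, and \Sexpr{} has been replaced by its value
tex_file <- file.path(tempdir(), "demo.tex")
cat(readLines(tex_file), sep = "\n")
```

Because the numbers in the report come from running the code, re-weaving after the data or the analysis changes regenerates the whole document.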

R Package 'rms' for Regression Modeling

If you attended Frank Harrell's Regression Modeling Strategies course a few weeks ago, you got a chance to see the rms package for R in action. Frank's rms package does regression modeling, testing, estimation, validation, graphics, prediction, and typesetting by storing enhanced model design attributes in the fit. rms is a re-written version of the Design package that has improved graphics and duplicates very little code in the survival package.

First install the rms package:
install.packages("rms")

Next, load the package:
library(rms)

Finally, run this command to get very extensive documentation of many of the features in the rms package:
?rmsOverview

You can also get a walk-through consisting of several example uses of functions in the rms package for modeling and graphics.
example(rmsOverview)

Thursday, May 6, 2010

Mixed linear model approach adapted for genome-wide association studies

A few weeks ago I covered an R package for efficient mixed model regression that is capable of simultaneously accounting for both population stratification and relatedness to compute unbiased estimates of standard errors and p-values for genetic association studies. Fitting linear mixed effects models at GWAS scale can be very time consuming, however, and another group recently reported a method that fits a mixed linear model very efficiently by clustering individuals into groups and eliminating the need to recompute variance components. They showed that using their modifications, they were able to reduce computation time by more than 800-fold over SAS proc mixed / SAS proc cluster. Check out the paper for more details.

Nature Genetics: Mixed linear model approach adapted for genome-wide association studies

Abstract: Mixed linear model (MLM) methods have proven useful in controlling for population structure and relatedness within genome-wide association studies. However, MLM-based methods can be computationally challenging for large datasets. We report a compression approach, called 'compressed MLM', that decreases the effective sample size of such datasets by clustering individuals into groups. We also present a complementary approach, 'population parameters previously determined' (P3D), that eliminates the need to re-compute variance components. We applied these two methods both independently and combined in selected genetic association datasets from human, dog and maize. The joint implementation of these two methods markedly reduced computing time and either maintained or improved statistical power. We used simulations to demonstrate the usefulness in controlling for substructure in genetic association datasets for a range of species and genetic architectures. We have made these methods available within an implementation of the software program TASSEL.

Monday, May 3, 2010

Introduction to single molecule real time (SMRT) sequencing from Pac Bio's Curtis Fideler

Definitely a seminar not to miss: Curtis Fideler, Director of Sales at Pacific Biosciences, will be giving a special seminar here at Vanderbilt Thursday, May 20, 11:00a-noon in 202 Light Hall entitled "An Introduction to SMRT Sequencing: A description of Pacific Biosciences single molecule real time sequencing technology."

Science: Real-Time DNA Sequencing from Single Polymerase Molecules

Wikipedia: Single Molecule Real Time Sequencing

Pacific Biosciences: SMRT Overview
Getting Genetics Done by Stephen Turner is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported License.