Wednesday, December 30, 2009

Use plyr instead of _apply() in R

I've covered plyr once before, showing you how to get means and variances for two quantitative traits across multilocus genotypes. JD Long over at Cerebral Mastication recently posted a nice screencast illustrating how plyr "just works" as an alternative to R's family of apply commands.  There's a set of R functions (apply, sapply, lapply, tapply, eapply, and rapply) that can apply a command or function to your data and return a hopefully useful result.  However, for the non-programmers among us, choosing which apply function to use and how to use it can be mind-bogglingly confusing.  I've never gotten one of these functions to work as I wanted it to the first time around, and I often end up writing loops where the vectorized operation would be much faster.

Enter plyr.

As I mentioned previously, the plyr functions (ddply in particular) are intuitive and usually return the result you wanted. The ddply function splits your dataset on one or more grouping variables, applies a function or statistic to each piece, and combines the results into a single data frame.
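Here's a minimal sketch of that split-apply-combine pattern. It assumes the plyr package is installed and uses R's built-in iris data as a stand-in for the genotype data from the earlier post; the column names mean_sl and var_sl are made up for illustration.

```r
# Assumes plyr is installed: install.packages("plyr")
library(plyr)

# Split iris by Species, compute the mean and variance of one trait
# within each group, and return the results as a data frame
trait_summary <- ddply(iris, .(Species), summarise,
                       mean_sl = mean(Sepal.Length),
                       var_sl  = var(Sepal.Length))
print(trait_summary)

# For comparison, the base R route with tapply() returns a named vector
# for one statistic at a time rather than a data frame
base_means <- tapply(iris$Sepal.Length, iris$Species, mean)
print(base_means)
```

To group by more than one variable, list them all inside .(), for example .(var1, var2).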

Here's JD Long's screencast showing how plyr makes a task like this easy where the apply function fails.



Cerebral Mastication - Struggling with apply() in R

Monday, December 28, 2009

Capture system commands as R objects with system(..., intern=T)

Just discovered this very handy R command to capture the output from a system command as an R object.  I wanted to use R to read in the output from another program (PLINK) and do some processing on each output file. Of course if the files are named sequentially (plink1.out, plink2.out, plink3.out, etc.) this would be simple with a for loop.  This wasn't the case for me, but I could still list all the files I wanted to process with some pattern matching in Unix using wildcards.  All the files had "_hdl" in the name, and ended with ".qassoc".  Here's where the system() function is helpful.

This command issues "ls *_hdl*.qassoc" to the system, just as if you were typing it in at the terminal.  The intern=TRUE argument tells system() to return the command's output as an R object (a character vector with one element per line of output) instead of just printing it.  I can store this in myfiles, a vector containing the names of all the files I want to process.

myfiles = system("ls *_hdl*.qassoc", intern=TRUE)

myfiles
 [1] "0-all.01_hdl_modeled.qassoc"
 [2] "0-all.02_hdl_modeled_polyresid.qassoc"
 [3] "0-all.03_hdl_med.qassoc"
 [4] "0-all.04_hdl_med_polyresid.qassoc"
 [5] "0-all.05_hdl_med_smokage_polyresid.qassoc"
 [6] "1-male.01_hdl_modeled.qassoc"
 [7] "1-male.02_hdl_modeled_polyresid.qassoc"
 [8] "1-male.03_hdl_med.qassoc"
 [9] "1-male.04_hdl_med_polyresid.qassoc"
[10] "1-male.05_hdl_med_smokage_polyresid.qassoc"
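From there, processing each file is a matter of looping over the vector. Below is a rough sketch under some assumptions: it builds a few tiny stand-in files with made-up values in a temporary directory (real .qassoc files have more columns), and it uses list.files() with glob2rx() as a shell-free way to do the same wildcard match, which should also work on Windows.

```r
# Scratch directory with tiny stand-in files (made-up values, not real PLINK output)
tmp <- tempfile("qassoc_demo")
dir.create(tmp)
fake <- data.frame(CHR = c(1, 1), SNP = c("rs1", "rs2"), P = c(0.5, 0.01))
for (f in c("0-all.01_hdl_modeled.qassoc",
            "1-male.03_hdl_med.qassoc",
            "0-all.01_ldl_modeled.qassoc")) {   # the ldl file should NOT match
  write.table(fake, file.path(tmp, f), row.names = FALSE, quote = FALSE)
}

# glob2rx() converts the shell wildcard into a regular expression,
# so list.files() returns the same files that ls would
myfiles <- list.files(tmp, pattern = glob2rx("*_hdl*.qassoc"), full.names = TRUE)

# Read every matching file into a list of data frames, named by file
results <- lapply(myfiles, read.table, header = TRUE, stringsAsFactors = FALSE)
names(results) <- basename(myfiles)

# Example of per-file processing: smallest p-value in each file
min_p <- sapply(results, function(d) min(d$P))
print(min_p)
```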

Tuesday, December 22, 2009

Sync files across multiple computers with Dropbox (PC, Mac & Linux too!)

Do you ever find yourself switching back and forth between your work computer, your laptop, and your home computer?  This happens to me all the time when I'm writing.  Rather than carry all your files on a USB stick and risk losing it or corrupting your data, give Dropbox a try.  It's dead simple, and works for PC, Mac, and Linux too.

Once you sign up and install Dropbox on all your computers, you'll have a special folder: anything you save there on one computer is automatically copied to, and kept synchronized in, the same folder on all your other computers.  What's more, if you use someone else's computer, you can access all your files through a web interface because they're all securely backed up online.  I've been using this for a while now to sync all the papers I'm working on, RefMan/EndNote databases, config files, and R functions I reuse all the time.  You can also create "public" folders.  Put something there, and you can get a direct link to the file online to share with other folks. For example, here's a link to some R code I wrote to use ggplot2 to make manhattan plots and QQ-plots for every PLINK output file in the current directory (I'm hoping to clean this code up and include it with some other functions I've written in a package on CRAN soon).

I can't recommend this little app enough. If you're still not convinced, check out this short video that explains what Dropbox is all about and shows off just how simple it is to use.

You get a whopping 2GB for free, but if you use the registration link provided below, you'll get an extra free 1/4GB.  Happy holidays from GGD, and I'll catch up with you all next week!

Dropbox - Secure online backup and synchronization

Thursday, December 17, 2009

Review: The challenges of sequencing by synthesis

A tip of the hat to a commenter on my previous coverage of a next-gen sequencing paper for pointing out this detailed and perhaps more technically-oriented review on sequencing by synthesis recently published in Nature Biotechnology.  Thanks, Clive.

Review: The challenges of sequencing by synthesis (Nature Biotechnology)

Wednesday, December 16, 2009

Recent improvements to Pubget

If you've never heard of it before, check out my previous coverage on Pubget. It's like PubMed, but you get the PDFs right away.  Pubget has recently implemented a number of improvements.

1. Citation matching.  Pubget's citation matcher seems to work better than PubMed's most of the time.  Try going to Pubget and pasting any of these random citations into the search bar:

J Biol Chem 277: 30738-30745
Nucleic Acids Res 2004;32:4812-20.
Evol. Biol. 7, 214 (2007).


2. The PaperPlane bookmarklet.  Go here and drag the link to your bookmark toolbar.  Now, if you're searching from PubMed, click the bookmarklet for one-click access to the PDF.

3. If you have a long list of PMIDs, separate them with commas and you can paste them directly into the search bar.
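
If your PMIDs are already sitting in an R vector, a quick way to build that comma-separated string is paste() with the collapse argument. The PMIDs below are placeholders, not real articles.

```r
# Placeholder PMIDs for illustration only
pmids <- c("11111111", "22222222", "33333333")

# Collapse the vector into one comma-separated string for the search bar
query <- paste(pmids, collapse = ",")
cat(query, "\n")
```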

Pubget (Vanderbilt institutional link)

Pubget (If you're anywhere else)

Tuesday, December 15, 2009

Seminar announcement: A Multivariate Methodology for Analyzing Genome-wide Association Studies

This looks interesting.

Department of Biostatistics Seminar/Workshop Series: A Multivariate Methodology for Analyzing Genome-wide Association Studies, by Janice Brodsky, PhD, UCLA.

Wednesday, December 16, 1:30-2:30pm, MRB III Conference Room 1220

Intended Audience: Persons interested in applied statistics, statistical theory, epidemiology, health services research, clinical trials methodology, statistical computing, statistical graphics, R users or potential users

In the last few years, high-dimensional genome-wide association (GWA) studies have become a common tool in genetics for investigating which genes are associated with physical traits. However, the results of many GWA studies have fewer genes than expected or even no genes at all. This does not necessarily indicate that there are no genetic associations in the data: genes with weaker associations or which only work in groups will be missed with the standard GWA statistical analysis. We present a multivariate methodology for analyzing GWA data which is designed to handle weaker signals, dependent data, and multicollinearity. We applied this method to a large GWA study, and the results were consistent with previously performed studies. We also discuss extensions of the methodology.

Follow GGD on Twitter @genetics_blog


GGD is now on Twitter! I'll be linking to all of our posts on the Twitter page, and occasionally post something there that may not make its way into a full length post here on the blog. You can follow us on Twitter here @genetics_blog.

Browse R Graphics with the R Graph Gallery and the R Graphical Manual

One of R's biggest strengths is its unparalleled graphing capabilities.  Just see any of our previous posts on ggplot2, visualization, or other posts tagged with R. R has several fundamentally different systems for plotting, including base graphics, lattice, and ggplot2.  Furthermore, many add-on packages come with their own functions for producing problem-domain specific graphics. For example, see GenABEL, a very nice R package for GWAS analysis, which has functions for producing manhattan plots, LD plots, etc.
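
To get a feel for how those systems differ, here's a small sketch that draws the same scatterplot with each one, using the built-in mtcars data. It writes to a PDF in a temporary directory so it runs anywhere; lattice ships with R, and the ggplot2 page is skipped if that package isn't installed.

```r
library(lattice)

# Write all plots to one PDF file (one plot per page)
out <- file.path(tempdir(), "plotting_systems.pdf")
pdf(out)

# 1. Base graphics: draws directly on the device
plot(mtcars$wt, mtcars$mpg, xlab = "wt", ylab = "mpg", main = "base")

# 2. lattice: builds a plot object that must be printed
print(xyplot(mpg ~ wt, data = mtcars, main = "lattice"))

# 3. ggplot2: also builds an object; only drawn if the package is available
if (requireNamespace("ggplot2", quietly = TRUE)) {
  p <- ggplot2::ggplot(mtcars, ggplot2::aes(x = wt, y = mpg)) +
    ggplot2::geom_point() +
    ggplot2::ggtitle("ggplot2")
  print(p)
}

invisible(dev.off())
message("Plots written to ", out)
```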

Now let's say you've seen a certain graphic before, and you want to find the package you need to download and which function you should use to make the plot.  That's where the R Graph Gallery and the R Graphical Manual can become very useful.  Both sites give you thumbnail previews of graphics produced by functions bundled with certain R packages, code for producing the graphic, and which R packages you need to download for the functions used to create the graphic.  The R Graphical Manual is much more comprehensive, and is categorized based on CRAN Task Views (CTV) categories (check out all 29 pages of graphics in the Genetics task view).

R Graphical Manual

R Graph Gallery

Monday, December 14, 2009

Sequencing technologies — the next generation



Following up on last week's coverage of the Genotyping Portal, check out this new review article on next-generation sequencing in Nature Reviews Genetics.  One major focus of this paper is that the next generation of sequencing platforms each use fundamentally different technologies.  Because of this, it's likely that multiple platforms will coexist in the marketplace, and different platforms will have clear advantages over others for particular biological applications.  The paper has some nice figures illustrating how the technology works in sequencing by reversible terminators used by Illumina/Solexa and Helicos BioSciences, emulsion PCR used by Life/APG's SOLiD ligation platform and the Roche/454 Pyrosequencing system, and the highly-anticipated real-time single-molecule sequencing from Pacific Biosciences.  There's also a table giving the pros, cons, biological applications, cost, read length, run time, and references for each of the next-gen sequencing platforms.  Finally, a revealing piece of information in the last table, which shows sequencing statistics on personal genomes: the sequencing of Stephen Quake's genome with Helicos a few months ago cost only $48,000, a decrease of more than three orders of magnitude compared to the sequencing of J. Craig Venter's genome (Sanger), which cost an estimated $70,000,000 just a few years ago.

The author of the paper, Michael L. Metzker, is an associate professor of genetics at Baylor College of Medicine, a senior manager at the Human Genome Sequencing Center at Baylor, and President & CEO of LaserGen, Inc., Houston, TX.

Sequencing Technologies - The Next Generation (NRG AOP)

Tuesday, December 8, 2009

Genotyping Portal: A comprehensive (and freely available) online resource about methods for DNA genotyping, screening and sequencing



Diego Forero has compiled a comprehensive list of primary publications on commonly used SNP genotyping and DNA sequencing technologies (including SNP arrays, Sequenom, TaqMan, Pyrosequencing, Molecular Beacons, FP-TDI, Invader, xMAP, SNaPshot, SNPlex, Sanger, 454, Illumina, Helicos, SOLiD, Complete Genomics, Bisulfite sequencing, and others).  Also included here are links to review articles, protocols, and links to manufacturers of reagents and equipment.  Where available, links are included to open access versions of the papers on PubMed Central.

This is an excellent resource for anyone who is generally interested in how these technologies work.  For 2nd year grad students at Vanderbilt, you will be asked about some of these technologies on your qualifying exam!

Genotyping Portal: A comprehensive (and freely available) online resource about methods for DNA genotyping, screening and sequencing

Monday, December 7, 2009

Use PuTTY and XMing to see Linux graphics via SSH on your Windows computer

Do you use SSH to connect to a remote Linux machine from your local Windows computer?  Ever needed to run a program on that Linux machine that displays graphical output, or uses a GUI? I was in this position last week trying to make figures using ggplot2 in R of results from an analysis of GWAS data, which required using a 64-bit Linux machine with more RAM than my 32-bit Windows machine can see.

If you try plotting something in R on a Linux machine in an SSH session, you'll get this nasty error message:

Error in function (display = "", width, height, pointsize, gamma, bg,: 
X11 I/O error while opening X11 connection to 'localhost:10.0'

Turns out there's a very easy way to see graphical output over your SSH terminal.  First, if you're not already using PuTTY for SSH, download putty.exe from here.  Next, download, install, and run Xming.  While Xming is running in your system tray, log into the Linux server as you normally would using PuTTY.  Then type this command at the terminal to connect to the Linux server of your choice (here, pepperjack), using the -X option (uppercase) to enable X11 forwarding.

ssh -X pepperjack.mc.vanderbilt.edu

If all goes well, you should now be able to use programs with graphical output or interfaces that run on the remote Linux machine rather than your local Windows computer.
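
If you want to check from inside R whether forwarding is actually active, here's a rough sketch. It assumes that ssh -X sets the DISPLAY environment variable on the remote side, and it falls back to writing the plot to a PDF file, which needs no X11 connection at all. The file name is arbitrary.

```r
# DISPLAY is normally set by ssh when X11 forwarding is on
display <- Sys.getenv("DISPLAY")

# Only ask R about X11 support if a display is set at all
x11_ok <- nzchar(display) && capabilities("X11")

if (x11_ok) {
  message("X11 looks available (DISPLAY = ", display,
          "); plot windows should open through Xming.")
} else {
  message("No usable X11 display found; write plots to a file instead.")
}

# A file-based device works either way and sidesteps the X11 error
out <- file.path(tempdir(), "test_plot.pdf")
pdf(out)
plot(1:10, main = "Test plot")
invisible(dev.off())
message("Plot written to ", out)
```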




Xming - PC X Server

Xming download link on SourceForge

Tuesday, December 1, 2009

Get Started with Machine Learning in R

A Beautiful WWW put together a great set of resources for getting started with machine learning in R.  First, they recommend the previously mentioned free book, The Elements of Statistical Learning.  Then there's a link to a list of dozens of machine learning and statistical learning packages for R.  Next, you'll need data.  Hundreds of free real datasets are available at the UCI machine learning repository.  Each dataset, such as this breast cancer dataset from Wisconsin, has its own page giving a summary, links to publications of major findings, and detailed descriptions of the variables in the data.  If you want to simulate genetic data, check out our software, genomeSIMLA, which can simulate gene-gene interactions in case-control and family-based GWAS-sized datasets with realistic patterns of linkage disequilibrium.  If you're interested, check out the genomeSIMLA paper.  Finally, if time is not an issue, consider taking MIT's OpenCourseWare Machine Learning course.  Alternatively, check out the lectures by Stanford Engineering professor Andrew Ng, all of which are available on YouTube.  Here's the first lecture.


For more, check out the link below.

A beautiful WWW: Guide to Getting Started in Machine Learning
Getting Genetics Done by Stephen Turner is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported License.