Getting Genetics Done

Friday, September 25, 2009

What happens when a consumer genetics company goes bankrupt?

Dan Vorhaus and Lawrence Moore recently put together this excellent three-part series on Genomics Law Report. Headlines about deCODE Genetics on the brink of insolvency and major shifts in the upper management of 23andMe inspired this series of posts on what would happen if a direct-to-consumer (DTC) genomics company declares bankruptcy.

Bankruptcy law authorizes the sale of the assets of a business in bankruptcy, and genomic data is likely the most valuable asset of any DTC genomics company. First, the authors dissect the privacy policy and terms of service for three major DTC companies: 23andMe, deCODE Genetics, and TruGenetics. Next there's a discussion of how the legal system would treat a DTC genomics company's bankruptcy. The series wraps up with a brief discussion of how this ultimately affects the average DTC genomics customer.

Genomics Law Report: What happens if a DTC Genomics Company Goes Belly-Up?

Wednesday, September 23, 2009

JBrowse: a JavaScript Based Genome Browser

Genome browsers are nothing new, but JBrowse is a new JavaScript-based genome browser that uses information from the UCSC genome browser and has the look and feel of Google Maps. It's extremely easy to zoom in and out and scroll around because all the "work" is being done by your computer rather than some server farm thousands of miles away. OpenHelix is calling it a gamechanger, and they have a nice video demonstration showing off some of JBrowse's features. Click the Drosophila or Homo sapiens genome and give JBrowse a spin for yourself!

The JBrowse genome browser

Monday, September 21, 2009

Comparison of plots using Stata, R base, R lattice, and R ggplot2, Part I: Histograms

One of the nicer things about many statistics packages is the extremely granular control you get over your graphical output. But I lack the patience to set dozens of command line flags in R, and I'd rather not power the computer by pumping the mouse trying to set all the clicky-box options in Stata's graphics editor. I want something that just looks nice, using the out-of-the-box defaults. Here's a little comparison of 4 different graphing systems (three using R, and one using Stata) and their default output for plotting a histogram of a continuous variable split over three levels of a categorical variable.

First I'll start with the three graphing systems in R: base, lattice, and ggplot2. If you don't have the last two packages installed, go ahead and download them:

install.packages("ggplot2")

install.packages("lattice")

Now load these two packages, and download this fake dataset I made up containing 100 samples each from three different genotypes ("geno") and a continuous outcome ("trait"):

mydat=read.csv("http://people.vanderbilt.edu/~stephen.turner/ggd/2009-09-21-histodemo.csv",header=TRUE)

library(ggplot2)

library(lattice)
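If the CSV above is unreachable, a simulated stand-in with the same two columns lets the rest of the examples run. This is only a sketch: the genotype means, the standard deviation, the seed, and the choice of a normal distribution are my assumptions, not the values in the original file.

```r
# Stand-in for the demo CSV: 100 samples per genotype, one continuous trait.
# The means (0, 1, 2), unit SD, and seed are arbitrary assumptions.
set.seed(42)
genotypes <- c("aa", "Aa", "AA")
mydat <- data.frame(
  geno  = rep(genotypes, each = 100),
  trait = rnorm(300, mean = rep(c(0, 1, 2), each = 100), sd = 1)
)

# Quick sanity checks on the structure
str(mydat)
table(mydat$geno)
```

The plots you get from this stand-in will not match the ones made from the real file, but the commands behave the same way.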

Now let's get started...

R: base graphics

par(mfrow=c(3,1))

with(subset(mydat,geno=="aa"),hist(trait))

with(subset(mydat,geno=="Aa"),hist(trait))

with(subset(mydat,geno=="AA"),hist(trait))
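One possible way to tighten the base graphics version is to loop over the genotypes instead of writing one line per level. The sketch below also passes breaks=30 to ask for narrower bins and labels each panel with its genotype. The simulated data frame is an assumption standing in for the CSV, and 30 is just a starting point to tune by eye.

```r
# Simulated stand-in data (assumption; the post reads a CSV instead)
set.seed(42)
mydat <- data.frame(
  geno  = rep(c("aa", "Aa", "AA"), each = 100),
  trait = rnorm(300, mean = rep(c(0, 1, 2), each = 100))
)

# One stacked panel per genotype
par(mfrow = c(3, 1))
for (g in c("aa", "Aa", "AA")) {
  # breaks = 30 is only a suggestion to hist(); R rounds it to "pretty" values
  hist(mydat$trait[mydat$geno == g], breaks = 30, main = g, xlab = "trait")
}
```

Note that each panel still picks its own x-axis range, so the three histograms are not directly comparable unless you also set xlim.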

R: lattice

histogram(~trait | factor(geno), data=mydat, layout=c(1,3))
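If the lattice defaults are not to your taste, two arguments to histogram() cover the bin width and the vertical axis: nint sets the number of bins and type switches between "percent", "count", and "density". A minimal sketch, assuming simulated data in place of the CSV and an arbitrary choice of 30 bins:

```r
library(lattice)

# Simulated stand-in data (assumption; the post reads a CSV instead)
set.seed(42)
mydat <- data.frame(
  geno  = rep(c("aa", "Aa", "AA"), each = 100),
  trait = rnorm(300, mean = rep(c(0, 1, 2), each = 100))
)

# Same three stacked panels, with 30 bins and counts on the vertical axis
p <- histogram(~ trait | factor(geno), data = mydat, layout = c(1, 3),
               nint = 30, type = "count")
print(p)
```

Storing the plot in p and printing it explicitly also makes the command work inside functions and loops, where lattice plots are not drawn automatically.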

R: ggplot2

qplot(trait,data=mydat,facets=geno~.)

# Update Tuesday, September 22, 2009
# A commenter mentioned that this code did not work.
# If the above code does not work, try explicitly
# stating that you want a histogram:
qplot(trait,geom="histogram",data=mydat,facets=geno~.)
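Newer ggplot2 releases mark qplot() as deprecated, so if neither command above works for you, the same plot can be built with the full ggplot() syntax. This is a sketch rather than the original code: the simulated data frame stands in for the CSV, and bins=30 is stated explicitly only to silence the message ggplot2 prints when it picks that default itself.

```r
library(ggplot2)

# Simulated stand-in data (assumption; the post reads a CSV instead)
set.seed(42)
mydat <- data.frame(
  geno  = rep(c("aa", "Aa", "AA"), each = 100),
  trait = rnorm(300, mean = rep(c(0, 1, 2), each = 100))
)

# Histogram of trait, one row of facets per genotype
p <- ggplot(mydat, aes(x = trait)) +
  geom_histogram(bins = 30) +
  facet_grid(geno ~ .)
print(p)
```

Unlike the base graphics version, the facets share one x-axis, so the three distributions can be compared directly.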

Stata

insheet using "http://people.vanderbilt.edu/~stephen.turner/ggd/2009-09-21-histodemo.csv", comma clear
histogram trait, by(geno, col(1))

Commentary

In my opinion ggplot2 is the clear winner. Again I'll concede that all of the above graphing systems give you an incredible amount of control of every aspect of the graph, but I'm only looking for what gives me the best out-of-the-box default plot using the shortest command possible.

R's base graphics give you a rather spartan plot, with very wide bins. It also requires 4 lines of code. (If you can shorten this, please comment). By default, the base graphics system gives you counts (frequency) on the vertical axis.

The lattice package in R does a little better perhaps, but the default color scheme is visually less than stellar. Also, I'm not sure why the axis labels switch sides every other plot, and the ticks on top of the plot are probably unnecessary. I still think the bins are too wide. You lose some information especially on the bottom plot towards the right tail. The vertical axis is proportion of total.

Stata's default plot looks very similar to lattice, but again uses a very unattractive color scheme. It uses density for the vertical axis, which may not mean much to non-statisticians.

The default plot made by ggplot2 is just hands-down good-looking. There are no unnecessary lines delimiting the bins, and the binwidth is appropriate. The vertical axis represents counts. The black bars on the light-gray background have a good data-ink ratio. And it required the 2nd shortest command, only 3 characters longer than the Stata equivalent.

I'm ordering the ggplot2 book (Amazon, ~$50), so as I figure out how to do more with ggplot2 I'll post more comparisons like this. If you use SPSS, SAS, MATLAB, or something else, post the code in a comment here and send me a picture or link to the plot and I'll post it here.

Wednesday, September 16, 2009

PCG Journal Club Articles, 9/11

There were only a couple of citations for articles discussed at this week's PCG meeting (September 11). Our next meeting is scheduled for September 26.

~Julia

Kim S, Xing EP. Statistical estimation of correlated genome associations to a quantitative trait network. PLoS Genet. 2009 Aug; 5(8):e1000587.

Zamar D, Tripp B, Ellis G, Daley D. Path: a tool to facilitate pathway-based genetic association analysis. Bioinformatics. 2009 Sep 15; 25(18):2444-6.

R clinic this week: Regression Modeling Strategies in R

At this week's R clinic Frank Harrell will unveil the new rms (Regression Modeling Strategies) package that is a replacement for the R Design package. He will demonstrate the differences with Design, especially related to enhanced graphics for displaying effects in regression models. Frank will also discuss the implementation of quantile regression in rms. The rms package website has links to the manual, examples of graphical output, and printable reference cards for many of the package's commands. It also makes a point that many of rms's graphics capabilities are modular and will play nicely with previously mentioned ggplot2.

To install the rms package, start R and type:

install.packages("rms", dependencies=TRUE)

Then to load it any time thereafter,

library(rms)
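For a first look at what the package does once it is loaded, here is a small sketch of a typical rms workflow: register the predictor distributions with datadist(), fit a model with ols(), and plot the predicted effect with Predict(). The simulated data, the variable names, and the choice of a 4-knot restricted cubic spline are all assumptions made for illustration, not an example from the clinic.

```r
library(rms)

# Simulated data (assumption): a curved relationship between x and y
set.seed(1)
d <- data.frame(x = runif(200, 0, 10))
d$y <- sin(d$x / 2) + rnorm(200, sd = 0.3)

# rms plotting functions expect the predictor distributions to be registered
dd <- datadist(d)
options(datadist = "dd")

# Ordinary least squares with a 4-knot restricted cubic spline on x
fit <- ols(y ~ rcs(x, 4), data = d)
print(fit)
print(anova(fit))

# Predicted y across the range of x, with confidence bands
print(plot(Predict(fit, x)))
```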

The R clinic is held by the Vanderbilt biostatistics department every Thursday 2-3pm and is free to anyone who wants to attend. More information here.

Monday, September 14, 2009

Find the function you're looking for in R

Every R user, no matter their level of experience, has had trouble finding the package or function that does what they want, and then figuring out how to use it. The sos package in R just made that a lot easier.

First, fire up R, then install the sos package (don't omit the quotes):

install.packages("sos")

It'll ask you to choose a mirror. Choose the closest one. After it installs, load the package (omit the quotes this time):

library(sos)

This loaded all the functions that come with the sos package, including a particularly useful one called findFn. It scans the "function" entries in Jonathan Baron's "R site search" database. Give it a try, using "epistasis" with the quotes as the keyword.

findFn("epistasis")

This should open up a web browser that displays relevant functions, the package you need to download (using the above procedure) to use the function, and a link to the help page for that function.

You can also use ??? as an alias for findFn. Try it like this (use the quotes):

???"genome wide"

Once you have the sos package installed, type vignette("sos") for more information on how to use various functions in this package.
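The sos package searches contributed packages online. For functions that are already attached in your session, base R has two quick local lookups that need no extra package or network connection: apropos() matches object names against a regular expression, and find() reports where a name lives. A minimal sketch, using "hist" as an arbitrary search term:

```r
# Names of visible objects matching a regular expression
hits <- apropos("^hist")
print(hits)

# Which attached package provides a given function
where <- find("hist")
print(where)
```

These only see packages that are currently attached, so they complement findFn rather than replace it.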

If you still can't find what you're looking for, check out my previous post on finding help on R, and if all else fails, don't forget about Theresa Scott's free weekly R clinic / Q&A sessions.

Thursday, September 10, 2009

Machine Learning in R

Revolutions blog recently posted a link to R code by Joshua Reich with self-contained examples of using machine learning techniques in R, including various clustering methods (k-means, nearest neighbor, and kernel), recursive partitioning (CART), principal components analysis, linear discriminant analysis, and support vector machines. This post also links to some slides that go over the basics of machine learning. Looks like a good place to start learning about ML before hand-rolling your own code.
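To give a flavor of two of the techniques listed, here is a short sketch of k-means clustering and principal components analysis using only functions that ship with R. This is not Joshua Reich's code. The built-in iris data, the choice of 3 clusters, and the seed are my own assumptions for illustration.

```r
# Four numeric measurements from the built-in iris data
features <- iris[, 1:4]

# k-means with 3 clusters on standardized features;
# nstart = 20 reruns from several random starts and keeps the best fit
set.seed(1)
km <- kmeans(scale(features), centers = 3, nstart = 20)

# Compare the clusters found with the known species labels
print(table(cluster = km$cluster, species = iris$Species))

# Principal components analysis on the same standardized features
pc <- prcomp(features, scale. = TRUE)
print(summary(pc))

# First two components, colored by k-means cluster
plot(pc$x[, 1:2], col = km$cluster, pch = 19,
     main = "iris: PCA scores colored by k-means cluster")
```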

Be sure to check out one of Will's previous posts on hierarchical clustering in R.

Revolutions: Machine learning in R, in a nutshell

Wednesday, September 9, 2009

Sync your home directories on ACCRE and the local Linux servers (a.k.a. "the cheeses")

Vanderbilt ACCRE users with PCs only...

If you use ACCRE to run multi-processor jobs you'll be glad to know that they now allow you to map your home directory to your local desktop using Samba (so you can access your files through My Computer as you normally would with local files). Just submit a help request on their website and they'll get you set up.

Now if you have both your ACCRE home and your home on the cheeses mapped, you can easily sync the files between the two. Download Microsoft's free SyncToy to do the job. It's pretty dead simple to set up, and one click will synchronize files between the two servers.

I didn't want to synchronize everything, so I set it up to only sync directories that contain perl scripts and other programs that I commonly use on both machines. SyncToy also seems pretty useful for backing up your files.

Microsoft SyncToy

Ask ACCRE to let you map your home

This blog has moved!