Wednesday, August 31, 2011

Personal Genomics and Data Sharing Survey

I was recently contacted by a few German biologists working on a project evaluating opinions on sharing raw data from DTC genetic testing companies like 23andMe. A handful of people like the gang at Genomes Unzipped, the PGP-10, and others at SNPedia have released their own genotype or sequencing data into the public domain. As of now, data like this is scattered around the web and most of it is not attached to any phenotype data.

These three biologists are working on a website that collects genetic data as well as phenotypic data. The hope is to make it easy to find and access appropriate data and to become a resource for a kind of open-source GWAS - similar to the research 23andMe performs in its walled garden right now.

But because of privacy concerns, many people (myself included) hesitate to freely publish their genetic data for the world to see. These three biologists are conducting a survey to assess how willing people might be to participate in something like this, and for what reasons they would (or would not). The survey can be accessed at http://bit.ly/genotyping_survey. It took about 2 minutes for me to complete, and you can optionally sign up to receive an email with their results once they've completed the survey.

Although I'm still hesitant to participate in something like this myself, I like the idea, and I'm very interested to see the results of their survey. Hit the link below if you'd like to take the quick survey.

Personal Genomics and Data Sharing Survey

Monday, August 29, 2011

Bioinformatics Posters Collection

I mentioned BioStar in a previous post about getting all your questions answered. I can't emphasize enough how helpful BioStar and the other StackExchange communities are. Whenever I ask a statistics question on CrossValidated or a programming question on StackOverflow, I often get multiple answers within 10 minutes.

Recently there was a question on BioStar from someone preparing for a bioinformatics poster presentation who wanted some inspiration for design and layout. No fewer than 7 community members posted responses the same day, linking to sites where you can download poster presentations, including VIZBI 2011 (workshop on visualizing biological data), F1000 Posters (which collects posters from the Intelligent Systems for Molecular Biology conference), Nature Precedings (not specifically limited to bioinformatics), and several others.

While you can see plenty of posters at the meeting you're attending, that isn't much help when you're trying to design and lay out your own poster beforehand. I've used the same tired old template for poster presentations for years, and it's helpful to see examples of other bioinformatics posters for fresh ideas about design and layout.

I would also encourage you to deposit some of your own posters in places like F1000 (deposit link) or Nature Precedings (submission link). While these aren't peer-reviewed, depositing can really increase the visibility of your work, and it gives you a permanent DOI (at least for Nature Precedings) that you can link to or reference in other scientific communication.

See this Q&A at BioStar for more.

Monday, August 22, 2011

Estimating Trait Heritability from GWAS Data

Peter Visscher and colleagues have recently published a flurry of papers employing a new software package called GCTA to estimate the heritability of traits using GWAS data (GCTA stands for Genome-wide Complex Trait Analysis -- clever acronymity!). The tool, supported (and presumably coded) by Jian Yang, is remarkably easy to use, based in part on the familiar PLINK command-line interface. The GCTA homepage provides an excellent walk-through of the available options.

The basic idea is to use GWAS data to estimate the degree of "genetic sharing" or relatedness among the samples, computing what the authors call a genetic relationship matrix (GRM). The degree of genetic sharing among samples is then related to the amount of phenotypic sharing using restricted maximum likelihood analysis (REML). The result is an estimate of the variance explained by the SNPs used to generate the GRM. Full details of the stats along with all the gory matrix notation can be found in their software publication.
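
To make the GRM idea concrete, here's a toy R sketch of the basic computation on simulated 0/1/2 genotypes. This is just an illustration, not GCTA's code (GCTA works from PLINK binary files and, among other things, uses a slightly different formula on the diagonal), and the REML step is omitted entirely:

## Toy sketch of a genetic relationship matrix (GRM) on simulated data -- not GCTA itself.
set.seed(42)
n <- 5; m <- 1000
geno <- matrix(rbinom(n * m, size = 2, prob = 0.3), nrow = n)  # n people x m SNPs, allele counts 0/1/2
p <- colMeans(geno) / 2                       # estimated allele frequency at each SNP
keep <- p > 0 & p < 1                         # drop monomorphic SNPs to avoid dividing by zero
geno <- geno[, keep]; p <- p[keep]
z <- sweep(geno, 2, 2 * p, "-")               # center each SNP by 2p
z <- sweep(z, 2, sqrt(2 * p * (1 - p)), "/")  # scale by sqrt(2p(1-p))
grm <- tcrossprod(z) / ncol(z)                # A_jk = (1/m) * sum_i z_ij * z_ik
round(grm, 2)                                 # ~1 on the diagonal, ~0 off-diagonal for unrelateds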

The approach has been applied to several disorders studied by the WTCCC and to a recent study of human height. Interestingly, the developers have also used the approach to partition the trait variance across chromosomes, resulting in something similar to population-based variance-components linkage analysis. The approach works for both quantitative and dichotomous traits; however, the authors warn that variance estimates of dichotomous trait liability are influenced by genotyping artifacts.

The package also includes several other handy features, including a relatively easy way to estimate principal components for population structure correction, a GWAS simulation tool, and a regression-based LD mapping tool. Download and play -- a binary is available for Linux, MacOS, and DOS/Windows.

Monday, August 15, 2011

Sync Your Rprofile Across Multiple R Installations

Your Rprofile is a script that R executes every time you launch an R session. You can use it to automatically load packages, set your working directory, set options, define useful functions, set up database connections, and run any other code you want every time you start R.
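
For example, a bare-bones Rprofile might contain just a few lines like these (a stripped-down excerpt of the fuller script at the end of this post; adjust the directory and packages to taste):

options(show.signif.stars=FALSE)  # set global options
setwd("~/R")                      # start every session in your usual working directory
require(ggplot2)                  # load packages you use constantly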

If you're using R on Linux, it's a hidden file in your home directory called ~/.Rprofile, and if you're on Windows, it's usually in the program files directory: C:\Program Files\R\R-2.12.2\library\base\R\Rprofile. I sync my Rprofile across several machines and operating systems by creating a separate script called syncprofile.R and storing it in my Dropbox. Then, on each machine, I edit the real Rprofile to source the syncprofile.R script that resides in my Dropbox.

One of the disadvantages of doing this, however, is that all the functions you define and variables you create are sourced into the global environment (.GlobalEnv). This can clutter your workspace, and if you want to start clean using rm(list=ls(all=TRUE)), you'll have to re-source your syncprofile.R script every time.

It's easy to get around this problem. Rather than simply appending source("/path/to/dropbox/syncprofile.R") to the end of your actual Rprofile, first create a new environment, source that script into that new environment, and attach that new environment. So you'll add this to the end of your real Rprofile on each machine/installation:

my.env <- new.env()
sys.source("C:/Users/st/Dropbox/R/Rprofile.r", my.env)
attach(my.env)

All the functions and variables you've defined are now available but they no longer clutter up the global environment.
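
To confirm it worked, check the search path. Anything defined in syncprofile.R (for example, the hh() function from the script below) lives in my.env rather than in the global environment, so it survives a workspace wipe:

ls(my.env)              # functions/variables defined in syncprofile.R live here
"hh" %in% ls()          # FALSE -- not cluttering the global environment
rm(list=ls(all=TRUE))   # wipes the global environment only
search()                # "my.env" is still attached, so hh() still works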

If you have code that you only want to run on specific machines, you can still put that into each installation's Rprofile rather than the syncprofile.R script that you sync using Dropbox. Here's what my syncprofile.R script looks like - feel free to take whatever looks useful to you.

# To source this file into an environment to avoid cluttering the global workspace, put this in Rprofile:
# my.env <- new.env(); sys.source("C:/PathTo/THIS_FILE.r", my.env); attach(my.env)
#-----------------------------------------------------------------------
# Load packages, set options and cwd, set up database connection
#-----------------------------------------------------------------------
## Load packages
# require(ggplot2) #plotting
# require(plyr) #data manipulation
# require(reshape) #data manipulation
# require(sqldf) #manipulate data frame with SQL
## Sets the working directory to C:/R
setwd("~/R")
## Don't show those silly significance stars
options(show.signif.stars=FALSE)
## Do you want to automatically convert strings to factor variables in a data.frame?
options(stringsAsFactors=TRUE)
## Hard code the US repository for CRAN so it doesn't ask me every time.
r <- getOption("repos")
r["CRAN"] <- "http://cran.us.r-project.org"
options(repos = r)
rm(r)
## Some SQLite stuff I don't use any more because I switched to MySQL
# require(RSQLite)
# channel <- dbConnect(SQLite(), "C:/cygwin/home/sturner/dbs/sdt.sqlite")
# query <- function(...) dbGetQuery(channel,...)
## Set up ODBC connection for MySQL localhost, and make it easy to query a database with query() function.
require(RODBC) # The rprofile script will fail here if you don't have RODBC installed.
channel <- odbcConnect("localhost")
query <- function(...) sqlQuery(channel, ...)
#-----------------------------------------------------------------------
# Functions
#-----------------------------------------------------------------------
## Transpose a numeric data frame with ID in first column
tdf <- function(d) {
row.names(d) <- d[[1]]
d[[1]] <- NULL
d <- as.data.frame(t(d))
d$id <- row.names(d)
d <- cbind(d[ncol(d)], d[-ncol(d)])
row.names(d) <- NULL
d
}
## Convert all columns of a data frame to factor variables (pass a subset to convert only selected columns); returns a list
factorcols <- function(d, ...) lapply(d, function(x) factor(x, ...))
## Returns a logical vector TRUE for elements of X not in Y
"%nin%" <- function(x, y) !(x %in% y)
## Returns names(df) in single column, numbered matrix format.
n <- function(df) matrix(names(df))
## Single character shortcuts for summary() and head().
s <- base::summary
h <- utils::head
## ht==headtail, i.e., show the first and last 10 items of an object
ht <- function(d) rbind(head(d,10),tail(d,10))
## Show the first 5 rows and first 5 columns of a data frame or matrix
hh <- function(d) d[1:5,1:5]
## Read data on clipboard.
read.cb <- function(...) {
ismac <- Sys.info()[1]=="Darwin"
if (!ismac) read.table(file="clipboard", ...)
else read.table(pipe("pbpaste"), ...)
}
## Open current directory on mac
macopen <- function(...) system("open .")
## name("test.png") results in "C:/R/2010-04-20-test.png" if running this in C:/R on April 20 2010.
name <- function(filename="filename") paste(getwd(),"/",Sys.Date(),"-",filename,sep="")
## Takes a dataframe and a column name, and moves that column to the front of the DF.
moveColFront <- function(d=dataframe, colname="colname") {
index <- match(colname, names(d))
cbind(d[index],d[-index])
}
## Permutes a column in a data.frame, sets seed optionally
permute <- function (dataframe, columnToPermute="column", seed=NULL) {
if (!is.null(seed)) set.seed(seed)
colindex <- which(names(dataframe)==columnToPermute)
permutedcol <- dataframe[ ,colindex][sample(1:nrow(dataframe))]
dataframe[colindex] <- permutedcol
return(dataframe)
}
## Summarize missing data in a data frame. Return a list (lpropmiss) or data frame (propmiss)
lpropmiss <- function(dataframe) lapply(dataframe,function(x) data.frame(nmiss=sum(is.na(x)), n=length(x), propmiss=sum(is.na(x))/length(x)))
propmiss <- function(dataframe) {
m <- sapply(dataframe, function(x) {
data.frame(
nmiss=sum(is.na(x)),
n=length(x),
propmiss=sum(is.na(x))/length(x)
)
})
d <- data.frame(t(m))
d <- sapply(d, unlist)
d <- as.data.frame(d)
d$variable <- row.names(d)
row.names(d) <- NULL
d <- cbind(d[ncol(d)],d[-ncol(d)])
return(d[order(d$propmiss), ])
}
## Make a pretty QQ plot of p-values
qq = function(pvector, ...) {
if (!is.numeric(pvector)) stop("D'oh! P value vector is not numeric.")
pvector <- pvector[!is.na(pvector) & pvector<1 & pvector>0]
o = -log10(sort(pvector,decreasing=F))
#e = -log10( 1:length(o)/length(o) )
e = -log10( ppoints(length(pvector) ))
plot(e,o,pch=19,cex=1, xlab=expression(Expected~~-log[10](italic(p))), ylab=expression(Observed~~-log[10](italic(p))), xlim=c(0,max(e)), ylim=c(0,max(o)), ...)
abline(0,1,col="red")
}
## Draw a histogram with normal overlay (From http://www.statmethods.net/graphs/density.html)
histnormal <- function(d, main=NULL, xlab=NULL, breaks="FD", ...) {
if (any(is.na(d))) warning(paste(sum(is.na(d)), "missing values")); d <- na.omit(d)
h <- hist(d, plot=FALSE, breaks=breaks, ...)
x <- seq(min(d), max(d), length=40)
y <- dnorm(x, mean=mean(d), sd=sd(d))
y <- y*diff(h$mids[1:2])*length(d)
hist(d, col="gray50", main=main, xlab=xlab, ylim=c(0,max(y)), breaks=breaks,...)
lines(x,y, col="blue", lwd=2)
rug(x)
}
## Draw a histogram with density overlay
histdensity <- function(x, main=NULL, breaks="FD", ...) {
if (any(is.na(x))) warning(paste(sum(is.na(x)), "missing values")); x <- na.omit(x)
hist(x, col="gray50", probability=TRUE, breaks=breaks, main=main, ...)
lines(density(x, na.rm = TRUE), col = "blue", lwd=2)
rug(x)
}
## Plot scatterplot with trendline and confidence interval (From http://tinyurl.com/3bvrth7)
scatterci <- function(x, y, ...) {
plot(x, y, ...)
mylm <- lm(y~x)
abline(mylm, col="blue")
x=sort(x)
prd<-predict(mylm,newdata=data.frame(x=x),interval = c("confidence"), level = 0.95)
lines(x,prd[,2],col="blue",lty=3)
lines(x,prd[,3],col="blue",lty=3)
}
## Get the proportion of variation explained. See this website for more details: http://goo.gl/jte8X
rsq <- function(predicted, actual) 1-sum((actual-predicted)^2)/sum((actual-mean(actual))^2)
## Correlation matrix with p-values. See http://goo.gl/nahmV for documentation of this function
cor.prob <- function(X, dfr = nrow(X) - 2) {
R <- cor(X)
above <- row(R) < col(R)
r2 <- R[above]^2
Fstat <- r2 * dfr / (1 - r2)
R[above] <- 1 - pf(Fstat, 1, dfr)
R[row(R)==col(R)]<-NA
R
}
## This function accepts a GLM object and does a LR chi-square test on the fit.
lrt <- function (model) {
lrtest.chi2 <- model$null.deviance - model$deviance # Difference in deviance between model with intercept only and full model. This is the likelihood ratio test statistic (-2(log(L))).
lrtest.df <- model$df.null - model$df.residual # Difference in DF. Make sure this equals the number of predictors in the model!
fitpval <- 1-pchisq(lrtest.chi2,lrtest.df)
cat("Likelihood ratio test on model fit:\n\n")
data.frame(lrtest.chi2=lrtest.chi2,lrtest.df=lrtest.df,fitpval=fitpval) #Output gives you the chisquare, df, and p-value.
}
## This function does the same thing as lrtest in the Design package, but doesn't do as much checking.
## Remember, the two models must be fit on the same observations.
## Also, drop1(fullmodel, test="Chisq") does something similar.
lrt2 <- function (full,reduced) {
if (reduced$deviance<=full$deviance) stop ("Reduced model not worse than full.")
if (reduced$df.residual<=full$df.residual) stop ("Reduced model doesn't have more degrees of freedom.")
lrtest.chi2 <- reduced$deviance-full$deviance
lrtest.df <- reduced$df.residual - full$df.residual
fitpval <- 1-pchisq(lrtest.chi2,lrtest.df)
cat("Likelihood ratio test on two models:\n\n")
data.frame(lrtest.chi2=lrtest.chi2,lrtest.df=lrtest.df,fitpval=fitpval)
}
## This gets the overall anova p-value out of a linear model object
lmp <- function (modelobject) {
if (class(modelobject) != "lm") stop("Not an object of class 'lm' ")
f <- summary(modelobject)$fstatistic
p <- pf(f[1],f[2],f[3],lower.tail=F)
attributes(p) <- NULL
return(p)
}
## Function for arranging ggplots (requires the grid package). Use png(); arrange_ggplot2(p1, p2, ncol=1); dev.off() to save.
vp.layout <- function(x, y) viewport(layout.pos.row=x, layout.pos.col=y)
arrange_ggplot2 <- function(..., nrow=NULL, ncol=NULL, as.table=FALSE) {
dots <- list(...)
n <- length(dots)
if(is.null(nrow) & is.null(ncol)) { nrow = floor(n/2) ; ncol = ceiling(n/nrow)}
if(is.null(nrow)) { nrow = ceiling(n/ncol)}
if(is.null(ncol)) { ncol = ceiling(n/nrow)}
## NOTE see n2mfrow in grDevices for possible alternative
grid.newpage()
pushViewport(viewport(layout=grid.layout(nrow,ncol) ) )
ii.p <- 1
for(ii.row in seq(1, nrow)){
ii.table.row <- ii.row
if(as.table) {ii.table.row <- nrow - ii.table.row + 1}
for(ii.col in seq(1, ncol)){
ii.table <- ii.p
if(ii.p > n) break
print(dots[[ii.table]], vp=vp.layout(ii.table.row, ii.col))
ii.p <- ii.p + 1
}
}
}
## Imputes the median value of a vector, matrix, or data frame.
## Stolen from na.roughfix function in the randomForest package.
na.roughfix <- function (object=NULL, ...) {
if (class(object) == "data.frame") {
isfac <- sapply(object, is.factor)
isnum <- sapply(object, is.numeric)
if (any(!(isfac | isnum)))
stop("dfMedianImpute only works for numeric or factor")
roughfix <- function(x) {
if (any(is.na(x))) {
if (is.factor(x)) {
freq <- table(x)
x[is.na(x)] <- names(freq)[which.max(freq)]
}
else {
x[is.na(x)] <- median(x, na.rm = TRUE)
}
}
x
}
object[] <- lapply(object, roughfix)
return(object)
}
else if(is.atomic(object)) {
d <- dim(object)
if (length(d) > 2)
stop("vectorMedianImpute can't handle objects with more than two dimensions")
if (all(!is.na(object)))
return(object)
if (!is.numeric(object))
stop("vectorMedianImpute can only deal with numeric data.")
if (length(d) == 2) {
hasNA <- which(apply(object, 2, function(x) any(is.na(x))))
for (j in hasNA) object[is.na(object[, j]), j] <- median(object[,
j], na.rm = TRUE)
}
else {
object[is.na(object)] <- median(object, na.rm = TRUE)
}
return(object)
}
else stop("Object is not a data frame or atomic vector")
}
## Makes a better scatterplot matrix.
## Stolen from the PerformanceAnalytics package: http://cran.r-project.org/web/packages/PerformanceAnalytics/index.html
## Also see http://moderntoolmaking.blogspot.com/2011/08/graphically-analyzing-variable.html
## To color code points based on levels of a factor, use these args:
## betterpairs(d, bg=c("red","blue")[d$factor], pch=21)
betterpairs <- function (R, histogram = TRUE, ...)
{
x=as.matrix(R) # in PerformanceAnalytics: x = checkData(R, method = "matrix")
if (mode(x)!="numeric") stop("Must pass in only numeric values")
panel.cor <- function(x, y, digits = 2, prefix = "", use = "pairwise.complete.obs", cex.cor, ...) {
usr <- par("usr")
on.exit(par(usr))
par(usr = c(0, 1, 0, 1))
r <- abs(cor(x, y, use = use))
txt <- format(c(r, 0.123456789), digits = digits)[1]
txt <- paste(prefix, txt, sep = "")
if (missing(cex.cor)) cex <- 0.8/strwidth(txt)
test <- cor.test(x, y)
Signif <- symnum(test$p.value, corr = FALSE, na = FALSE, cutpoints = c(0, 0.001, 0.01, 0.05, 0.1, 1), symbols = c("***", "**", "*", ".", " "))
text(0.5, 0.5, txt, cex = cex * r)
text(0.8, 0.8, Signif, cex = cex, col = 2)
}
f <- function(t) dnorm(t, mean = mean(x), sd = sd(x))
# Useful function for histogram showing density overlay and rug
hist.panel = function(x, ...) {
par(new = TRUE)
hist(x, col = "light gray", probability = TRUE, axes = FALSE, main = "", breaks = "FD")
lines(density(x, na.rm = TRUE), col = "red", lwd = 1)
rug(x)
}
if (histogram) pairs(x, gap = 0, lower.panel = panel.smooth, upper.panel = panel.cor, diag.panel = hist.panel, ...)
else pairs(x, gap = 0, lower.panel = panel.smooth, upper.panel = panel.cor, ...)
}
# Did you make it this far?
message("\n******************************\nSuccessfully loaded Rprofile.r\n******************************")

Friday, August 5, 2011

Friday Links: R, OpenHelix Bioinformatics Tips, 23andMe, Perl, Python, Next-Gen Sequencing

I haven't posted much here recently, but here is a roundup of a few of the links I've shared on Twitter (@genetics_blog) over the last two weeks.

Here is a nice tutorial on accessing high-throughput public data (from NCBI) using R and Bioconductor.

Cloudnumbers.com, a startup that allows you to run high-performance computing (HPC) applications in the cloud, now supports the previously mentioned R IDE, RStudio.

23andMe announced a project to enroll 10,000 African-Americans for research by giving participants their personal genome service for free. You can read about it here at 23andMe or here at Genetic Future.

Speaking of 23andMe, they emailed me a coupon code (8WR9U9) for getting $50 off their personal genome service, making it $49 instead of $99. Not sure how long it will last.

I previously ran a poll which showed that most of you use Mendeley to manage your references. Mendeley recently released version 1.0, which includes some nice features like duplicate detection, better library organization (subfolders!), and a better file organization tool. You can download it here.

An interesting blog post by Michael Barton on how training and experience in bioinformatics leads to a wide set of transferable skills.

Dienekes releases a free DIY admixture program to analyze genomic ancestry.

A few tips from OpenHelix: the new SIB Bioinformatics Resource Portal, and testing correlation between SNPs and gene expression using SNPexp.

A nice animation describing a Circos plot from PacBio's E. coli paper in NEJM.

The Court of Appeals for the Federal Circuit reversed the lower court's invalidation of Myriad Genetics' patents on BRCA1/2, reinstating most of the claims in full force. Thoughtful analysis from Dan Vorhaus here.

Using the Linux shell and Perl to delete files in the current directory that don't contain the right number of lines: if you want to get rid of all .txt files in the current directory that don't have exactly 42 lines, run this at the command line (*be very careful with this one!*): for f in *.txt; do perl -ne 'END{unlink $ARGV unless $.==42}' ${f}; done

The previously mentioned Hitchhiker's Guide to Next-Generation Sequencing by Gabe Rudy at Golden Helix is now available in PDF format here. You can also find the related post describing all the various file formats used in NGS in PDF format here.

The Washington Post ran an article about the Khan Academy (http://www.khanacademy.org/), which has thousands of free video lectures, mostly on math. There are also a few computer science lectures that teach Python programming. (Salman Khan also appeared on the Colbert Report a few months ago).

Finally, I stumbled across this old question on BioStar with lots of answers about methods for short read mapping with next-generation sequencing data.

...

And here are a few interesting papers I shared:

Nature Biotechnology: Structural variation in two human genomes mapped at single-nucleotide resolution by whole genome de novo assembly

PLoS Genetics: Gene-Based Tests of Association

PLoS Genetics: Fine Mapping of Five Loci Associated with Low-Density Lipoprotein Cholesterol Detects Variants That Double the Explained Heritability

Nature Reviews Genetics: Systems-biology approaches for predicting genomic evolution


Genome Research: A comprehensively molecular haplotype-resolved genome of a European individual (paper about the importance of phase in genetic studies)

Nature Reviews Microbiology: Unravelling the effects of the environment and host genotype on the gut microbiome.