I was recently contacted by three German biologists working on a project evaluating opinions on sharing raw data from DTC genetic testing companies like 23andMe. A handful of people, like the gang at Genomes Unzipped, the PGP-10, and others at SNPedia, have released their own genotype or sequencing data into the public domain. As of now, data like this is scattered around the web, and most of it is not attached to any phenotype data.
These three biologists are working on a website that collects genetic data as well as phenotypic data. The hope is to make it easy to find and access appropriate data and to become a resource for a kind of open-source GWAS - similar to the research 23andMe performs in its walled garden right now.
But because of privacy concerns, many people (myself included) hesitate to freely publish their genetic data for the world to see. These three biologists are conducting a survey to assess how willing people might be to participate in something like this, and why they would (or would not). The survey can be accessed at http://bit.ly/genotyping_survey. It took me about two minutes to complete, and you can optionally sign up to receive an email with the results once the survey is finished.
Although I'm still hesitant to participate in something like this myself, I like the idea, and I'm very interested to see the results of their survey. Hit the link below if you'd like to take the quick survey.
Personal Genomics and Data Sharing Survey
Wednesday, August 31, 2011
Monday, August 29, 2011
Bioinformatics Posters Collection
I mentioned BioStar in a previous post about getting all your questions answered. I can't emphasize enough how helpful BioStar and the other StackExchange communities are. Whenever I ask a statistics question on CrossValidated or a programming question on StackOverflow, I often get multiple answers within 10 minutes.
Recently there was a question on BioStar from someone who was making a poster for a bioinformatics presentation and wanted some inspiration for design and layout. No fewer than 7 community members posted responses the same day, linking to sites where you can download poster presentations, including VIZBI 2011 (a workshop on visualizing biological data), F1000 Posters (which collects posters from the Intelligent Systems for Molecular Biology conference), Nature Precedings (not specifically limited to bioinformatics), and several others.
While you can see plenty of posters at the meeting you're attending, that isn't much help when you're trying to design and lay out your own poster beforehand. I've used the same tired old template for poster presentations for years, and it's helpful to see examples of other bioinformatics posters for fresh ideas about design and layout.
I would also encourage you to deposit some of your posters in places like F1000 (deposit link) or Nature Precedings (submission link). While these aren't peer-reviewed, it can really increase the visibility of your work, and it gives you a permanent DOI (at least for Nature Precedings) that you can link to or reference in other scientific communication.
See this Q&A at BioStar for more.
Tags:
Bioinformatics
Monday, August 22, 2011
Estimating Trait Heritability from GWAS Data
Peter Visscher and colleagues have recently published a flurry of papers employing a new software package called GCTA to estimate the heritability of traits using GWAS data (GCTA stands for Genome-wide Complex Trait Analysis -- clever acronymity!). The tool, supported (and presumably coded) by Jian Yang, is remarkably easy to use, based in part on the familiar PLINK command-line interface. The GCTA homepage provides an excellent walk-through of the available options.
The basic idea is to use GWAS data to estimate the degree of "genetic sharing" or relatedness among the samples, computing what the authors call a genetic relationship matrix (GRM). The degree of genetic sharing among samples is then related to the amount of phenotypic sharing using restricted maximum likelihood analysis (REML). The result is an estimate of the variance explained by the SNPs used to generate the GRM. Full details of the stats along with all the gory matrix notation can be found in their software publication.
The approach has been applied to several disorders studied by the WTCCC and to a recent study of human height. Interestingly, the developers have also used the approach to partition trait variance across chromosomes, resulting in something similar to population-based variance-components linkage analysis. The approach works for both quantitative and dichotomous traits; however, the authors warn that variance estimates of dichotomous trait liability are influenced by genotyping artifacts.
The package also includes several other handy features, including a relatively easy way to estimate principal components for population structure correction, a GWAS simulation tool, and a regression-based LD mapping tool. Download and play -- a binary is available for Linux, MacOS, and DOS/Windows.
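The core GRM-then-REML workflow boils down to two commands. Here's a sketch -- the file name prefixes are placeholders for your own PLINK-format data, and the flags are the ones described on the GCTA homepage:

```shell
# Step 1: build the genetic relationship matrix (GRM) from PLINK binary files
# ("mydata" stands in for your own .bed/.bim/.fam prefix)
gcta64 --bfile mydata --autosome --make-grm --out mydata_grm

# Step 2: estimate the variance explained by the SNPs via REML,
# using the GRM from step 1 and a phenotype file
gcta64 --grm mydata_grm --pheno mydata.phen --reml --out mydata_reml
```

The REML output reports the genetic and residual variance components along with the estimate of variance explained by the SNPs.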
Tags:
GWAS,
Recommended Reading,
Software
Monday, August 15, 2011
Sync Your Rprofile Across Multiple R Installations
Your Rprofile is a script that R executes every time you launch an R session. You can use it to automatically load packages, set your working directory, set options, define useful functions, set up database connections, and run any other code you want every time you start R.
If you're using R on Linux, it's a hidden file in your home directory called ~/.Rprofile, and if you're on Windows, it's usually in the program files directory: C:\Program Files\R\R-2.12.2\library\base\R\Rprofile. I sync my Rprofile across several machines and operating systems by creating a separate script called syncprofile.R and storing it in my Dropbox. Then, on each machine, I edit the real Rprofile to source the syncprofile.R script that resides in my Dropbox.
One of the disadvantages of doing this, however, is that all the functions you define and variables you create are sourced into the global environment (.GlobalEnv). This can clutter your workspace, and if you want to start clean using rm(list=ls(all=TRUE)), you'll have to re-source your syncprofile.R script every time.
It's easy to get around this problem. Rather than simply appending source("/path/to/dropbox/syncprofile.R") to the end of your actual Rprofile, first create a new environment, source that script into the new environment, and attach it. So you'll add this to the end of your real Rprofile on each machine/installation:
my.env <- new.env()
sys.source("C:/Users/st/Dropbox/R/Rprofile.r", my.env)
attach(my.env)
All the functions and variables you've defined are now available, but they no longer clutter up the global environment.
If you have code that you only want to run on specific machines, you can still put that into each installation's Rprofile rather than the syncprofile.R script that you sync using Dropbox. Here's what my syncprofile.R script looks like - feel free to take whatever looks useful to you.
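To give a flavor, here's a minimal sketch of the kind of thing a syncprofile.R might contain -- the specific options and helper functions below are illustrative, not a copy of my actual script:

```r
# Illustrative syncprofile.R: options and helpers shared across machines

# options applied at every startup
options(stringsAsFactors = FALSE)  # don't auto-convert character columns to factors
options(scipen = 10)               # prefer fixed over scientific notation when printing

# small helper: peek at the top-left corner of a large data frame or matrix
corner <- function(x, n = 5) x[1:min(n, nrow(x)), 1:min(n, ncol(x))]

# small helper: list the objects exported by an attached package, e.g. lsp("stats")
lsp <- function(package) ls(pos = paste("package", package, sep = ":"))
```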
Tags:
Productivity,
R
Friday, August 5, 2011
Friday Links: R, OpenHelix Bioinformatics Tips, 23andMe, Perl, Python, Next-Gen Sequencing
I haven't posted much here recently, but here is a roundup of a few of the links I've shared on Twitter (@genetics_blog) over the last two weeks.
Here is a nice tutorial on accessing high-throughput public data (from NCBI) using R and Bioconductor.
Cloudnumbers.com, a startup that allows you to run high-performance computing (HPC) applications in the cloud, now supports the previously mentioned R IDE, RStudio.
23andMe announced a project to enroll 10,000 African-Americans for research by giving participants their personal genome service for free. You can read about it here at 23andMe or here at Genetic Future.
Speaking of 23andMe, they emailed me a coupon code (8WR9U9) for getting $50 off their personal genome service, making it $49 instead of $99. Not sure how long it will last.
I previously ran a poll which showed that most of you use Mendeley to manage your references. Mendeley recently released version 1.0, which includes some nice features like duplicate detection, better library organization (subfolders!), and a better file organization tool. You can download it here.
An interesting blog post by Michael Barton on how training and experience in bioinformatics leads to a wide set of transferable skills.
Dienekes releases a free DIY admixture program to analyze genomic ancestry.
A few tips from OpenHelix: the new SIB Bioinformatics Resource Portal, and testing correlation between SNPs and gene expression using SNPexp.
A nice animation describing a Circos plot from PacBio's E. coli paper in NEJM.
The Court of Appeals for the Federal Circuit reversed the lower court's invalidation of Myriad Genetics' patents on BRCA1/2, reinstating most of the claims in full force. Thoughtful analysis from Dan Vorhaus here.
Using the Linux shell and Perl to delete files in the current directory that don't contain the right number of lines: if you want to get rid of all files in the current directory that don't have exactly 42 lines, run this at the command line (*be very careful with this one!*): for f in *.txt; do perl -ne 'END{unlink $ARGV unless $. == 42}' "$f"; done
The previously mentioned Hitchhiker's Guide to Next-Generation Sequencing by Gabe Rudy at Golden Helix is now available in PDF format here. You can also find the related post describing all the various file formats used in NGS in PDF format here.
The Washington Post ran an article about the Khan Academy (http://www.khanacademy.org/), which has thousands of free video lectures, mostly on math. There are also a few computer science lectures that teach Python programming. (Salman Khan also appeared on the Colbert Report a few months ago).
Finally, I stumbled across this old question on BioStar with lots of answers about methods for short read mapping with next-generation sequencing data.
...
And here are a few interesting papers I shared:
Nature Biotechnology: Structural variation in two human genomes mapped at single-nucleotide resolution by whole genome de novo assembly
PLoS Genetics: Gene-Based Tests of Association
PLoS Genetics: Fine Mapping of Five Loci Associated with Low-Density Lipoprotein Cholesterol Detects Variants That Double the Explained Heritability
Nature Reviews Genetics: Systems-biology approaches for predicting genomic evolution
Genome Research: A comprehensively molecular haplotype-resolved genome of a European individual (paper about the importance of phase in genetic studies)
Nature Reviews Microbiology: Unravelling the effects of the environment and host genotype on the gut microbiome.
...
Tags:
Bioinformatics,
R,
Recommended Reading,
Statistics,
Tutorials,
Twitter
Monday, July 25, 2011
Scatterplot matrices in R
I just discovered a handy function in R to produce a scatterplot matrix of selected variables in a dataset. The base graphics function is pairs(). Producing these plots can be helpful in exploring your data, especially using the second method below.
Try it out on the built-in iris dataset. (This dataset gives the measurements, in cm, of sepal length and width and petal length and width for 50 flowers from each of 3 species of iris: Iris setosa, versicolor, and virginica.)
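A basic call on the four iris measurements (the "Plot #1" that the panel code further down builds on) looks like this:

```r
# Plot #1: basic scatterplot matrix of the four iris measurements
pairs(~Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
      data = iris, main = "Iris Scatterplot Matrix")
```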
Looking at the pairs help page I found that there's another built-in function, panel.smooth(), that can be used to plot a loess curve for each plot in a scatterplot matrix. Pass this function to the lower.panel argument of the pairs function. The panel.cor() function below can compute the absolute correlation between pairs of variables, and display these in the upper panels, with the font size proportional to the absolute value of the correlation.
# panel.smooth function is built in.
# panel.cor puts correlation in upper panels, size proportional to correlation
panel.cor <- function(x, y, digits=2, prefix="", cex.cor, ...)
{
    usr <- par("usr"); on.exit(par(usr))
    par(usr = c(0, 1, 0, 1))
    r <- abs(cor(x, y))
    txt <- format(c(r, 0.123456789), digits=digits)[1]
    txt <- paste(prefix, txt, sep="")
    if(missing(cex.cor)) cex.cor <- 0.8/strwidth(txt)
    text(0.5, 0.5, txt, cex = cex.cor * r)
}
# Plot #2: same as above, but add loess smoother in lower and correlation in upper
pairs(~Sepal.Length+Sepal.Width+Petal.Length+Petal.Width, data=iris,
      lower.panel=panel.smooth, upper.panel=panel.cor,
      pch=20, main="Iris Scatterplot Matrix")
Finally, you can produce a similar plot using ggplot2, with the diagonal showing the kernel density.
# Plot #3: similar plot using ggplot2
# install.packages("ggplot2")  ## uncomment to install ggplot2
library(ggplot2)
plotmatrix(with(iris, data.frame(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width)))
See more on the pairs function here.
...
Update: A tip of the hat to Hadley Wickham (@hadleywickham) for pointing out two packages useful for scatterplot matrices. The gpairs package has some useful functionality for showing the relationship between both continuous and categorical variables in a dataset, and the GGally package extends ggplot2 for plot matrices.
Tags:
ggplot2,
R,
Visualization
Tuesday, July 12, 2011
Download 69 Complete Human Genomes
Sequencing company Complete Genomics recently made available 69 ethnically diverse complete human genome sequences: a Yoruba trio; a Puerto Rican trio; a 17-member, 3-generation pedigree; and a diversity panel representing 9 different populations. Some of the samples partially overlap with HapMap and the 1000 Genomes Project. The data can be downloaded directly from the FTP site. See the link below for more details on the directory contents, and have a look at the quick start guide to working with complete genomics data.
Complete Genomics - Sample Human Genome Sequence Data
Tags:
1000 genomes,
Sequencing
Thursday, June 30, 2011
Mapping SNPs to Genes for GWAS Enrichment Analysis
There are several tools available for conducting a post-hoc analysis of GWAS data looking for enrichment of significant SNPs using literature or pathway based resources. Examples include GRAIL, ALLIGATOR, and WebGestalt among others (see SNPath R Package). Since gene enrichment and pathway analysis essentially evolved from methods for analyzing gene expression data, many of these tools require specific gene identifiers as input.
To prep GWAS data for analyses such as these, you'll need to ensure that your SNPs have current map position information, and assign each SNP to genes. To update your map information, the easiest resource to use is the UCSC Tables browser.
To use UCSC, go to the TABLES link on the browser page. Select the current assembly, "Variation and Repeats" for group, "Common SNPs(132)" for the track, and "snp132Common" for the table. Now you'll need to upload your SNP list. You can manually edit the map file using a tool like Excel or Linux commands like awk to extract the rs IDs for your SNPs into a text file. Upload this file by clicking the "Upload List" button in the "identifiers" section. Once the file has uploaded, select "BED - browser extensible data" for the output format and click "Get Output". The interface will ask you if you want to add upstream or downstream regions -- for SNPs this makes little sense, so leave these at zero. Also, ensure that "Send to Galaxy" is not checked.
This will export a BED file containing the positions of all your SNPs in the build you selected, preferably the most current one -- let's call it SNPs.bed. BED files were developed by the folks behind the UCSC browser to document genomic annotations. The basic format is:
Chromosome|StartPosition|EndPosition|FeatureLabel
Because BED files were designed to list genomic ranges, SNPs are listed with a range of 1 bp, meaning the StartPosition is the location of the SNP and the EndPosition is the location of the SNP + 1.
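For example, a SNP at position 1,000,000 of chromosome 1 (the rs number here is hypothetical) would appear in SNPs.bed as a single tab-delimited line:

```
chr1	1000000	1000001	rs0000001
```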
The next step is to map these over to genes. To save a lot of effort resolving identifiers (which is no fun, I promise), I have generated a list of ~17,000 genes that are consistently annotated across Ensembl and Entrez-gene databases, and which have HUGO identifiers. These files are available here:
Hugo Genes
Entrez Genes
Ensembl Genes
The genes listed in these files were generated by comparing the cross-references between the Ensembl and Entrez Gene databases. To arrive at a set of "consensus" genes, I selected only genes where Ensembl refers to an Entrez Gene entry with the same coordinates, and that Entrez Gene entry refers back to the same Ensembl gene. Nearly all cases of inconsistent cross-referencing are genes annotated based on electronic predictions, or genes with multiple or inconsistent mappings. For the purpose of doing enrichment analysis, these would be ignored anyway, as they aren't likely to show up in pathway or functional databases. For these genes, we then obtained the HUGO-approved gene identifier. The coordinates for all genes are mapped using hg19/GRCh37.
BED files are plain text and easy to modify. If you wanted to add 10 kb to the gene bounds, for example, you could alter these files using Excel formulas, or with a simple awk command:
awk '{printf("%s\t%d\t%d\t%s\n", $1, $2-10000, $3+10000, $4); }' ensemblgenes.bed  # %s for the chromosome field, which may be non-numeric (e.g., chrX)
You'll need to download a collection of utilities called BEDTools. If you have access to a Linux system, this should be fairly straightforward. These utilities are already installed on Vanderbilt's Linux application servers, but to compile the program yourself, download the .tar.gz file above, and use the following commands to compile:
tar -zxvf BEDTools..tar.gz
cd BEDTools
make clean
make all
ls bin
# copy the binaries to a directory in your PATH. e.g.,
sudo cp bin/* /usr/local/bin
Once you have BEDTools installed, there is a utility called intersectBed, which lets you examine the overlap of two BED files. Since we have our collection of SNPs and our collection of genes in BED format already, all we need to do is run the command:
intersectBed -a SNPs.bed -b entrezgenes.bed -wa -wb
This will print to the screen any SNP that falls within an EntrezGene boundary, along with the coordinates from each entry in each file. To strip this down to only RS numbers and gene identifiers, we can pipe this through awk.
intersectBed -a SNPs.bed -b entrezgenes.bed -wa -wb | awk '{printf("%s\t%s\n", $4, $8);}'
This will produce a tab-delimited file with SNPs that fall within genes identified by EntrezGene ID number. For HUGO names or Ensembl gene IDs, use the corresponding BED file.
Tags:
Bioinformatics,
Pathways,
R