Wednesday, August 31, 2011

Personal Genomics and Data Sharing Survey

I was recently contacted by a few German biologists working on a project evaluating opinions on sharing raw data from DTC genetic testing companies like 23andMe. A handful of people like the gang at Genomes Unzipped, the PGP-10, and others at SNPedia have released their own genotype or sequencing data into the public domain. As of now, data like this is scattered around the web and most of it is not attached to any phenotype data.

These three biologists are working on a website that collects genetic data as well as phenotypic data. The hope is to make it easy to find and access appropriate data and to become a resource for a kind of open-source GWAS - similar to the research 23andMe performs in its walled garden right now.

But because of privacy concerns, many people (myself included) hesitate to freely publish their genetic data for the world to see. These three biologists are conducting a survey to assess how willing people might be to participate in something like this, and for what reasons they would (or would not). The survey can be accessed at http://bit.ly/genotyping_survey. It took about 2 minutes for me to complete, and you can optionally sign up to receive an email with their results once they've completed the survey.

Although I'm still hesitant to participate in something like this myself, I like the idea, and I'm very interested to see the results of their survey. Hit the link below if you'd like to take the quick survey.

Personal Genomics and Data Sharing Survey

Monday, August 29, 2011

Bioinformatics Posters Collection

I mentioned BioStar in a previous post about getting all your questions answered. I can't emphasize enough how helpful BioStar and the other StackExchange communities are. Whenever I ask a statistics question on CrossValidated or a programming question on StackOverflow, I often get multiple answers within 10 minutes.

Recently there was a question on BioStar from someone preparing for a bioinformatics poster presentation who wanted some inspiration for design and layout. No fewer than 7 community members posted responses the same day, linking to sites where you can download poster presentations, including VIZBI 2011 (workshop on visualizing biological data), F1000 Posters (which collects posters from the Intelligent Systems for Molecular Biology conference), Nature Precedings (not specifically limited to bioinformatics), and several others.

While you can see plenty of posters at the meeting you're attending, that isn't much help when you're trying to design and lay out your own poster beforehand. I've used the same tired old template for poster presentations for years, and it's helpful to see examples of other bioinformatics posters for fresh ideas about design and layout.

I would also encourage you to deposit some of your own posters in places like F1000 (deposit link) or Nature Precedings (submission link). While these aren't peer-reviewed, depositing can really increase the visibility of your work, and it gives you a permanent DOI (at least for Nature Precedings) that you can link to or reference in other scientific communication.

See this Q&A at BioStar for more.

Monday, August 22, 2011

Estimating Trait Heritability from GWAS Data

Peter Visscher and colleagues have recently published a flurry of papers employing a new software package called GCTA to estimate the heritability of traits using GWAS data (GCTA stands for Genome-wide Complex Trait Analysis -- clever acronymity!). The tool, supported (and presumably coded) by Jian Yang, is remarkably easy to use, based in part on the familiar PLINK command-line interface. The GCTA homepage provides an excellent walk-through of the available options.

The basic idea is to use GWAS data to estimate the degree of "genetic sharing" or relatedness among the samples, computing what the authors call a genetic relationship matrix (GRM). The degree of genetic sharing among samples is then related to the amount of phenotypic sharing using restricted maximum likelihood analysis (REML). The result is an estimate of the variance explained by the SNPs used to generate the GRM. Full details of the stats along with all the gory matrix notation can be found in their software publication.
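
To make the GRM idea concrete, here's a toy R sketch of the basic computation on simulated 0/1/2 genotypes. This is just an illustration, not GCTA's code (GCTA works from PLINK binary files and, among other things, uses a slightly different formula on the diagonal), and the REML step is omitted entirely:

## Toy sketch of a genetic relationship matrix (GRM) on simulated data -- not GCTA itself.
set.seed(42)
n <- 5; m <- 1000
geno <- matrix(rbinom(n * m, size = 2, prob = 0.3), nrow = n)  # n people x m SNPs, allele counts 0/1/2
p <- colMeans(geno) / 2                       # estimated allele frequency at each SNP
keep <- p > 0 & p < 1                         # drop monomorphic SNPs to avoid dividing by zero
geno <- geno[, keep]; p <- p[keep]
z <- sweep(geno, 2, 2 * p, "-")               # center each SNP by 2p
z <- sweep(z, 2, sqrt(2 * p * (1 - p)), "/")  # scale by sqrt(2p(1-p))
grm <- tcrossprod(z) / ncol(z)                # A_jk = (1/m) * sum_i z_ij * z_ik
round(grm, 2)                                 # ~1 on the diagonal, ~0 off-diagonal for unrelateds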

The approach has been applied to several disorders studied by the WTCCC and to a recent study of human height. Interestingly, the developers have also used the approach to partition the trait variance across chromosomes, resulting in something similar to population-based variance-components linkage analysis. The approach works for both quantitative and dichotomous traits; however, the authors warn that variance estimates of dichotomous trait liability are influenced by genotyping artifacts.

The package also includes several other handy features, including a relatively easy way to estimate principal components for population structure correction, a GWAS simulation tool, and a regression-based LD mapping tool. Download and play -- a binary is available for Linux, MacOS, and DOS/Windows.

Monday, August 15, 2011

Sync Your Rprofile Across Multiple R Installations

Your Rprofile is a script that R executes every time you launch an R session. You can use it to automatically load packages, set your working directory, set options, define useful functions, set up database connections, and run any other code you want every time you start R.
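
For example, a bare-bones Rprofile might contain just a few lines like these (a stripped-down excerpt of the fuller script at the end of this post; adjust the directory and packages to taste):

options(show.signif.stars=FALSE)  # set global options
setwd("~/R")                      # start every session in your usual working directory
require(ggplot2)                  # load packages you use constantly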

If you're using R on Linux, it's a hidden file in your home directory called ~/.Rprofile, and if you're on Windows, it's usually in the program files directory: C:\Program Files\R\R-2.12.2\library\base\R\Rprofile. I sync my Rprofile across several machines and operating systems by creating a separate script called syncprofile.R and storing it in my Dropbox. Then, on each machine, I edit the real Rprofile to source the syncprofile.R script that resides in my Dropbox.

One of the disadvantages of doing this, however, is that all the functions you define and variables you create are sourced into the global environment (.GlobalEnv). This can clutter your workspace, and if you want to start clean using rm(list=ls(all=TRUE)), you'll have to re-source your syncprofile.R script every time.

It's easy to get around this problem. Rather than simply appending source("/path/to/dropbox/syncprofile.R") to the end of your actual Rprofile, first create a new environment, source that script into that new environment, and attach that new environment. So you'll add this to the end of your real Rprofile on each machine/installation:

my.env <- new.env()
sys.source("C:/Users/st/Dropbox/R/Rprofile.r", my.env)
attach(my.env)

All the functions and variables you've defined are now available but they no longer clutter up the global environment.
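
To confirm it worked, check the search path. Anything defined in syncprofile.R (for example, the hh() function from the script below) lives in my.env rather than in the global environment, so it survives a workspace wipe:

ls(my.env)              # functions/variables defined in syncprofile.R live here
"hh" %in% ls()          # FALSE -- not cluttering the global environment
rm(list=ls(all=TRUE))   # wipes the global environment only
search()                # "my.env" is still attached, so hh() still works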

If you have code that you only want to run on specific machines, you can still put that into each installation's Rprofile rather than the syncprofile.R script that you sync using Dropbox. Here's what my syncprofile.R script looks like - feel free to take whatever looks useful to you.

# To source this file into an environment to avoid cluttering the global workspace, put this in Rprofile:
# my.env <- new.env(); sys.source("C:/PathTo/THIS_FILE.r", my.env); attach(my.env)
#-----------------------------------------------------------------------
# Load packages, set options and cwd, set up database connection
#-----------------------------------------------------------------------
## Load packages
# require(ggplot2) #plotting
# require(plyr) #data manipulation
# require(reshape) #data manipulation
# require(sqldf) #manipulate data frame with SQL
## Sets the working directory to C:/R
setwd("~/R")
## Don't show those silly significance stars
options(show.signif.stars=FALSE)
## Do you want to automatically convert strings to factor variables in a data.frame?
options(stringsAsFactors=TRUE)
## Hard code the US repository for CRAN so it doesn't ask me every time.
r <- getOption("repos")
r["CRAN"] <- "http://cran.us.r-project.org"
options(repos = r)
rm(r)
## Some SQLite stuff I don't use any more because I switched to MySQL
# require(RSQLite)
# channel <- dbConnect(SQLite(), "C:/cygwin/home/sturner/dbs/sdt.sqlite")
# query <- function(...) dbGetQuery(channel,...)
## Set up ODBC connection for MySQL localhost, and make it easy to query a database with query() function.
require(RODBC) # The rprofile script will fail here if you don't have RODBC installed.
channel <- odbcConnect("localhost")
query <- function(...) sqlQuery(channel, ...)
#-----------------------------------------------------------------------
# Functions
#-----------------------------------------------------------------------
## Transpose a numeric data frame with ID in first column
tdf <- function(d) {
row.names(d) <- d[[1]]
d[[1]] <- NULL
d <- as.data.frame(t(d))
d$id <- row.names(d)
d <- cbind(d[ncol(d)], d[-ncol(d)])
row.names(d) <- NULL
d
}
## Convert all columns of a data frame to factor variables (pass a subset to convert only selected columns); returns a list
factorcols <- function(d, ...) lapply(d, function(x) factor(x, ...))
## Returns a logical vector TRUE for elements of X not in Y
"%nin%" <- function(x, y) !(x %in% y)
## Returns names(df) in single column, numbered matrix format.
n <- function(df) matrix(names(df))
## Single character shortcuts for summary() and head().
s <- base::summary
h <- utils::head
## ht==headtail, i.e., show the first and last 10 items of an object
ht <- function(d) rbind(head(d,10),tail(d,10))
## Show the first 5 rows and first 5 columns of a data frame or matrix
hh <- function(d) d[1:5,1:5]
## Read data on clipboard.
read.cb <- function(...) {
ismac <- Sys.info()[1]=="Darwin"
if (!ismac) read.table(file="clipboard", ...)
else read.table(pipe("pbpaste"), ...)
}
## Open current directory on mac
macopen <- function(...) system("open .")
## name("test.png") results in "C:/R/2010-04-20-test.png" if running this in C:/R on April 20 2010.
name <- function(filename="filename") paste(getwd(),"/",Sys.Date(),"-",filename,sep="")
## Takes a dataframe and a column name, and moves that column to the front of the DF.
moveColFront <- function(d=dataframe, colname="colname") {
index <- match(colname, names(d))
cbind(d[index],d[-index])
}
## Permutes a column in a data.frame, sets seed optionally
permute <- function (dataframe, columnToPermute="column", seed=NULL) {
if (!is.null(seed)) set.seed(seed)
colindex <- which(names(dataframe)==columnToPermute)
permutedcol <- dataframe[ ,colindex][sample(1:nrow(dataframe))]
dataframe[colindex] <- permutedcol
return(dataframe)
}
## Summarize missing data in a data frame. Return a list (lpropmiss) or data frame (propmiss)
lpropmiss <- function(dataframe) lapply(dataframe,function(x) data.frame(nmiss=sum(is.na(x)), n=length(x), propmiss=sum(is.na(x))/length(x)))
propmiss <- function(dataframe) {
m <- sapply(dataframe, function(x) {
data.frame(
nmiss=sum(is.na(x)),
n=length(x),
propmiss=sum(is.na(x))/length(x)
)
})
d <- data.frame(t(m))
d <- sapply(d, unlist)
d <- as.data.frame(d)
d$variable <- row.names(d)
row.names(d) <- NULL
d <- cbind(d[ncol(d)],d[-ncol(d)])
return(d[order(d$propmiss), ])
}
## Make a pretty QQ plot of p-values
qq = function(pvector, ...) {
if (!is.numeric(pvector)) stop("D'oh! P value vector is not numeric.")
pvector <- pvector[!is.na(pvector) & pvector<1 & pvector>0]
o = -log10(sort(pvector,decreasing=F))
#e = -log10( 1:length(o)/length(o) )
e = -log10( ppoints(length(pvector) ))
plot(e,o,pch=19,cex=1, xlab=expression(Expected~~-log[10](italic(p))), ylab=expression(Observed~~-log[10](italic(p))), xlim=c(0,max(e)), ylim=c(0,max(o)), ...)
abline(0,1,col="red")
}
## Draw a histogram with normal overlay (From http://www.statmethods.net/graphs/density.html)
histnormal <- function(d, main=NULL, xlab=NULL, breaks="FD", ...) {
if (any(is.na(d))) warning(paste(sum(is.na(d)), "missing values")); d <- na.omit(d)
h <- hist(d, plot=FALSE, breaks=breaks, ...)
x <- seq(min(d), max(d), length=40)
y <- dnorm(x, mean=mean(d), sd=sd(d))
y <- y*diff(h$mids[1:2])*length(d)
hist(d, col="gray50", main=main, xlab=xlab, ylim=c(0,max(y)), breaks=breaks,...)
lines(x,y, col="blue", lwd=2)
rug(x)
}
## Draw a histogram with density overlay
histdensity <- function(x, main=NULL, breaks="FD", ...) {
if (any(is.na(x))) warning(paste(sum(is.na(x)), "missing values")); x <- na.omit(x)
hist(x, col="gray50", probability=TRUE, breaks=breaks, main=main, ...)
lines(density(x, na.rm = TRUE), col = "blue", lwd=2)
rug(x)
}
## Plot scatterplot with trendline and confidence interval (From http://tinyurl.com/3bvrth7)
scatterci <- function(x, y, ...) {
plot(x, y, ...)
mylm <- lm(y~x)
abline(mylm, col="blue")
x=sort(x)
prd<-predict(mylm,newdata=data.frame(x=x),interval = c("confidence"), level = 0.95)
lines(x,prd[,2],col="blue",lty=3)
lines(x,prd[,3],col="blue",lty=3)
}
## Get the proportion of variation explained. See this website for more details: http://goo.gl/jte8X
rsq <- function(predicted, actual) 1-sum((actual-predicted)^2)/sum((actual-mean(actual))^2)
## Correlation matrix with p-values. See http://goo.gl/nahmV for documentation of this function
cor.prob <- function(X, dfr = nrow(X) - 2) {
R <- cor(X)
above <- row(R) < col(R)
r2 <- R[above]^2
Fstat <- r2 * dfr / (1 - r2)
R[above] <- 1 - pf(Fstat, 1, dfr)
R[row(R)==col(R)]<-NA
R
}
## This function accepts a GLM object and does a LR chi-square test on the fit.
lrt <- function (model) {
lrtest.chi2 <- model$null.deviance - model$deviance # Difference in deviance between model with intercept only and full model. This is the likelihood ratio test statistic (-2(log(L))).
lrtest.df <- model$df.null - model$df.residual # Difference in DF. Make sure this equals the number of predictors in the model!
fitpval <- 1-pchisq(lrtest.chi2,lrtest.df)
cat("Likelihood ratio test on model fit:\n\n")
data.frame(lrtest.chi2=lrtest.chi2,lrtest.df=lrtest.df,fitpval=fitpval) #Output gives you the chisquare, df, and p-value.
}
## This function does the same thing as lrtest in the Design package, but doesn't do as much checking.
## Remember, the two models must be fit on the same observations.
## Also, drop1(fullmodel, test="Chisq") does something similar.
lrt2 <- function (full,reduced) {
if (reduced$deviance<=full$deviance) stop ("Reduced model not worse than full.")
if (reduced$df.residual<=full$df.residual) stop ("Reduced model doesn't have more degrees of freedom.")
lrtest.chi2 <- reduced$deviance-full$deviance
lrtest.df <- reduced$df.residual - full$df.residual
fitpval <- 1-pchisq(lrtest.chi2,lrtest.df)
cat("Likelihood ratio test on two models:\n\n")
data.frame(lrtest.chi2=lrtest.chi2,lrtest.df=lrtest.df,fitpval=fitpval)
}
## This gets the overall anova p-value out of a linear model object
lmp <- function (modelobject) {
if (class(modelobject) != "lm") stop("Not an object of class 'lm' ")
f <- summary(modelobject)$fstatistic
p <- pf(f[1],f[2],f[3],lower.tail=F)
attributes(p) <- NULL
return(p)
}
## Function for arranging ggplots (requires the grid package). Use png(); arrange_ggplot2(p1, p2, ncol=1); dev.off() to save.
vp.layout <- function(x, y) viewport(layout.pos.row=x, layout.pos.col=y)
arrange_ggplot2 <- function(..., nrow=NULL, ncol=NULL, as.table=FALSE) {
dots <- list(...)
n <- length(dots)
if(is.null(nrow) & is.null(ncol)) { nrow = floor(n/2) ; ncol = ceiling(n/nrow)}
if(is.null(nrow)) { nrow = ceiling(n/ncol)}
if(is.null(ncol)) { ncol = ceiling(n/nrow)}
## NOTE see n2mfrow in grDevices for possible alternative
grid.newpage()
pushViewport(viewport(layout=grid.layout(nrow,ncol) ) )
ii.p <- 1
for(ii.row in seq(1, nrow)){
ii.table.row <- ii.row
if(as.table) {ii.table.row <- nrow - ii.table.row + 1}
for(ii.col in seq(1, ncol)){
ii.table <- ii.p
if(ii.p > n) break
print(dots[[ii.table]], vp=vp.layout(ii.table.row, ii.col))
ii.p <- ii.p + 1
}
}
}
## Imputes the median value of a vector, matrix, or data frame.
## Stolen from na.roughfix function in the randomForest package.
na.roughfix <- function (object=NULL, ...) {
if (class(object) == "data.frame") {
isfac <- sapply(object, is.factor)
isnum <- sapply(object, is.numeric)
if (any(!(isfac | isnum)))
stop("dfMedianImpute only works for numeric or factor")
roughfix <- function(x) {
if (any(is.na(x))) {
if (is.factor(x)) {
freq <- table(x)
x[is.na(x)] <- names(freq)[which.max(freq)]
}
else {
x[is.na(x)] <- median(x, na.rm = TRUE)
}
}
x
}
object[] <- lapply(object, roughfix)
return(object)
}
else if(is.atomic(object)) {
d <- dim(object)
if (length(d) > 2)
stop("vectorMedianImpute can't handle objects with more than two dimensions")
if (all(!is.na(object)))
return(object)
if (!is.numeric(object))
stop("vectorMedianImpute can only deal with numeric data.")
if (length(d) == 2) {
hasNA <- which(apply(object, 2, function(x) any(is.na(x))))
for (j in hasNA) object[is.na(object[, j]), j] <- median(object[,
j], na.rm = TRUE)
}
else {
object[is.na(object)] <- median(object, na.rm = TRUE)
}
return(object)
}
else stop("Object is not a data frame or atomic vector")
}
## Makes a better scatterplot matrix.
## Stolen from the PerformanceAnalytics package: http://cran.r-project.org/web/packages/PerformanceAnalytics/index.html
## Also see http://moderntoolmaking.blogspot.com/2011/08/graphically-analyzing-variable.html
## To color code points based on levels of a factor, use these args:
## betterpairs(d, bg=c("red","blue")[d$factor], pch=21)
betterpairs <- function (R, histogram = TRUE, ...)
{
x=as.matrix(R) # in PerformanceAnalytics: x = checkData(R, method = "matrix")
if (mode(x)!="numeric") stop("Must pass in only numeric values")
panel.cor <- function(x, y, digits = 2, prefix = "", use = "pairwise.complete.obs", cex.cor, ...) {
usr <- par("usr")
on.exit(par(usr))
par(usr = c(0, 1, 0, 1))
r <- abs(cor(x, y, use = use))
txt <- format(c(r, 0.123456789), digits = digits)[1]
txt <- paste(prefix, txt, sep = "")
if (missing(cex.cor)) cex <- 0.8/strwidth(txt)
test <- cor.test(x, y)
Signif <- symnum(test$p.value, corr = FALSE, na = FALSE, cutpoints = c(0, 0.001, 0.01, 0.05, 0.1, 1), symbols = c("***", "**", "*", ".", " "))
text(0.5, 0.5, txt, cex = cex * r)
text(0.8, 0.8, Signif, cex = cex, col = 2)
}
f <- function(t) dnorm(t, mean = mean(x), sd = sd(x))
# Useful function for histogram showing density overlay and rug
hist.panel = function(x, ...) {
par(new = TRUE)
hist(x, col = "light gray", probability = TRUE, axes = FALSE, main = "", breaks = "FD")
lines(density(x, na.rm = TRUE), col = "red", lwd = 1)
rug(x)
}
if (histogram) pairs(x, gap = 0, lower.panel = panel.smooth, upper.panel = panel.cor, diag.panel = hist.panel, ...)
else pairs(x, gap = 0, lower.panel = panel.smooth, upper.panel = panel.cor, ...)
}
# Did you make it this far?
message("\n******************************\nSuccessfully loaded Rprofile.r\n******************************")

Friday, August 5, 2011

Friday Links: R, OpenHelix Bioinformatics Tips, 23andMe, Perl, Python, Next-Gen Sequencing

I haven't posted much here recently, but here is a roundup of a few of the links I've shared on Twitter (@genetics_blog) over the last two weeks.

Here is a nice tutorial on accessing high-throughput public data (from NCBI) using R and Bioconductor.

Cloudnumbers.com, a startup that allows you to run high-performance computing (HPC) applications in the cloud, now supports the previously mentioned R IDE, RStudio.

23andMe announced a project to enroll 10,000 African-Americans for research by giving participants their personal genome service for free. You can read about it here at 23andMe or here at Genetic Future.

Speaking of 23andMe, they emailed me a coupon code (8WR9U9) for getting $50 off their personal genome service, making it $49 instead of $99. Not sure how long it will last.

I previously ran a poll which showed that most of you use Mendeley to manage your references. Mendeley recently released version 1.0, which includes some nice features like duplicate detection, better library organization (subfolders!), and a better file organization tool. You can download it here.

An interesting blog post by Michael Barton on how training and experience in bioinformatics leads to a wide set of transferable skills.

Dienekes releases a free DIY admixture program to analyze genomic ancestry.

A few tips from OpenHelix: the new SIB Bioinformatics Resource Portal, and testing correlation between SNPs and gene expression using SNPexp.

A nice animation describing a Circos plot from PacBio's E. coli paper in NEJM.

The Court of Appeals for the Federal Circuit reversed the lower court's invalidation of Myriad Genetics' patents on BRCA1/2, reinstating most of the claims in full force. Thoughtful analysis from Dan Vorhaus here.

Using the Linux shell and Perl to delete files in the current directory that don't contain the right number of lines: if you want to get rid of all .txt files in the current directory that don't have exactly 42 lines, run this at the command line (*be very careful with this one!*): for f in *.txt; do perl -ne 'END{unlink $ARGV unless $.==42}' ${f}; done

The previously mentioned Hitchhiker's Guide to Next-Generation Sequencing by Gabe Rudy at Golden Helix is now available in PDF format here. You can also find the related post describing all the various file formats used in NGS in PDF format here.

The Washington Post ran an article about the Khan Academy (http://www.khanacademy.org/), which has thousands of free video lectures, mostly on math. There are also a few computer science lectures that teach Python programming. (Salman Khan also appeared on the Colbert Report a few months ago).

Finally, I stumbled across this old question on BioStar with lots of answers about methods for short read mapping with next-generation sequencing data.

...

And here are a few interesting papers I shared:

Nature Biotechnology: Structural variation in two human genomes mapped at single-nucleotide resolution by whole genome de novo assembly

PLoS Genetics: Gene-Based Tests of Association

PLoS Genetics: Fine Mapping of Five Loci Associated with Low-Density Lipoprotein Cholesterol Detects Variants That Double the Explained Heritability

Nature Reviews Genetics: Systems-biology approaches for predicting genomic evolution


Genome Research: A comprehensively molecular haplotype-resolved genome of a European individual (paper about the importance of phase in genetic studies)

Nature Reviews Microbiology: Unravelling the effects of the environment and host genotype on the gut microbiome.