Getting Genetics Done: 1000 genomes


Tuesday, May 6, 2014

Mycoplasma Contamination in Cell-Line Based Experiments

For a few years now, my EvoSTAR colleague Bill Langdon has been exploring the degree to which Mycoplasma bacteria have contaminated experimental systems and even "infected" online databases with the contents of their genomes. He and his colleagues have previously shown that Mycoplasma genome sequences have been mislabeled as human sequences in several online resources (GenBank, dbEST, and RefSeq).

Early microarray designs were based largely on ESTs from these resources, and as a result, the Affymetrix HG-U133 plus 2.0 array contains probes for Mycoplasma sequences. Details for these probes can be found here. Exploiting these probes, Bill and colleagues have also examined the Gene Expression Omnibus for evidence of Mycoplasma contamination, and found around 30 studies (roughly 1% of GEO) that show high expression for these probes, the vast majority of which were from cell cultures.

Because Mycoplasma so readily infect human experimental cell lines, Bill has playfully described them as having evolved the ability to transmit their genes into online databases.

Continuing this pursuit, Bill recently published an article in BMC BioData Mining illustrating Mycoplasma contamination of the 1000 Genomes Project. It is unclear what the implications of this contamination are for the integrity of 1000 Genomes data, as the majority of identified Mycoplasma reads do not map to the human reference genome. This work should, however, serve as a bellwether to those performing experiments on, or using experimental data from, treated cell lines. In these situations, any contamination might severely taint experimental results.

Monday, June 11, 2012

The HaploREG Database for Functional Annotation of SNPs

The ENCODE project continues to generate massive numbers of data points on how genes are regulated. This data will be of incredible use for understanding the role of genetic variation, both in altering low-level cellular phenotypes (like gene expression or splicing) and in complex disease phenotypes. While it is all deposited into the UCSC browser, ENCODE data is not always the easiest to access or manipulate.

To make epigenomic tracks from the ENCODE project more accessible for interpretation in the context of new or existing GWAS hits, Luke Ward and Manolis Kellis at the Broad Institute have developed a database called HaploREG. HaploREG uses LD and SNP information from the 1000 Genomes project to map known genetic variants onto ENCODE data, providing a potential mechanism for SNP influence. HaploREG will annotate SNPs with evolutionary constraint measures, predicted chromatin states, and the effect of each SNP on the position weight matrices of known transcription factors.

Here's a screenshot for a SNP associated with HDL cholesterol levels, showing summary information for several SNPs in LD at r² > 0.9 in CEU. Clicking each SNP link provides further info.

In addition to providing annotations of user-submitted SNPs, HaploREG also provides cross-references from the NHGRI GWAS Catalog, allowing users to explore the mechanisms behind disease-associated SNPs. Check out the site here: http://www.broadinstitute.org/mammals/haploreg/haploreg.php and explore the functionality of any SNPs you might find associated in your work. The more functional information we can include in our manuscripts, the more likely our associations are to be tested in a model system.

HaploReg: Functional Annotation of SNPs

Tuesday, July 12, 2011

Download 69 Complete Human Genomes

Sequencing company Complete Genomics recently made available 69 ethnically diverse complete human genome sequences: a Yoruba trio; a Puerto Rican trio; a 17-member, 3-generation pedigree; and a diversity panel representing 9 different populations. Some of the samples partially overlap with HapMap and the 1000 Genomes Project. The data can be downloaded directly from the FTP site. See the link below for more details on the directory contents, and have a look at the quick start guide to working with Complete Genomics data.

Complete Genomics - Sample Human Genome Sequence Data

Wednesday, May 4, 2011

PLINK/SEQ for Analyzing Large-Scale Genome Sequencing Data

PLINK/SEQ is an open source C/C++ library for analyzing large-scale genome sequencing data. The library can be accessed via the pseq command line tool, or through an R interface. The project is developed independently of PLINK, but its syntax will be familiar to PLINK users.

PLINK/SEQ boasts an impressive feature set for a project still in the beta testing phase. It supports several data types (multiallelic, phased, imputation probabilities, indels, and structural variants), and can handle datasets much larger than what can fit into memory. PLINK/SEQ also comes bundled with several reference databases of gene transcripts and sequence & variation projects, including dbSNP and 1000 Genomes Project data.

As with PLINK, the documentation is good, and there's a tutorial using 1000 Genomes Project data.

PLINK/SEQ - A library for the analysis of genetic variation data

Thursday, April 21, 2011

How To Get Allele Frequencies and Create a PED file from 1000 Genomes Data

I recently analyzed some next-generation sequencing data, and I first wanted to compare the allele frequencies in my samples to those in the 1000 Genomes Project. It turns out this is much easier than I thought, as long as you're a little comfortable with the Linux command line.

First, you'll need a Linux system, and two utilities: tabix and vcftools.

I'm virtualizing an Ubuntu Linux system in VirtualBox on my Windows 7 machine. I had a little trouble compiling vcftools on my Ubuntu system out of the box. Before trying to compile tabix and vcftools, I'd recommend installing the GNU C++ compiler and the development version of the zlib compression library, zlib1g-dev. This is easy in Ubuntu. Just enter these commands at the terminal:

sudo apt-get install g++

sudo apt-get install zlib1g-dev

Next, download tabix. I'm giving you the direct link to the most current version as of this writing, but you might go to the respective SourceForge pages to get the most recent version yourself. Use tar to unpack the download, go into the unpacked directory, and type "make" to compile the executable.

wget http://sourceforge.net/projects/samtools/files/tabix/tabix-0.2.3.tar.bz2

tar -jxvf tabix-0.2.3.tar.bz2

cd tabix-0.2.3/ 

make

Now do the same thing for vcftools:

wget http://sourceforge.net/projects/vcftools/files/vcftools_v0.1.4a.tar.gz

tar -zxvf vcftools_v0.1.4a.tar.gz 

cd vcftools_0.1.4a/

make

The vcftools binary will be in the cpp directory. Copy both the tabix and vcftools executables to wherever you want to run your analysis.

Let's say that you wanted to pull all the 1000 genomes data from the CETP gene on chromosome 16, compute allele frequencies, and write out a linkage-format PED file so you can look at linkage disequilibrium using Haploview.

First, use tabix to hit the 1000 genomes FTP site, pulling data from the 20100804 release for the CETP region (chr16:56,995,835-57,017,756), and save that output to a file called genotypes.vcf. Because tabix doesn't download the entire 1000 Genomes data and pulls only the sections you need, this is extremely fast. This should take around a minute, depending on your web connection and CPU speeds.

./tabix -fh ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20100804/ALL.2of4intersection.20100804.genotypes.vcf.gz 16:56995835-57017756 > genotypes.vcf
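Note that the region passed to tabix above is written as 16:56995835-57017756, without the "chr" prefix or the thousands separators that genome browsers often show. If you copy coordinates from a browser, a small helper can do the cleanup. This is only a sketch; to_tabix_region is a made-up name, not part of tabix, and it assumes the VCF names its chromosomes without "chr", as the command above does.

```shell
# to_tabix_region: hypothetical helper (not part of tabix).
# Turns "chr16:56,995,835-57,017,756" into "16:56995835-57017756".
to_tabix_region() {
  # strip a leading "chr", then remove every comma
  printf '%s\n' "$1" | sed -e 's/^chr//' -e 's/,//g'
}

to_tabix_region "chr16:56,995,835-57,017,756"
```

The printed string can be pasted in as the region argument of the tabix command.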

Not too difficult, right? Now use vcftools (which works a lot like PLINK) to compute allele frequencies. This should take less than one second.

./vcftools --vcf genotypes.vcf --freq --out allelefrequencies
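To compare against your own samples you will probably want one minor allele frequency per site. The sketch below assumes the frequency file (allelefrequencies.frq) has a header line followed by rows of CHROM, POS, N_ALLELES, N_CHR and then one ALLELE:FREQ field per allele; check your own output before relying on it. maf_from_frq is a hypothetical helper name, and sites with more than two alleles are simply skipped.

```shell
# maf_from_frq: hypothetical helper. Assumes each data row looks like
#   CHROM  POS  N_ALLELES  N_CHR  ALLELE:FREQ  ALLELE:FREQ
# Prints CHROM, POS and the smaller of the two frequencies for biallelic sites.
maf_from_frq() {
  awk 'NR > 1 && $3 == 2 {
         split($5, a, ":")
         split($6, b, ":")
         # the "+ 0" forces a numeric comparison
         maf = (a[2] + 0 < b[2] + 0) ? a[2] : b[2]
         print $1, $2, maf
       }' "$1"
}

# Only run on the real file if it exists in the current directory.
if [ -f allelefrequencies.frq ]; then
  maf_from_frq allelefrequencies.frq > maf.txt
fi
```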

Finally, use vcftools to create a linkage format PED and MAP file that you can use in PLINK or Haploview. This took about 30 seconds for me.

./vcftools --vcf genotypes.vcf --plink --out plinkformat
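Before loading the files into PLINK or Haploview, it can be worth a quick sanity check. Assuming the standard linkage layout (six leading columns in the PED file, then two allele columns per marker, with one marker per line in the MAP file), every PED row should have 6 + 2 x (number of MAP lines) fields. check_ped_map below is a hypothetical helper sketching that check.

```shell
# check_ped_map: hypothetical helper. Reports PED rows whose field count is
# not 6 + 2 * (number of non-empty MAP lines); exits non-zero if any are found.
check_ped_map() (
  ped="$1"
  map="$2"
  n_markers=$(awk 'NF { n++ } END { print n + 0 }' "$map")
  awk -v want="$((6 + 2 * n_markers))" '
    NF && NF != want {
      bad++
      print "line " NR ": " NF " fields, expected " want
    }
    END { exit (bad > 0) }' "$ped"
)

# Only run on the real files if they exist in the current directory.
if [ -f plinkformat.ped ] && [ -f plinkformat.map ]; then
  check_ped_map plinkformat.ped plinkformat.map && echo "PED and MAP agree"
fi
```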

That's it. It looks like you can also dig around in the supporting directory on the FTP site and pull out genotypes for specific ethnic groups as well (EUR, AFR, and ASN HapMap populations).

Tuesday, November 23, 2010

Randomly Select Subsets of Individuals from a Binary Pedigree .fam File

I'm working on imputing GWAS data to the 1000 Genomes Project data using MaCH. For the model estimation phase you only need ~200 individuals. Here's a one-line unix command that will pull out 200 samples at random from a binary pedigree .fam file called myfamfile.fam:

for i in `cut -d ' ' -f 1-2  myfamfile.fam | sed s/\ /,/g`; do echo "$RANDOM $i"; done | sort |  cut -d' ' -f 2| sed s/,/\ /g | head -n 200

Redirect this output to a file, and then run PLINK using the --keep option with this new file.
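If GNU coreutils is installed, shuf can do the random sampling directly. The sketch below is an alternative to the one-liner above, not a replacement for it: random_keep_list is a hypothetical helper name, and keep.txt, mydata and subset are placeholder file names. Because awk splits on any whitespace, it also copes with tab-delimited .fam files.

```shell
# random_keep_list: hypothetical helper. Prints the family and individual IDs
# (columns 1 and 2) of n randomly chosen rows of a .fam file (default n = 200).
# Assumes GNU shuf is available.
random_keep_list() (
  fam="$1"
  n="${2:-200}"
  shuf -n "$n" "$fam" | awk '{ print $1, $2 }'
)

# Only run on the real file if it exists in the current directory.
if [ -f myfamfile.fam ]; then
  random_keep_list myfamfile.fam 200 > keep.txt
  # Then, with placeholder dataset names:
  #   plink --bfile mydata --keep keep.txt --make-bed --out subset
fi
```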

Thursday, November 11, 2010

Split up a GWAS dataset (PED/BED) by Chromosome

As I mentioned in my recap of the ASHG 1000 genomes tutorial, I'm going to be imputing some of my own data to 1000 genomes, and I'll try to post lessons learned along the way here under the 1000 genomes and imputation tags.

I'm starting from a binary pedigree format file (plink's bed/bim/fam format) and the first thing in the 1000 genomes imputation cookbook is to store your data in Merlin format, one per chromosome. Surprisingly there is no option in PLINK to split up a dataset into separate files by chromosome, so I wrote a Perl script to do it myself. The script takes two arguments: 1. the base filename of the binary pedfile (if your files are data.bed, data.bim, data.fam, the base filename will be "data" without the quotes); 2. a base filename for the output files to be split up by chromosome. You'll need PLINK installed for this to work, and I've only tested this on a Unix machine. You can copy the source code below:
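For readers who just want the general idea, here is a rough shell sketch of the same approach. It is not the Perl script described above: split_by_chr is a hypothetical function name, and the PLINK variable (defaulting to plink) is an assumption added so the binary can be swapped out. It loops over the chromosome codes in the .bim file and calls PLINK once per chromosome with --chr. Note that --recode writes PED/MAP files, so getting all the way to Merlin format would still need a further conversion step.

```shell
# split_by_chr: hypothetical helper, NOT the original Perl script.
#   $1 = base name of the binary fileset (data.bed / data.bim / data.fam -> "data")
#   $2 = base name for the per-chromosome output files
# Set PLINK to the path of your plink binary if it is not on your PATH.
split_by_chr() (
  base="$1"
  out="$2"
  plink_bin="${PLINK:-plink}"
  # column 1 of the .bim file holds the chromosome code of each marker
  for chr in $(awk '{ print $1 }' "$base.bim" | sort -u); do
    "$plink_bin" --bfile "$base" --chr "$chr" --recode --out "$out.chr$chr"
  done
)

# Example call (placeholder names):
#   split_by_chr data data_split
```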

Tuesday, November 9, 2010

Video and slides from ASHG 1000 Genomes tutorials

If you missed the tutorial on the 1000 genomes project data last week at ASHG, you can now watch the tutorials on YouTube and download the slides online at http://genome.gov/27542240. Here's a recap of the speakers and topics:

Introduction -- Gil McVean, Ph.D., Professor of Statistical Genetics, University of Oxford
Description of the 1000 Genomes Data -- Gabor Marth, D.Sc., Associate Professor of Biology, Boston College
How to Access the Data -- Steve Sherry, Ph.D., National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Md.
How to Use the Browser -- Paul Flicek, Ph.D., European Molecular Biology Laboratory, Vertebrate Genomics Team, European Bioinformatics Institute (EBI)
Structural Variants -- Jan Korbel, Ph.D., Group Leader, Genome Biology Research Unit, Joint Appointment with EMBL-EBI, European Molecular Biology Laboratory (Heidelberg, Germany)
How to Use the Data in Disease Studies -- Jeffrey Barrett, Ph.D., Team Leader, Statistical and Computational Genetics, Wellcome Trust Sanger Institute, Hinxton, United Kingdom

Visit http://genome.gov/27542240 for links to all the videos and slides. I found Jeff Barrett's overview of using the 1000 genomes data for imputation particularly helpful. Also, don't forget about Goncalo Abecasis's 1000 genomes imputation cookbook, which gives a little more detailed information about formatting, parallelizing code, etc. I'm going to be trying this myself soon, and I'll post tips along the way.

Thursday, October 14, 2010

Tutorial on the 1000 Genomes Project Data

There will be a (free) tutorial on the 1000 genomes project at this year's ASHG meeting on Wednesday, November 3, 7:00 – 9:30pm. You can register online at the link below. The tutorial will describe the 1000 genomes data, how to access it, and what to do with it. Specifically, the speakers and topics covered are:

1. Introduction
2. Description of the 1000 Genomes data -- Gabor Marth
3. How to access the data -- Steve Sherry
4. How to use the browser -- Paul Flicek
5. Structural variants -- Jan Korbel
6. How to use the data in disease studies -- Jeff Barrett
7. Q&A

Online registration for 1000 genomes tutorial

Hopefully I'll see some of you there. I'm not sure if imputation is covered in this tutorial. If not, I will cover it here in a future post. I'll soon be using Goncalo Abecasis's 1000 Genomes Imputation Cookbook to impute my own data to the 1kG SNPs, and I'll share any tips I discover along the way.
