For a few years now, my EvoSTAR colleague, Bill Langdon, has been exploring the degree to which Mycoplasma bacteria have contaminated experimental systems and even "infected" online databases with the contents of their genomes. He and his colleagues have previously shown that Mycoplasma genome sequences have been mislabeled as human sequences in several online resources (GenBank, dbEST, and RefSeq).
Early microarray designs were based largely on ESTs from these resources, and as a result, the Affymetrix HG-U133 Plus 2.0 array contains probes for Mycoplasma sequences. Details for these probes can be found here. Exploiting these probes, Bill and colleagues have also examined the Gene Expression Omnibus for evidence of Mycoplasma contamination, and found around 30 studies (roughly 1% of GEO) that show high expression for these probes, the vast majority of which were from cell cultures.
Given their proclivity to infect human experimental cell lines, Bill has playfully described Mycoplasma as having evolved the ability to transmit their genes into online databases.
Continuing this pursuit, Bill recently published an article in BMC BioData Mining illustrating Mycoplasma contamination of the 1000 Genomes Project. It is unclear what the implications of this contamination are for the integrity of 1000 Genomes data, as the majority of identified Mycoplasma reads do not map to the human reference genome. This work should, however, serve as a bellwether to those performing experiments on, or using experimental data from, treated cell lines. In these situations, any contamination might severely taint experimental results.
Tuesday, May 6, 2014
Monday, June 11, 2012
The HaploReg Database for Functional Annotation of SNPs
The ENCODE project continues to generate massive numbers of data points on how genes are regulated. This data will be of incredible use for understanding the role of genetic variation, both in altering low-level cellular phenotypes (like gene expression or splicing) and in complex disease phenotypes. While it is all deposited into the UCSC browser, ENCODE data is not always the easiest to access or manipulate.
To make epigenomic tracks from the ENCODE project more accessible for interpretation in the context of new or existing GWAS hits, Luke Ward and Manolis Kellis at the Broad Institute have developed a database called HaploReg. HaploReg uses LD and SNP information from the 1000 Genomes Project to map known genetic variants onto ENCODE data, providing a potential mechanism for SNP influence. HaploReg annotates SNPs with evolutionary constraint measures, predicted chromatin states, and the effects of SNPs on the position weight matrices of known transcription factors.
Here's a screenshot for a SNP associated with HDL cholesterol levels, showing summary information for several SNPs in LD at r² > 0.9 in CEU. Clicking each SNP link provides further info.
In addition to providing annotations of user-submitted SNPs, HaploReg also provides cross-references from the NHGRI GWAS Catalog, allowing users to explore the mechanisms behind disease-associated SNPs. Check out the site here: http://www.broadinstitute.org/mammals/haploreg/haploreg.php and explore the functionality of any SNPs you might find associated in your work. The more functional information we can include in our manuscripts, the more likely our findings are to be tested in a model system.
HaploReg: Functional Annotation of SNPs
Tags:
1000 genomes,
Bioinformatics,
Databases,
ENCODE
Tuesday, July 12, 2011
Download 69 Complete Human Genomes
Sequencing company Complete Genomics recently made available 69 ethnically diverse complete human genome sequences: a Yoruba trio; a Puerto Rican trio; a 17-member, 3-generation pedigree; and a diversity panel representing 9 different populations. Some of the samples overlap with HapMap and the 1000 Genomes Project. The data can be downloaded directly from the FTP site. See the link below for more details on the directory contents, and have a look at the quick start guide to working with Complete Genomics data.
Complete Genomics - Sample Human Genome Sequence Data
Tags:
1000 genomes,
Sequencing
Wednesday, May 4, 2011
PLINK/SEQ for Analyzing Large-Scale Genome Sequencing Data
PLINK/SEQ is an open source C/C++ library for analyzing large-scale genome sequencing data. The library can be accessed via the pseq command line tool, or through an R interface. The project is developed independently of PLINK, but its syntax will be familiar to PLINK users.
PLINK/SEQ boasts an impressive feature set for a project still in the beta testing phase. It supports several data types (multiallelic, phased, imputation probabilities, indels, and structural variants), and can handle datasets much larger than what can fit into memory. PLINK/SEQ also comes bundled with several reference databases of gene transcripts and sequence & variation projects, including dbSNP and 1000 Genomes Project data.
As with PLINK, the documentation is good, and there's a tutorial using 1000 Genomes Project data.
PLINK/SEQ - A library for the analysis of genetic variation data
Tags:
1000 genomes,
Bioinformatics,
GWAS,
R,
Sequencing,
Software
Thursday, April 21, 2011
How To Get Allele Frequencies and Create a PED file from 1000 Genomes Data
I recently analyzed some next-generation sequencing data, and I first wanted to compare the allele frequencies in my samples to those in the 1000 Genomes Project. It turns out this is much easier than I thought, as long as you're a little comfortable with the Linux command line.
First, you'll need a Linux system, and two utilities: tabix and vcftools.
I'm virtualizing an Ubuntu Linux system in VirtualBox on my Windows 7 machine. I had a little trouble compiling vcftools on my Ubuntu system out of the box. Before trying to compile tabix and vcftools, I'd recommend installing the GNU C++ compiler and the development version of a compression library, zlib1g-dev. This is easy in Ubuntu. Just enter these commands at the terminal:
sudo apt-get install g++
sudo apt-get install zlib1g-dev
First, download tabix. I'm giving you the direct link to the most current version as of this writing, but you might go to the respective sourceforge pages to get the most recent version yourself. Use tar to unpack the download, go into the unzipped directory, and type "make" to compile the executable.
wget http://sourceforge.net/projects/samtools/files/tabix/tabix-0.2.3.tar.bz2
tar -jxvf tabix-0.2.3.tar.bz2
cd tabix-0.2.3/
make
Now do the same thing for vcftools:
wget http://sourceforge.net/projects/vcftools/files/vcftools_v0.1.4a.tar.gz
tar -zxvf vcftools_v0.1.4a.tar.gz
cd vcftools_0.1.4a/
make
The vcftools binary will be in the cpp directory. Copy both the tabix and vcftools executables to wherever you want to run your analysis.
Let's say that you wanted to pull all the 1000 genomes data from the CETP gene on chromosome 16, compute allele frequencies, and drop a linkage format PED file so you can look at linkage disequilibrium using Haploview.
First, use tabix to hit the 1000 Genomes FTP site, pulling data from the 20100804 release for the CETP region (chr16:56,995,835-57,017,756), and save that output to a file called genotypes.vcf. Because tabix pulls only the sections you need rather than downloading the entire 1000 Genomes dataset, this is extremely fast. It should take around a minute, depending on your web connection and CPU speeds.
./tabix -fh ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20100804/ALL.2of4intersection.20100804.genotypes.vcf.gz 16:56995835-57017756 > genotypes.vcf
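Note that the region in the command above is written without the "chr" prefix and without thousands separators, unlike the browser-style coordinates quoted in the text. If you're copying coordinates out of a genome browser, a small helper like the one below can do the conversion. This is just a sketch: the function name normalize_region is my own invention, and it assumes the VCF names its chromosomes "16" rather than "chr16", as the command above implies.

```shell
# Convert a browser-style region (chr16:56,995,835-57,017,756)
# into the form used in the tabix command (16:56995835-57017756).
# normalize_region is a hypothetical helper name, not part of tabix.
normalize_region() {
    # Strip a leading "chr" and remove all thousands separators.
    printf '%s\n' "$1" | sed -e 's/^chr//' -e 's/,//g'
}

region=$(normalize_region "chr16:56,995,835-57,017,756")
echo "$region"
```

You could then pass "$region" as the last argument to tabix in place of the hand-typed coordinates.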
Not too difficult, right? Now use vcftools (which works a lot like plink) to compute allele frequencies. This should take less than one second.
./vcftools --vcf genotypes.vcf --freq --out allelefrequencies
Finally, use vcftools to create a linkage format PED and MAP file that you can use in PLINK or Haploview. This took about 30 seconds for me.
./vcftools --vcf genotypes.vcf --plink --out plinkformat
That's it. It looks like you can also dig around in the supporting directory on the FTP site and pull out genotypes for specific ethnic groups as well (EUR, AFR, and ASN HapMap populations).
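Once you have the frequency file, you'll probably want to filter it. Here's a rough sketch that keeps sites whose minor allele frequency is at or above a threshold. It assumes the --freq output (allelefrequencies.frq) has one header line, the columns CHROM, POS, N_ALLELES, N_CHR, and then one ALLELE:FREQ entry per allele; check this against your own vcftools version before relying on it. The toy file and the maf_filter name are made up for illustration.

```shell
# Toy stand-in for allelefrequencies.frq (column layout is an assumption).
frq=$(mktemp)
printf 'CHROM\tPOS\tN_ALLELES\tN_CHR\t{ALLELE:FREQ}\n' > "$frq"
printf '16\t56995900\t2\t120\tA:0.95\tG:0.05\n' >> "$frq"
printf '16\t56996200\t2\t120\tC:0.6\tT:0.4\n' >> "$frq"

# Print chromosome, position, and the smallest allele frequency
# for every site where that frequency is >= the given threshold.
# Usage: maf_filter FILE MIN_MAF   (maf_filter is a hypothetical helper)
maf_filter() {
    awk -v min="$2" 'NR > 1 {
        maf = 1
        # Columns 5 onward hold ALLELE:FREQ pairs; keep the smallest frequency.
        for (i = 5; i <= NF; i++) {
            split($i, a, ":")
            if (a[2] + 0 < maf) maf = a[2] + 0
        }
        if (maf >= min) print $1, $2, maf
    }' "$1"
}

maf_filter "$frq" 0.1
```

For multiallelic sites this treats the rarest allele as the "minor" one, which may or may not be what you want.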
Tags:
1000 genomes,
Linux,
PLINK,
Sequencing,
Software
Tuesday, November 23, 2010
Randomly Select Subsets of Individuals from a Binary Pedigree .fam File
I'm working on imputing GWAS data to the 1000 Genomes Project data using MaCH. For the model estimation phase you only need ~200 individuals. Here's a one-line unix command that will pull out 200 samples at random from a binary pedigree .fam file called myfamfile.fam:
for i in `cut -d ' ' -f 1-2 myfamfile.fam | sed s/\ /,/g`; do echo "$RANDOM $i"; done | sort | cut -d' ' -f 2| sed s/,/\ /g | head -n 200
Redirect this output to a file, and then run PLINK using the --keep option with this new file.
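If your system has GNU coreutils, shuf can do the same job with less typing. The sketch below is an alternative to the one-liner above, not a replacement for it; the toy .fam file, the temporary keep file, and the random_keep_list name are all invented for the example.

```shell
# Build a toy .fam file (FID IID PID MID SEX PHENO) with 10 individuals.
fam=$(mktemp)
for i in 1 2 3 4 5 6 7 8 9 10; do
    echo "FAM$i IND$i 0 0 1 -9"
done > "$fam"

# Pick N lines at random and keep only the family and individual IDs,
# which are the two columns the --keep option reads.
# Usage: random_keep_list FAMFILE N   (hypothetical helper name)
random_keep_list() {
    shuf -n "$2" "$1" | awk '{ print $1, $2 }'
}

keep=$(mktemp)
random_keep_list "$fam" 4 > "$keep"
cat "$keep"
```

With real data you would call something like random_keep_list myfamfile.fam 200 > keep.txt and then pass keep.txt to PLINK's --keep option.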
Tags:
1000 genomes,
Imputation
Thursday, November 11, 2010
Split up a GWAS dataset (PED/BED) by Chromosome
As I mentioned in my recap of the ASHG 1000 genomes tutorial, I'm going to be imputing some of my own data to 1000 genomes, and I'll try to post lessons learned along the way here under the 1000 genomes and imputation tags.
I'm starting from a binary pedigree format file (plink's bed/bim/fam format) and the first thing in the 1000 genomes imputation cookbook is to store your data in Merlin format, one per chromosome. Surprisingly there is no option in PLINK to split up a dataset into separate files by chromosome, so I wrote a Perl script to do it myself. The script takes two arguments: 1. the base filename of the binary pedfile (if your files are data.bed, data.bim, data.fam, the base filename will be "data" without the quotes); 2. a base filename for the output files to be split up by chromosome. You'll need PLINK installed for this to work, and I've only tested this on a Unix machine. You can copy the source code below:
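For readers who just want the general idea, here is a minimal shell sketch of the same approach. To be clear, this is not the Perl script described above: it simply loops over the chromosome codes found in the .bim file and calls PLINK once per chromosome with --chr and --recode, which writes PED/MAP files that would still need converting to Merlin format. The function name split_by_chr and the PLINK variable are my own assumptions.

```shell
# PLINK executable; override (e.g. PLINK=echo) to preview the commands
# without running them.
PLINK=${PLINK:-plink}

# Usage: split_by_chr BASE OUT
#   BASE  base filename of the binary pedfile (BASE.bed/.bim/.fam)
#   OUT   base filename for the per-chromosome output files
split_by_chr() {
    base=$1
    out=$2
    # Column 1 of the .bim file holds the chromosome code of each marker.
    for chr in $(awk '{ print $1 }' "$base.bim" | sort -u); do
        "$PLINK" --bfile "$base" --chr "$chr" --recode --out "$out.chr$chr"
    done
}
```

Calling split_by_chr data data_split would produce data_split.chr1.ped, data_split.chr1.map, and so on, one pair per chromosome present in data.bim.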
Tags:
1000 genomes,
Imputation
Tuesday, November 9, 2010
Video and slides from ASHG 1000 Genomes tutorials
If you missed the tutorial on the 1000 Genomes Project data last week at ASHG, you can now watch the tutorials on YouTube and download the slides online at http://genome.gov/27542240. Here's a recap of the speakers and topics:
Introduction -- Gil McVean, Ph.D.
Description of the 1000 Genomes Data -- Gabor Marth, D.Sc.
How to Access the Data -- Steve Sherry, Ph.D.
How to Use the Browser -- Paul Flicek, Ph.D.
Structural Variants -- Jan Korbel, Ph.D.
How to Use the Data in Disease Studies -- Jeffrey Barrett, Ph.D.
Visit http://genome.gov/27542240 for links to all the videos and slides. I found Jeff Barrett's overview of using the 1000 genomes data for imputation particularly helpful. Also, don't forget about Goncalo Abecasis's 1000 genomes imputation cookbook, which gives a little more detailed information about formatting, parallelizing code, etc. I'm going to be trying this myself soon, and I'll post tips along the way.
Tags:
1000 genomes,
Tutorials
Thursday, October 14, 2010
Tutorial on the 1000 Genomes Project Data
There will be a (free) tutorial on the 1000 genomes project at this year's ASHG meeting on Wednesday, November 3, 7:00 – 9:30pm. You can register online at the link below. The tutorial will describe the 1000 genomes data, how to access it, and what to do with it. Specifically, the speakers and topics covered are:
1. Introduction
2. Description of the 1000 Genomes data -- Gabor Marth
3. How to access the data -- Steve Sherry
4. How to use the browser -- Paul Flicek
5. Structural variants -- Jan Korbel
6. How to use the data in disease studies -- Jeff Barrett
7. Q&A
Online registration for 1000 genomes tutorial
Hopefully I'll see some of you there. I'm not sure if imputation is covered in this tutorial. If not, I will cover it here in a future post. I'll soon be using Goncalo Abecasis's 1000 Genomes Imputation Cookbook to impute my own data to the 1kG SNPs, and I'll share any tips I discover along the way.
Tags:
1000 genomes,
Announcements,
Tutorials

