
Wednesday, November 20, 2013

Using Database Joins to Compare Results Sets

One of the most powerful tools you can learn to use in genomics research is a relational database system, such as MySQL.  These systems are fairly easy to set up and use, and provide users the ability to organize and manipulate data and statistical results with simple commands.  As a graduate student (during the height of GWAS), I found that this single skill quickly turned me into an “expert”.  By storing the SNP lists for common GWAS platforms and some simple annotations from the UCSC and ENSEMBL databases, I could quickly provide lists of SNPs within a gene or collection of genes, or pull a list of SNPs that overlap two genotyping platforms.  We even developed database modules that allowed us to easily define LD blocks within a database query (called LD-Spline).

Once you learn the basics of defining tables and loading data, you can start to join tables together, matching them on a common field.  This is where the true power of a database system lies.  Suppose you have two sets of results from a PLINK analysis, one from a discovery dataset and another from a replication.  Rather than clumsily matching two sets of results within a spreadsheet application, a few simple queries within MySQL will tell you which SNPs are in common between the two sets, which were not found in the replication set, which SNPs were significant in the first set but not the second, etc.  

The concept that makes these operations work is the idea of a primary key.  A primary key is some field of a dataset that uniquely identifies each row of the table/dataset.  In the above example of PLINK results, a good primary key might be the RS number of the SNP.  You can also uniquely identify rows based on two columns, a concept known as a composite key – for example, the chromosome AND position of a SNP.  Establishing a primary key allows MySQL to keep data stored in a sorted order and allows the matching operations for table joins to be performed much faster. 

Having this sorted order from a primary key prevents MySQL from having to scan an entire table to find a specific value.  Much like the index of a book, a primary key lets MySQL find a value within a table very quickly.  If a table is small, having a primary key is not as critical; the computer can quickly scan the entire contents of the table for any query.  If the table is large, however, a full scan of the entire table could be a costly operation, and the number of table scans required increases when doing a join.  For example, if we join tables for our discovery and replication results sets, the database system will take the RS number for each entry from the discovery table and attempt to find a matching RS number in the replication table.  If the replication table has the RS number as a primary key, the database system can very quickly find this entry. There is a fantastic post on the various types of database joins here.

Let's start by creating our database tables.  A typical PLINK association output contains 12 columns (CHR, SNP, BP, A1, TEST, NMISS, OR, SE, L95, U95, STAT, P).  In these tables, we've established the SNP column as the primary key.  Recall that the primary key must uniquely identify each row of the table, but PLINK will sometimes report multiple TEST rows per SNP.  If this is the case, we may need to either establish a composite key using PRIMARY KEY (`snp`,`test`), or simply eliminate the extra rows from the data file using an AWK command.
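For example, here is a quick way to keep only the additive-model rows (a sketch, assuming the default column order listed above where TEST is the fifth column; the output filename is just illustrative):

awk 'NR == 1 || $5 == "ADD"' input-file.assoc.logistic > input-file.add.assoc.logistic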

CREATE TABLE `discovery` (
        `chr` varchar(2),
        `snp` varchar(32),
        `bp` int,
        `a1` varchar(1),
        `test` varchar(3),
        `nmiss` int,
        `or` float,
        `se` float,
        `l95` float,
        `u95` float,
        `stat` float,
        `p` float,
        PRIMARY KEY (`snp`)
);

CREATE TABLE `replication` (
        `chr` varchar(2),
        `snp` varchar(32),
        `bp` int,
        `a1` varchar(1),
        `test` varchar(3),
        `nmiss` int,
        `or` float,
        `se` float,
        `l95` float,
        `u95` float,
        `stat` float,
        `p` float,
        PRIMARY KEY (`snp`)
);

Before loading our data into these tables, a little pre-processing is helpful.  To ensure that results are easy to read on the screen, the PLINK developers used leading spaces in the column format for many PLINK outputs.  These spaces make loading the results into a database difficult.  We can resolve this by running a simple SED command:
sed -r -e 's/\s+/\t/g' -e 's/^\t//g' input-file.assoc.logistic > discovery.load
This converts runs of whitespace to single tabs, eliminates the leading spaces, and writes the results to a new file, discovery.load.  Now let's load this file into our table, and repeat the procedure for our replication file.

LOAD DATA LOCAL INFILE '{PathToFile}/discovery.load' INTO TABLE 
discovery FIELDS TERMINATED BY '\t' IGNORE 1 LINES;
Now we should have two MySQL database tables with the discovery and replication results sets loaded into them.  We can view their contents with a simple select statement.  Then, finally, we can join these two tables to easily compare the results from the discovery and replication analyses.


SELECT * FROM discovery INNER JOIN replication ON 
discovery.snp = replication.snp;
The syntax is simple: select a set of fields -- in this case all of them (represented by the *) -- from the first table (discovery), matching each row from this table to a row in the second table (replication) where the discovery SNP equals the replication SNP.  MySQL also supports a table alias which can make these queries a bit easier to write.  An alias is simply a label specified after a table name which can be used in the rest of the query in place of the full table name.  For example, in the query below, we use a for the discovery table and b for the replication table.

SELECT * FROM discovery a INNER JOIN replication b ON 
a.snp = b.snp;
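Variations on this join answer the other questions posed at the start.  Here are two sketches using the tables defined above: a LEFT JOIN that lists discovery SNPs with no match in the replication set, and a filtered join that lists SNPs significant in the discovery analysis but not in the replication (the 0.05 threshold is purely illustrative).

SELECT a.snp, a.p FROM discovery a LEFT JOIN replication b ON
a.snp = b.snp WHERE b.snp IS NULL;

SELECT a.snp, a.p AS discovery_p, b.p AS replication_p FROM discovery a
INNER JOIN replication b ON a.snp = b.snp
WHERE a.p < 0.05 AND b.p >= 0.05;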
With practice and additional data, join operations can be used to annotate results by gene or region, and to match these to results from other studies, such as the NHGRI GWAS catalog.


Wednesday, May 16, 2012

Stepping Outside My Open-Source Comfort Zone: A First Look at Golden Helix SVS

I'm a huge supporter of the Free and Open Source Software movement. I've written more about R than anything else on this blog, all the code I post here is free and open-source, and a while back I invited you to steal this blog under a cc-by-sa license.

Every now and then, however, something comes along that just might be worth paying for. As a director of a bioinformatics core with a very small staff, I spend a lot of time balancing costs like software licensing versus personnel/development time, so that I can continue to provide a fiscally sustainable high-quality service.

As you've likely noticed from my more recent blog/twitter posts, the core has been doing a lot of gene expression and RNA-seq work. But recently I had a client who wanted to run a fairly standard case-control GWAS analysis on a dataset from dbGaP. Since this isn't the focus of my core's service, I didn't want to invest the personnel time in deploying a GWAS analysis pipeline, downloading and compiling all the tools I would normally use if I were doing this routinely, and spending hours on forums trying to remember how to handle procedural issues, such as which options to specify when running EIGENSTRAT or KING or how to subset and LD-prune a binary PED file, or scientific issues, such as whether GWAS data should be LD-pruned at all before doing PCA.

Golden Helix

A year ago I wrote a post about the "Hitchhiker's Guide to Next-Gen Sequencing" by Gabe Rudy, a scientist at Golden Helix. After reading this and looking through other posts on their blog, I'm confident that these guys know what they're doing and it would be worth giving their product a try. Luckily, I had the opportunity to try out their SNP & Variation Suite (SVS) software (I believe you can also get a free trial on their website).

I'm not going to talk about the software itself - that's for a future post if the core continues to get more GWAS analysis projects. In summary - it was fairly painless to learn a new interface, import the data, do some basic QA/QC, run a PCA-adjusted logistic regression, and produce some nice visualizations. What I want to highlight here is the level of support and documentation you get with SVS.

Documentation

First, the documentation. At each step from data import through analysis and visualization there's a help button that opens up the manual at the page you need. This contextual manual not only gives operational details about where you click or which options to check, but also gives scientific explanations of why you might use certain procedures in certain scenarios. Here's a small excerpt of the context-specific help menu that appeared when I asked for help doing PCA.

What I really want to draw your attention to here is that even if you don't use SVS you can still view their manual online without registering, giving them your e-mail, or downloading any trialware. Think of this manual as an always up-to-date mega-review of GWAS - with it you can learn quite a bit about GWAS analysis, quality control, and statistics. For example, see this section on haplotype frequency estimation and the EM algorithm. The section on the mathematical motivation and implementation of the Eigenstrat PCA method explains the method perhaps better than the Eigenstrat paper and documentation itself. There are also lots of video tutorials that are helpful, even if you're not using SVS. This is a great resource, whether you're just trying to get a better grip on what PLINK is doing, or perhaps implementing some of these methods in your own software.

Support

Next, the support.  After installing SVS on both my Mac laptop and the Linux box where I do my heavy lifting, one of the product specialists at Golden Helix called me and walked me through every step of a GWAS analysis, from QC to analysis to visualization.  While analyzing the dbGaP data for my client I ran into both software-specific procedural issues as well as general scientific questions.  If you've ever asked a basic question on the R-help mailing list, you know you need some patience and a thick skin for all the RTFM responses you'll get.  I was able to call the fine folks at Golden Helix and get both my technical and scientific questions answered the same day.  There are lots of resources for getting your questions answered, such as SEQanswers, Biostar, Cross Validated, and StackOverflow to name a few, but getting a forum response two days later from "SeqGeek96815" doesn't compare to having a team of scientists, statisticians, programmers, and product specialists on the other end of a telephone whose job it is to answer your questions.

Final Thoughts

This isn't meant to be a wholesale endorsement of Golden Helix or any other particular software company - I only wanted to share my experience stepping outside my comfortable open-source world into the walled garden of commercially licensed software from a for-profit company (the walls on the SVS garden aren't that high in reality - you can import and export data in any format imaginable). One of the nice things about command-line based tools is that it's relatively easy to automate a simple (or at least well-documented) process with tools like Galaxy, Taverna, or even by wrapping them with perl or bash scripts. However, the types of data my clients are collecting and the kinds of questions they're asking are always a little new and different, which means I'm rarely doing the same exact analysis twice. Because of the level of documentation and support provided to me, I was able to learn a new interface to a set of familiar procedures and run an analysis very quickly, without spending hours on forums figuring out why a particular program is seg-faulting. Will I abandon open-source tools like PLINK for SVS, Tophat-Cufflinks for CLC Workbench, BWA for NovoAlign, or R for Stata? Not in a million years. I haven't talked to Golden Helix or some of the above-mentioned companies about pricing for their products, but if I can spend a few bucks and save the time it would take a full-time technician at $50k+/year to write a new short-read aligner or build a new SNP annotation database server, then I'll be able to provide a faster, higher-quality, fiscally sustainable service at a much lower price for the core's clients, which is all-important in a time when federal funding is increasingly hard to come by.

Thursday, April 21, 2011

How To Get Allele Frequencies and Create a PED file from 1000 Genomes Data

I recently analyzed some next-generation sequencing data, and I first wanted to compare the allele frequencies in my samples to those in the 1000 Genomes Project. It turns out this is much easier than I thought, as long as you're a little comfortable with the Linux command line.

First, you'll need a Linux system, and two utilities: tabix and vcftools.

I'm virtualizing an Ubuntu Linux system in VirtualBox on my Windows 7 machine. I had a little trouble compiling vcftools on my Ubuntu system out of the box. Before trying to compile tabix and vcftools, I'd recommend installing the GNU C++ compiler and the development version of the zlib compression library, zlib1g-dev. This is easy in Ubuntu. Just enter these commands at the terminal:

sudo apt-get install g++
sudo apt-get install zlib1g-dev

First, download tabix. I'm giving you the direct link to the most current version as of this writing, but you might go to the respective sourceforge pages to get the most recent version yourself. Use tar to unpack the download, go into the unzipped directory, and type "make" to compile the executable.

wget http://sourceforge.net/projects/samtools/files/tabix/tabix-0.2.3.tar.bz2
tar -jxvf tabix-0.2.3.tar.bz2
cd tabix-0.2.3/
make

Now do the same thing for vcftools:

wget http://sourceforge.net/projects/vcftools/files/vcftools_v0.1.4a.tar.gz
tar -zxvf vcftools_v0.1.4a.tar.gz 
cd vcftools_0.1.4a/
make

The vcftools binary will be in the cpp directory. Copy both the tabix and vcftools executables to wherever you want to run your analysis.

Let's say that you wanted to pull all the 1000 genomes data from the CETP gene on chromosome 16, compute allele frequencies, and drop a linkage format PED file so you can look at linkage disequilibrium using Haploview.

First, use tabix to hit the 1000 Genomes FTP site, pulling data from the 20100804 release for the CETP region (chr16:56,995,835-57,017,756), and save that output to a file called genotypes.vcf. Because tabix pulls only the regions you need rather than downloading the entire 1000 Genomes dataset, this is extremely fast. It should take around a minute, depending on your web connection and CPU speed.

./tabix -fh ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20100804/ALL.2of4intersection.20100804.genotypes.vcf.gz 16:56995835-57017756 > genotypes.vcf

Not too difficult, right? Now use vcftools (which works a lot like plink) to compute allele frequencies. This should take less than one second.

./vcftools --vcf genotypes.vcf --freq --out allelefrequencies

Finally, use vcftools to create a linkage format PED and MAP file that you can use in PLINK or Haploview. This took about 30 seconds for me.

./vcftools --vcf genotypes.vcf --plink --out plinkformat

That's it. It looks like you can also dig around in the supporting directory on the FTP site and pull out genotypes for specific ethnic groups (EUR, AFR, and ASN HapMap populations).
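For instance, if you put together a file of sample IDs for one of those populations (one ID per line; eur.samples.txt is just a hypothetical name here), vcftools can restrict the frequency calculation to those individuals with its --keep option:

./vcftools --vcf genotypes.vcf --keep eur.samples.txt --freq --out eur.allelefrequencies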

Tuesday, March 29, 2011

Prune GWAS data in R

Hansong Wang, our biostats professor here at the Hawaii Cancer Center, generously gave me some R code that goes through a SNP annotation file (i.e., a map file) and selects SNPs that are at least a certain specified distance apart. You might want to do this if you're picking a subset of SNPs for PCA, for instance. PLINK has an LD-pruning feature, but if you can't load your data into PLINK, this poor man's pruning based on physical distance (not LD) is a quick solution.

Provide the function with a data frame containing column names "chrom" and "position," where the SNPs are ordered by chromosome and position. By default the function selects SNPs that are at least 100kb apart, but you can change this with the optional second argument. The function returns the row indices corresponding to the SNPs you want to keep. Then simply subset your dataset, selecting only those row indices and all columns.
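Her original code isn't reproduced here, but a minimal sketch of a function doing what's described above might look like this (assuming, as above, that the data frame is already ordered by chromosome and position):

# Sketch: keep SNPs at least `mindist` bp apart within each chromosome.
# `map` must have columns "chrom" and "position", ordered by chrom and position.
prune.by.distance <- function(map, mindist = 1e5) {
  keep <- integer(0)
  for (chr in unique(map$chrom)) {
    idx <- which(map$chrom == chr)
    last.pos <- -Inf
    for (i in idx) {
      if (map$position[i] - last.pos >= mindist) {
        keep <- c(keep, i)              # keep this SNP...
        last.pos <- map$position[i]     # ...and measure distance from it
      }
    }
  }
  keep
}

# Then subset your dataset to the selected SNPs (rows), keeping all columns:
# pruned <- mydata[prune.by.distance(mapfile), ]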

Tuesday, November 16, 2010

Parallelize IBD estimation with PLINK

Obtaining the probability that zero, one, or two alleles are shared identical by descent (IBD) is useful for many reasons in a GWAS analysis. A while back I showed you how to visualize sample relatedness using R and ggplot2, which requires IBD estimates. PLINK's --genome option uses IBS and allele frequencies to infer IBD. While a recent article in Nature Reviews Genetics on IBD and IBS analysis demonstrates potentially superior approaches, PLINK's approach is definitely the easiest because of its superior data management capabilities. The problem with IBD inference is that while computation time is linear with respect to the number of SNPs, it grows quadratically (read: SLOW) with respect to the number of samples. With GWAS data on 10,000 samples, there are (10,000 choose 2) = 49,995,000 pairwise IBD estimates to compute. This can take quite some time to calculate on a single processor.

A developer in Will's lab, Justin Giles, wrote a Perl script which utilizes one of PLINK's advanced features, --genome-lists, which takes two files as arguments. You can read about this feature under the advanced hint section of the PLINK documentation on IBS clustering. Each of these files contains a list of family IDs and individual IDs of samples for whom you'd like to calculate IBD. In other words, you can break up the IBD calculations by groups of samples, instead of requiring a single process to do it all. The Perl script also takes the base filename of your binary pedfile and parses the .fam file to split up the list of individuals into small chunks. The size of these chunks is specified by the user. Assuming you have access to a high-performance computing cluster using Torque/PBS for scheduling and job submission, the script also writes out PBS files that can be used to submit each of the segments to a node on the cluster (although this can easily be modified to fit other parallelization frameworks). The script also needs all the minor allele frequencies (which can easily be obtained with the --freq option in PLINK).

One of the first things the script does is parse and split up your .fam file into chunks of N individuals (where N is set by the user - I used 100, and estimation only took ~15 minutes). This can be accomplished by a simple gawk command:

gawk '{print $1,$2}' data.fam | split -d -a 3 -l 100 - tmp.list

Then the script sets up some PBS scripts (like shell scripts) to run PLINK commands:
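The generated scripts aren't reproduced here, but each PBS job essentially wraps a single PLINK command comparing one chunk of samples against another, along these lines (a sketch; the file names, paths, and resource requests are illustrative):

#!/bin/bash
#PBS -l nodes=1:ppn=1
#PBS -N ibd.chunk0.chunk1
cd /scratch/username/ibd
plink --bfile data --read-freq data.frq \
      --genome --genome-lists tmp.list000 tmp.list001 \
      --out data.sub.1.2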



At which point you'll have the output files...

data.sub.1.1.genome
data.sub.1.2.genome
data.sub.1.3.genome
data.sub.1.4.genome
data.sub.2.2.genome
data.sub.2.3.genome
data.sub.2.4.genome
data.sub.3.3.genome
data.sub.3.4.genome
data.sub.4.4.genome


...that you can easily concatenate.

Here's the perl script below. To run it, give it the full path to your binary pedfile, the number of individuals in each "chunk" to infer IBD between, and the fully qualified path to your .frq file that you get from running plink --freq. If you're not using PBS to submit jobs, you'll have to modify the code a little bit in the main print statement in the middle. If you're not running this in your /scratch/username/ibd directory, you'll want to change that on line 57. You'll also want to change your email address on line 38 if you want to receive emails from your scheduler if you use PBS.



After you submit all these jobs, you can very easily run these commands to concatenate the results and clean up the temporary files:

cat data.sub.*genome > results.genome
rm tmp.list*
rm data.sub.*

Tuesday, August 3, 2010

Convert PLINK output to tab or comma delimited CSV using Perl

Last week Will showed you a bash script version of a sed command covered here a while back that would convert PLINK output from the default variable space-delimited format to a more database-loading-friendly tab or comma delimited file. A commenter asked how to do this on Windows, so I'll share the way I do this using a Perl script which you can use on Windows after installing ActivePerl. First copy the code below and save the file as cleanplink.pl somewhere in your path.

#!/usr/bin/perl

# cleanplink.pl
# (c) Stephen D. Turner 2010 http://www.stephenturner.us/
# This is free open-source software.
# See http://gettinggeneticsdone.blogspot.com/p/copyright.html

my $help = "\nUsage: $0 <input whitespace file> <tab or comma>\n\n";
die $help if @ARGV<2;

$delimiter=pop(@ARGV);
die $help unless ($delimiter=~/tab/i || $delimiter=~/comma/i);
@inputfiles=@ARGV;

if ($delimiter =~ /comma/i) {
    foreach (@inputfiles) {

        open (IN,"<$_");
        open (OUT,">$_.csv");
        while (<IN>) {
            chomp;
            $_ =~ s/^\s+//;  #Trim whitespace at beginning
            $_ =~ s/\s+$//;  #Trim whitespace at end
            $_ =~ s/\s+/,/g; #Remaining whitespace into commas
            #$_ =~ s/NA/-9/g;#If you want to recode NA as -9
            print OUT "$_\n";
        }
    }
} elsif ($delimiter =~ /tab/i) {
    foreach (@inputfiles) {
        open (IN,"<$_");
        open (OUT,">$_.tab");
        while (<IN>) {
            chomp;
            $_ =~ s/^\s+//;  #Trim whitespace at beginning
            $_ =~ s/\s+$//;  #Trim whitespace at end
            $_ =~ s/\s+/\t/g; #Remaining whitespace into tabs
            #$_ =~ s/NA/-9/g;#If you want to recode NA as -9
            print OUT "$_\n";
        }
    }
} else {
    die $help;
}
Run the program with the first argument(s) as the PLINK output file(s) you want to convert, and the last argument as either "comma" or "tab" (without the quotes). It'll create another file in the current directory ending with either .csv or .tab. Look below to see cleanplink.pl in action.

turnersd@provolone:~/tmp$ ls
plink.qassoc
turnersd@provolone:~/tmp$ cat plink.qassoc
 CHR         SNP         BP    NMISS       BETA         SE         R2        T            P
   1   rs3094315     742429     3643    -0.2461     0.2703  0.0002275  -0.9102       0.3628
   1  rs12562034     758311     3644    -0.1806     0.3315  8.149e-05  -0.5448       0.5859
   1   rs3934834     995669     3641    0.04591     0.2822  7.271e-06   0.1627       0.8708
   1   rs9442372    1008567     3645     0.1032     0.2063  6.868e-05   0.5002       0.6169
   1   rs3737728    1011278     3644     0.1496     0.2268  0.0001195   0.6598       0.5094
   1   rs6687776    1020428     3645    -0.5378     0.2818   0.000999   -1.909      0.05639
   1   rs9651273    1021403     3643     0.2002     0.2264  0.0002149   0.8847       0.3764
   1   rs4970405    1038818     3645    -0.4994     0.3404  0.0005903   -1.467       0.1425
   1  rs12726255    1039813     3645    -0.4515     0.2956  0.0006398   -1.527       0.1268
turnersd@provolone:~/tmp$ cleanplink.pl plink.qassoc comma
turnersd@provolone:~/tmp$ ls
plink.qassoc  plink.qassoc.csv
turnersd@provolone:~/tmp$ cat plink.qassoc.csv
CHR,SNP,BP,NMISS,BETA,SE,R2,T,P
1,rs3094315,742429,3643,-0.2461,0.2703,0.0002275,-0.9102,0.3628
1,rs12562034,758311,3644,-0.1806,0.3315,8.149e-05,-0.5448,0.5859
1,rs3934834,995669,3641,0.04591,0.2822,7.271e-06,0.1627,0.8708
1,rs9442372,1008567,3645,0.1032,0.2063,6.868e-05,0.5002,0.6169
1,rs3737728,1011278,3644,0.1496,0.2268,0.0001195,0.6598,0.5094
1,rs6687776,1020428,3645,-0.5378,0.2818,0.000999,-1.909,0.05639
1,rs9651273,1021403,3643,0.2002,0.2264,0.0002149,0.8847,0.3764
1,rs4970405,1038818,3645,-0.4994,0.3404,0.0005903,-1.467,0.1425
1,rs12726255,1039813,3645,-0.4515,0.2956,0.0006398,-1.527,0.1268
turnersd@provolone:~/tmp$ cleanplink.pl plink.qassoc tab
turnersd@provolone:~/tmp$ ls
plink.qassoc  plink.qassoc.csv  plink.qassoc.tab
turnersd@provolone:~/tmp$ cat plink.qassoc.tab
CHR     SNP     BP      NMISS   BETA    SE      R2      T       P
1       rs3094315       742429  3643    -0.2461 0.2703  0.0002275       -0.9102 0.3628
1       rs12562034      758311  3644    -0.1806 0.3315  8.149e-05       -0.5448 0.5859
1       rs3934834       995669  3641    0.04591 0.2822  7.271e-06       0.1627  0.8708
1       rs9442372       1008567 3645    0.1032  0.2063  6.868e-05       0.5002  0.6169
1       rs3737728       1011278 3644    0.1496  0.2268  0.0001195       0.6598  0.5094
1       rs6687776       1020428 3645    -0.5378 0.2818  0.000999        -1.909  0.05639
1       rs9651273       1021403 3643    0.2002  0.2264  0.0002149       0.8847  0.3764
1       rs4970405       1038818 3645    -0.4994 0.3404  0.0005903       -1.467  0.1425
1       rs12726255      1039813 3645    -0.4515 0.2956  0.0006398       -1.527  0.1268

Tuesday, July 6, 2010

Convert PLINK output to CSV Revisited

A while back, Stephen wrote a very nice post about converting PLINK output to a CSV file. If you are like me, you have used this a thousand times -- enough to get tired of typing lots of SED commands.

I just crafted a little BASH script that accomplishes the same effect with a single, easy-to-type command. Insert the following text into your .bashrc file. This file is generally hidden in your UNIX home directory (you can see it if you type 'ls -al').

This version converts the infile to a tab-delimited output.

function cleanplink
{
sed -r 's/\s+/\t/g' $1 | sed -r 's/^\t//g' | sed -r 's/NA/\\N/g' > $1.txt
}

And this version converts to a CSV file.


function cleanplink
{
sed -r 's/\s+/,/g' $1 | sed -r 's/^,//g' | sed -r 's/NA/\\N/g' > $1.csv
}


Both versions above also convert "NA" to a NULL value (\N) for easy loading into MySQL; if you'd rather keep the NAs, remove that bit, like this:

function cleanplink
{
sed -r 's/\s+/,/g' $1 | sed -r 's/^,//g' > $1.csv
}


You use this function as follows:

bush@queso:~$ cleanplink plinkresults.assoc

and it produces a file with the same name, but with a ".csv" or a ".txt" on the end.

Wednesday, June 9, 2010

Efficient Mixed-Model Association eXpedited (EMMAX) to Simultaneously Account for Relatedness and Stratification in Genome-Wide Association Studies

A few months ago I covered an algorithm called EMMA (Efficient Mixed-Model Association), implemented in R, that simultaneously corrects for both population stratification and relatedness in an association study. This method/software is very useful because most methods that account for relatedness in an association study assume a genetically (ethnically) homogeneous population, while methods that detect and correct for population stratification typically assume individuals are unrelated. The EMMA algorithm simultaneously accounts for both types of population structure by using a linear mixed model with an empirically estimated relatedness matrix to model the correlation between phenotypes of sample subjects.

The original EMMA algorithm, however, is computationally infeasible for datasets with thousands of individuals because the variance component parameters are estimated separately for each marker, which takes about 10 minutes per marker on the authors' large GWAS dataset and would take over 6 years to complete on a single processor. A new implementation of the algorithm called EMMAX (Efficient Mixed-Model Association eXpedited) makes the simplifying assumption that because the effect of any given SNP on the trait is typically small, the variance parameters only need to be estimated once for the entire dataset, rather than once for each marker.
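In rough notation (the standard variance-component form, not lifted from the paper), the model being fit for each SNP is

y = Xb + u + e,   with u ~ N(0, sigma_g^2 * K) and e ~ N(0, sigma_e^2 * I),

where K is the empirically estimated relatedness (kinship) matrix. EMMA re-estimates sigma_g^2 and sigma_e^2 for every marker, while EMMAX estimates them once under the null model and reuses them for every test.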

In the paper the authors take the Northern Finland Birth Cohort and estimate genomic control inflation factors (lambda) for uncorrected test statistics, for test statistics adjusted for the top 100 principal components using Eigenstrat, and for test statistics corrected for structure using the EMMAX algorithm, and find that the inflation factors were closest to 1 for the EMMAX-corrected tests. Further, whereas genomic control simply scales all test statistics downward without changing their ranks, the EMMAX method does change the ranks of the test statistics across SNPs.

A beta version of EMMAX is available online, with a complete version to be released soon. Conveniently, the software is able to take a PLINK transposed ped file and covariate files as input (tped and tfam documentation here).

Nature Genetics Technical Report - Variance component model to account for sample structure in genome-wide association studies

Wednesday, February 10, 2010

LocusZoom: Plot regional association results from GWAS

Update Friday, May 14, 2010: See this newer post on LocusZoom.



If you caught Cristen Willer's seminar here a few weeks ago you saw several beautiful figures in the style of a manhattan plot, but zoomed in around a region of interest, with several other useful information overlays.

Click to take a look at this plot below, showing the APOE region for an Alzheimer's Disease GWAS:

It's a simple plot of the -log10(p-values) for SNPs in a given region, but it also shows:

1. LD information (based on HapMap), shown by color-coded points (not much LD here).
2. Recombination rates (the blue line running through the plot). Peaks are hotspots.
3. Spatial orientation of the SNPs you plotted (running across the top).
4. Genes! The overlay along the bottom shows UCSC genes in the region.

You can very easily take a PLINK output file (or any other format) and make an image like this for any SNP, gene, or region of interest in your data using a tool called LocusZoom, developed by Cristen and others at Michigan.  LocusZoom is written in R with a Python wrapper and works from an easy-to-use web interface.

All the program needs is a list of SNP names and their associated P-values. If you're using PLINK, your *.assoc or *.qassoc files have this information, but first you'll have to run a quick command to format them. Run this command, which I discussed in a previous post, to convert your PLINK output into a comma-delimited CSV file (PLINK's default output is delimited by irregular whitespace):

cat plink.assoc | sed -r 's/^\s+//g' | sed -r 's/\s+/,/g' > plink.assoc.csv

Next, you'll want to compress this file so that it doesn't take forever to upload.

gzip plink.assoc.csv

Now, upload your new file (plink.assoc.csv.gz) on the LocusZoom website.  Tell it that your p-value column is named "P" and your marker column is named "SNP" (or whatever they're called if you're not using PLINK). Change the delimiter type to "comma", then put in a region of interest. I chose APOE, but you could also use a SNP name (include the "rs" before the number). Now hit "Plot your Data," and it should take about a minute.

There are some other options below, but I've had bad luck using any of them. For instance, I could never get it to output a PNG properly; only PDF worked the last time I tried it. I also could not successfully make a plot if I turned off the recombination rate overlay. I know this is a very early version, but hopefully they'll clean up some of the code and document some of its features very soon. I could see this being a very useful tool, especially once it's available for download for local use. (Update: some of these bugs have been fixed. See this newer post on LocusZoom).


Wednesday, January 20, 2010

GWAS Manhattan plots and QQ plots using ggplot2 in R

**** UPDATE, May 15 2014 *****
The functions described here have now been wrapped into an R package. View the updated blog post or see the online package vignette for how to install and use. If you'd still like to use the old code described here, you can access this at version 0.0.0 on GitHub. The code below likely won't work.
*****************************

Will posted earlier this week about how to produce manhattan plots of GWAS results using Stata, so I thought I'd share how I do this in R using ggplot2.

First, if you've never used ggplot2, you'll need to add it to your R installation by typing:

install.packages("ggplot2")

Once you've done that, copy and paste this command to download the functions I wrote to produce these plots.  If you'd like to see the source code yourself, copy the URL into your web browser.

source("http://dl.dropbox.com/u/66281/0_Permanent/qqman.r")

Next, read in the PLINK results file to a data frame. Substitute plink.qassoc with your own results filename.

mydata=read.table("plink.qassoc", header=TRUE)

Finally, run this function which will produce and save in the current directory both a QQ-plot and a manhattan plot of your results:

qqman(mydata)

QQ-Plot (these are simulated GWAS results):


Manhattan plot (click to enlarge):

A few notes:  First, if you're doing this on a linux machine from your Windows computer, you'll need to be running the previously mentioned XMing server on your Windows computer for the plot to save correctly.  Second, the qqman() function calls the manhattan() function, which is extremely slow and memory-intensive. It may take about 3 minutes to run for each dataset. The memory issue isn't a problem on 64-bit machines, but you may run out of memory on 32-bit machines if you're doing this with GWAS data. I'm going to try to improve this in the future. Finally, using that source command you also downloaded a function I wrote called qqmanall(), which does just what it sounds like - if you run it on a linux machine with no arguments it reads in ALL of the plink GWAS results stored in the current directory, and creates QQ and manhattan plots for all of them with a common upper limit for the y-axis corresponding to the most significant result. Enjoy.

...

Update Thursday, January 21, 2010: I neglected to mention yesterday the format of the plink.assoc or plink.qassoc files, in case you want to produce the same plots using results from software other than PLINK. When you load your .assoc files into a data frame, the relevant columns are named "CHR", "BP", and "P". You can use this plotting function as long as you have these three columns in your data frame, regardless of whether you use PLINK or not.


Friday, October 9, 2009

Visualizing sample relatedness in a GWAS using PLINK and R

Strict quality control procedures are extremely important for any genome-wide association study.  One of the first steps you should take when running QC on your GWAS is to look for related samples in your dataset.  This does two things for you.  First, you can get an idea of how many related samples you have in your dataset, and second, if you have access to self-report relationship information, you can identify potential sample mix-ups based on discrepancies between genetic information and self-report.

PLINK allows you to estimate genomewide IBD-sharing coefficients between seemingly unrelated individuals from whole-genome data.  You can read lots more about the particular method they use to infer IBD given IBS and allele frequencies in the PLINK paper.  It's relatively straightforward to compute these estimates.  If you already have your data in pedfile format, use the following command (assuming mydata.ped and mydata.map are in the current directory where you run PLINK).

plink --file mydata --genome --min 0.05

This will usually take a few days on genome-wide data.  You can speed this up by first thinning your dataset to around 100,000 markers using the --thin option, as sketched below.
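A sketch of that thinning step (--thin retains each SNP with the given probability, so pick a value that leaves you roughly 100,000 markers; 0.2 here is just illustrative):

plink --file mydata --thin 0.2 --recode --out mydata_thinned
plink --file mydata_thinned --genome --min 0.05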

After you've run this you'll have a file in that directory called plink.genome.  You can read about the output here, but we're really interested in Z1 and Z0, the estimated proportions of the genome where each pair of individuals shares one and zero alleles identical by descent, respectively.  You can plot Z1 by Z0 and color-code the points by the relationship type as coded in the pedfile using this R code.

Here's what the columns of interest will look like:

  RT     Z1     Z0
1 UN 0.1697 0.8294
2 OT 0.5034 0.4879
3 OT 0.4403 0.5439
4 OT 0.5200 0.4674
5 UN 0.1646 0.8311
6 OT 0.5519 0.4481



First, fire up R, go to the directory where your plink results are located and load in the data:

d = read.table("plink.genome", header=T)

Next, set up the plot area. The par option makes the points solid instead of an open circle, and the plot command sets up your plot and axes without adding any points.

par(pch=16)
with(d,plot(Z0,Z1, xlim=c(0,1), ylim=c(0,1), type="n"))

Your plot should be empty, like this:


Now use this command to plot points where the relationship type is FS, or full sibling.  Make the points green (col=3).

with(subset(d,RT=="FS") , points(Z0,Z1,col=3))


Do the same thing with half sibs, "other related" (have the same family ID), parent offspring pairs, and unrelated, making each of them a different color.

with(subset(d,RT=="HS") , points(Z0,Z1,col="darkorange"))
with(subset(d,RT=="OT") , points(Z0,Z1,col=4))
with(subset(d,RT=="PO") , points(Z0,Z1,col=2))
with(subset(d,RT=="UN") , points(Z0,Z1,col=1))



You might want to rerun the commands for "other related" and especially half sibs, because they were covered up when you plotted unrelateds.

with(subset(d,RT=="OT") , points(Z0,Z1,col=4))
with(subset(d,RT=="HS") , points(Z0,Z1,col="darkorange"))



Finally, add a legend

legend(1,1, xjust=1, yjust=1, legend=levels(d$RT), pch=16, col=c(3,"darkorange",4,2,1))



For a recap, your code should look like this:


d = read.table("plink.genome", header=T)

par(pch=16)
with(d,plot(Z0,Z1, xlim=c(0,1), ylim=c(0,1), type="n"))

with(subset(d,RT=="FS") , points(Z0,Z1,col=3))
with(subset(d,RT=="HS") , points(Z0,Z1,col="darkorange"))
with(subset(d,RT=="OT") , points(Z0,Z1,col=4))
with(subset(d,RT=="PO") , points(Z0,Z1,col=2))
with(subset(d,RT=="UN") , points(Z0,Z1,col=1))
with(subset(d,RT=="OT") , points(Z0,Z1,col=4))
with(subset(d,RT=="HS") , points(Z0,Z1,col="darkorange"))

legend(1,1, xjust=1, yjust=1, legend=levels(d$RT), pch=16, col=c(3,"darkorange",4,2,1))


What you can see from the above plot is that self-reported parent-offspring pairs share exactly one allele identical by descent at every locus (Pr(IBD=1)=1).  There is a single sib pair at the bottom left where Pr(IBD=1)=0 and Pr(IBD=0)=0.  This means that this sib pair shares 2 alleles identical by descent at every locus across the genome.  This is either a duplicated sample or a set of identical twins.  The full sibs cluster right around the middle of the plot, and you have lots of unrelated or "other related" pairs stretching from completely unrelated (bottom right quadrant) to relatively closely related (2nd or 3rd degree relatives, right around the middle of the plot).  If you see self-reported sibs showing up where the unrelated individuals cluster on this plot, or vice versa, this should clue you in on potential sample identity problems.

...

Just for fun, let's make the same plot using ggplot2.  Make sure you have the data loaded as above, then install ggplot2 if you haven't already:

install.packages("ggplot2")

You only have to install the package once.  Next load the package:

library(ggplot2)

Use this incredibly simple command to make the plot:

qplot(Z0,Z1,data=d,colour=RT)





I'm not a huge fan of these colors, so let's make them the same as above:

qplot(Z0,Z1,data=d,colour=RT) + scale_colour_manual(values=c(3,"darkorange",4,2,1))


That's a little better.  Obviously the code to do this is MUCH simpler in ggplot2.  The only problem I have with this is that it automatically plots the points in the order of the levels of RT (i.e. FS, then HS, then OT, PO, then UN), so you end up covering up all your half-sibs with "other related" and unrelated pairs.  One way around this is sketched below; post a comment if you have a better approach.  For more on ggplot2, see my previous post comparing histograms made using Stata, R base, R lattice and R ggplot2.
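One workaround (a sketch): add a second geom_point layer restricted to the half-sib rows so they're drawn last, on top of everything else.  The layer inherits the x, y, and colour mappings from the qplot call.

qplot(Z0,Z1,data=d,colour=RT) + geom_point(data=subset(d,RT=="HS"))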

Friday, August 28, 2009

Convert PLINK output to CSV

I tip my hat to Will for showing me this little command line trick. PLINK's output looks nice when you print it to the screen, but it can be a pain to load the output into Excel or a MySQL database because all the fields are separated by a variable number of spaces. This little command line trick will convert a variable-space delimited PLINK output file to a comma delimited file.

You need to be on a Linux/Unix machine to do this. Here's the command. I'm looking at results from Mendelian errors here. Replace "mendel" with the results file you want to reformat, and put this all on one line.

cat mendel.txt | sed -r 's/^\s+//g' | sed -r 's/\s+/,/g' > mendel.csv

You'll have created a new file called mendel.csv that you can now open directly in Excel or load into a database more easily than you could with the default output.

Before:

turnersd@provolone~: cat mendel.txt
FID PAT MAT CHLD N
1089 16223073 16223062 1 149
1116 16233564 16233589 1 114
123 16230767 16230725 2 189
12 16221778 16221803 1 116
12 16221805 16221822 1 98
12 16230486 16230496 1 76
12 16231205 16232111 2 180
134 16222939 16222945 2 140
1758 16230755 16231121 2 206

After:

turnersd@provolone~: cat mendel.csv
FID,PAT,MAT,CHLD,N
1089,16223073,16223062,1,149
1116,16233564,16233589,1,114
123,16230767,16230725,2,189
12,16221778,16221803,1,116
12,16221805,16221822,1,98
12,16230486,16230496,1,76
12,16231205,16232111,2,180
134,16222939,16222945,2,140
1758,16230755,16231121,2,206


If you're interested in the details of what this is doing here you go:

First, you cat the contents of the file and pipe it to a command called sed. The thing between the single quotes in the sed command is called a regular expression, which is similar to doing a find-and-replace in MS Word. What this does is search for the thing between the first pair of slashes and replace it with the thing between the next two slashes. You need the -r option, the "s" before the first slash, and the "g" after the last slash to make it work right.

/^\s+// is the first regular expression. \s is special and it means search for whitespace. \s+ means search for any amount of whitespace. The ^ means only look for it at the beginning of the line. Notice there is nothing between the second and third slashes, so it will replace any whitespace with nothing. This part trims any whitespace from the beginning of the line, which is important because in the next part we're turning any remaining whitespace into a comma, so we don't want the line to start with a comma.

/\s+/,/ is the second regular expression. Again we're searching for a variable amount of whitespace but this time replacing it with a comma.