Showing posts with label Statistics. Show all posts

Tuesday, September 16, 2014

R package to convert statistical analysis objects to tidy data frames

I talked a little bit about tidy data in my recent post about dplyr, but you should really go check out Hadley’s paper on the subject.
R expects inputs to data analysis procedures to be in a tidy format, but the model output objects you get back aren’t always tidy. The reshape2, tidyr, and dplyr packages are meant to take data frames, munge them around, and return a data frame. David Robinson's broom package bridges this gap by taking un-tidy output from model objects, which are not data frames, and returning it in a tidy data frame format.
(From the documentation): if you performed a linear model on the built-in mtcars dataset and view the object directly, this is what you’d see:
lmfit = lm(mpg ~ wt, mtcars)
lmfit
Call:
lm(formula = mpg ~ wt, data = mtcars)

Coefficients:
(Intercept)           wt  
     37.285       -5.344
summary(lmfit)
Call:
lm(formula = mpg ~ wt, data = mtcars)

Residuals:
   Min     1Q Median     3Q    Max 
-4.543 -2.365 -0.125  1.410  6.873 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   37.285      1.878   19.86  < 2e-16 ***
wt            -5.344      0.559   -9.56  1.3e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.05 on 30 degrees of freedom
Multiple R-squared:  0.753,  Adjusted R-squared:  0.745 
F-statistic: 91.4 on 1 and 30 DF,  p-value: 1.29e-10
If you’re just trying to read it, this is good enough, but if you’re doing other follow-up analysis or visualization, you end up hacking around with str() and pulling out coefficients using indices, and everything gets ugly quickly.
But running the tidy function from the broom package on the fit object gives you what you were probably looking for in a tidy data frame:
tidy(lmfit)
         term estimate stderror statistic   p.value
1 (Intercept)   37.285   1.8776    19.858 8.242e-19
2          wt   -5.344   0.5591    -9.559 1.294e-10
The tidy() function also works on other types of model objects, like those produced by glm() and nls(), as well as popular built-in hypothesis testing tools like t.test(), cor.test(), or wilcox.test().
View the README on the GitHub page, or install the package and run the vignette to see more examples and conventions.
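The tidy output drops straight into downstream dplyr or ggplot2 work. As a quick sketch (assuming the broom package is installed), here is tidy() applied to one of the hypothesis tests mentioned above:

```r
library(broom)

# a two-sample t-test comparing mpg across transmission types
tt <- t.test(mpg ~ am, data = mtcars)

# instead of str()-spelunking through the htest object, get a one-row
# data frame with columns like estimate, statistic, and p.value
tidy(tt)
```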

Friday, June 13, 2014

An Annotated Online Bioinformatics / Computational Biology Curriculum

Two years ago David Searls published an article in PLoS Comp Bio describing a series of online courses in bioinformatics. Yesterday, the same author published an updated version, "A New Online Computational Biology Curriculum," (PLoS Comput Biol 10(6): e1003662. doi: 10.1371/journal.pcbi.1003662).

This updated curriculum has a supplemental PDF describing hundreds of video courses that are foundational to a good understanding of computational biology and bioinformatics. The table of contents embedded into the PDF's metadata (Adobe Reader: View>Navigation Panels>Bookmarks; Apple Preview: View>Table of Contents) breaks the curriculum down into 11 "departments" with links to online courses in each subject area:

  1. Mathematics Department
  2. Computer Science Department
  3. Data Science Department
  4. Chemistry Department
  5. Biology Department
  6. Computational Biology Department
  7. Evolutionary Biology Department
  8. Systems Biology Department
  9. Neurosciences Department
  10. Translational Sciences Department
  11. Humanities Department
Listings in the catalog can take one of three forms: Courses, Current Topics, or Seminars. All listed courses are video-based and free of charge, many being MOOCs offered by Coursera or edX.

More than just a link dump, the author of this paper has "road tested" most of the courses, having enrolled in up to a dozen at a time. Under each course listing the author offers commentary on the importance of the subject to a computational biology education and an opinion on the quality of instruction. (He ranks among the top 50 Coursera students by number of completed courses!) For the courses the author completed, listings include an "evaluation" section ranking difficulty, time requirements, lecture and homework effectiveness, assessment quality, and overall impressions. The author also adds autobiographical notes on why he thinks particular courses are useful in a bioinformatics career. 

In summary, years of work went into creating this annotated course catalog. It's a great resource if you're looking to get started in a computational biology career, or if you're just looking to brush up on individual topics ranging from natural language processing to evolutionary theory.


Wednesday, January 22, 2014

Coursera Specializations: Data Science, Systems Biology, Python Programming

I first mentioned Coursera about a year ago, when I hired a new analyst in my core. This new hire came in as a very competent Python programmer with a molecular biology and microbial ecology background, but with very little experience in statistics. I got him to take Roger Peng's Computing for Data Analysis course and Jeff Leek's Data Analysis course, and four weeks later he was happily doing statistical analysis in R for gene expression experiments for several of our clients.

Today, Coursera announced Specializations - sequences of courses offered by the same institution, with the option of earning a specialization certificate from the University teaching the courses upon successful completion.

Among others, several specializations that look particularly interesting are:

Johns Hopkins University's Data Science Specialization

This specialization, one of the longer ones, is taught by Brian Caffo, Roger Peng, and Jeff Leek at Johns Hopkins. Its courses include:


  • The Data Scientist’s Toolbox
  • R Programming
  • Getting and Cleaning Data
  • Exploratory Data Analysis
  • Reproducible Research
  • Statistical Inference
  • Regression Models
  • Practical Machine Learning
  • Developing Data Products
  • A final Capstone Project



Systems Biology Specialization (Icahn School of Medicine at Mount Sinai)

Courses include:

  • Introduction to Systems Biology
  • Network Analysis in Systems Biology
  • Dynamical Modeling Methods for Systems Biology
  • Integrated Analysis in Systems Biology
  • A final Capstone Project

Fundamentals of Computing Specialization (Rice University)

Courses include:

  • An Introduction to Interactive Programming in Python
  • Principles of Computing
  • Algorithmic Thinking
  • A final Capstone Project

Check out the Coursera Specializations page for other Coursera series.

    Tuesday, December 31, 2013

    Jeff Leek's non-comprehensive list of awesome things other people did in 2013

    Jeff Leek, biostats professor at Johns Hopkins and instructor of the Coursera Data Analysis course, recently posted on Simply Statistics this list of awesome things other people accomplished in 2013 in genomics, statistics, and data science.

    At risk of sounding too meta, I'll say that this list itself is one of the awesome things that was put together in 2013. You should go browse the entire post for yourself, but I'll highlight a few that I saved to my reading list:


    This is only a sample of what's posted on Jeff's blog. Go read the full post below.

    Simply Statistics: A non-comprehensive list of awesome things other people did this year.

    Thursday, May 30, 2013

    PLATO, an Alternative to PLINK

    Almost since the beginning of genome-wide association studies, the PLINK software package (developed by Shaun Purcell’s group at the Broad Institute and MGH) has been the standard for manipulating the large-scale data produced by these studies.  Over the course of its development, numerous features and options were added to enhance its capabilities, but it is best known for its core functionality: quality control and standard association tests. 

    Nearly 10 years ago (around the time PLINK was just getting started), the CHGR Computational Genomics Core (CGC) at Vanderbilt University started work on a similar framework for implementing genotype QC and association tests.  This project, called PLATO, has stayed active primarily to provide functionality and control that (for one reason or another) is unavailable in PLINK.  We have found it especially useful for processing ADME and DMET panel data – it supports QC and association tests of multi-allelic variants.    

    PLATO runs via a command line interface, but accepts a batch file that allows users to specify an order of operations for QC filtering steps.  When running multiple QC steps in a single run of PLINK, the order of application is hard-coded and not well documented.  As a result, users wanting this level of control must run a sequence of PLINK commands, generating new data files at each step and increasing both compute time and disk usage.  PLATO also has a variety of data reformatting options for other genetic analysis programs, making it easy to run EIGENSTRAT, for example.

    The detail of QC output from each of the filtering steps is much greater in PLATO, allowing output per group (founders only, parents only, etc), and giving more details on why samples fail sex checks, Hardy-Weinberg checks, and Mendelian inconsistencies to facilitate deeper investigation of these errors.  And with family data, disabling samples due to poor genotype quality retains pedigree information useful for phasing and transmission tests. Full documentation and download links can be found here (https://chgr.mc.vanderbilt.edu/plato).  Special thanks to Yuki Bradford in the CGC for her thoughts on this post.  

    Monday, January 28, 2013

    Scotty, We Need More Power! Power, Sample Size, and Coverage Estimation for RNA-Seq

    Two of the most common questions at the beginning of an RNA-seq experiment are "how many reads do I need?" and "how many replicates do I need?". This paper describes a web application for designing RNA-seq experiments that calculates an appropriate sample size and read depth to satisfy user-defined criteria such as cost and the maximum number of reads or replicates attainable. The power and sample size estimations are based on a t-test, which, the authors claim, performs no worse than the negative binomial models implemented by popular RNA-seq methods such as DESeq when three or more replicates are present. Empirical distributions are taken from either (1) pilot data that the user can upload, or (2) built-in publicly available data. The authors find that there is substantial heterogeneity between experiments (technical variation is larger than biological variation in many cases), and that power and sample size estimation will be more accurate when the user provides their own pilot data.

    My only complaint, for all the reasons expressed in my previous blog post about why you shouldn't host things like this exclusively on your lab website, is that the code to run this analysis doesn't appear to be available to save, study, modify, maintain, or archive. When lead author Michele Busby leaves Gabor Marth's lab, hopefully the app doesn't fall into the graveyard of computational biology web apps. Update 2/7/13: Michele Busby created a public GitHub repository for the Scotty code: https://github.com/mbusby/Scotty

    tl;dr? There's a new web app that does power, sample size, and coverage calculations for RNA-seq, but it only works well if the pilot or public data you give it closely matches the actual data you'll collect. 



    Tuesday, August 28, 2012

    More on Exploring Correlations in R

    About a year ago I wrote a post about producing scatterplot matrices in R. These are handy for quickly getting a sense of the correlations that exist in your data. Recently someone asked me to pull out some relevant statistics (correlation coefficient and p-value) into tabular format to publish beside a scatterplot matrix. The built-in cor() function will produce a correlation matrix, but what if you want p-values for those correlation coefficients? Also, instead of a matrix, how might you get these statistics in tabular format (variable i, variable j, r, and p, for each i-j combination)? Here's the code (you'll need the PerformanceAnalytics package to produce the plot).


    The cor() function will produce a basic correlation matrix.  12 years ago Bill Venables provided a function on the R help mailing list for replacing the upper triangle of the correlation matrix with the p-values for those correlations (based on the known relationship between t and r). The cor.prob() function will produce this matrix.

    Next, the flattenSquareMatrix() function will "flatten" this matrix to four columns: one for variable i, one for variable j, one for their correlation, and another for their p-value (thanks to Chris Wallace on StackOverflow for helping out with this one).
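The code embedded at this point didn't survive in this copy. Below is a sketch of the two functions as described, with cor.prob() built on the t-to-r relationship mentioned above (treat the details as a reconstruction, not the original gist):

```r
# correlation matrix with p-values in the upper triangle, using the
# known relationship between the t statistic and r
cor.prob <- function(X, dfr = nrow(X) - 2) {
  R <- cor(X)
  above <- row(R) < col(R)
  r2 <- R[above]^2
  Fstat <- r2 * dfr / (1 - r2)
  R[above] <- 1 - pf(Fstat, 1, dfr)  # p-value for each correlation
  diag(R) <- NA
  R
}

# flatten the square output of cor.prob() to four columns:
# variable i, variable j, their correlation, and their p-value
flattenSquareMatrix <- function(m) {
  if (!is.matrix(m) || nrow(m) != ncol(m)) stop("Must be a square matrix.")
  if (!identical(rownames(m), colnames(m))) stop("Row and column names must be equal.")
  ut <- upper.tri(m)
  data.frame(i   = rownames(m)[row(m)[ut]],
             j   = rownames(m)[col(m)[ut]],
             cor = t(m)[ut],  # correlations sit in the lower triangle
             p   = m[ut])     # p-values sit in the upper triangle
}

flattenSquareMatrix(cor.prob(mtcars[, c("mpg", "wt", "hp")]))
```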



    Finally, the chart.Correlation() function from the PerformanceAnalytics package produces a very nice scatterplot matrix, with histograms, kernel density overlays, absolute correlations, and significance asterisks (0.05, 0.01, 0.001):
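A minimal call (assuming the PerformanceAnalytics package is installed) looks like this:

```r
library(PerformanceAnalytics)

# scatterplot matrix with histograms and kernel density overlays on the
# diagonal, absolute correlations and significance stars above it
chart.Correlation(mtcars[, c("mpg", "disp", "hp", "wt")],
                  histogram = TRUE, pch = 19)
```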

    Friday, August 5, 2011

    Friday Links: R, OpenHelix Bioinformatics Tips, 23andMe, Perl, Python, Next-Gen Sequencing

    I haven't posted much here recently, but here is a roundup of a few of the links I've shared on Twitter (@genetics_blog) over the last two weeks.

    Here is a nice tutorial on accessing high-throughput public data (from NCBI) using R and Bioconductor.

    Cloudnumbers.com, a startup that allows you to run high-performance computing (HPC) applications in the cloud, now supports the previously mentioned R IDE, RStudio.

    23andMe announced a project to enroll 10,000 African-Americans for research by giving participants their personal genome service for free. You can read about it here at 23andMe or here at Genetic Future.

    Speaking of 23andMe, they emailed me a coupon code (8WR9U9) for getting $50 off their personal genome service, making it $49 instead of $99. Not sure how long it will last.

    I previously took a poll which showed that most of you use Mendeley to manage your references. Mendeley recently released version 1.0, which includes some nice features like duplicate detection, better library organization (subfolders!), and a better file organization tool. You can download it here.

    An interesting blog post by Michael Barton on how training and experience in bioinformatics leads to a wide set of transferable skills.

    Dienekes releases a free DIY admixture program to analyze genomic ancestry.

    A few tips from OpenHelix: the new SIB Bioinformatics Resource Portal, and testing correlation between SNPs and gene expression using SNPexp.

    A nice animation describing a Circos plot from PacBio's E. coli paper in NEJM.

    The Court of Appeals for the Federal Circuit reversed the lower court's invalidation of Myriad Genetics' patents on BRCA1/2, reinstating most of the claims in full force. Thoughtful analysis from Dan Vorhaus here.

    Using the Linux shell and perl to delete files in the current directory that don't contain the right number of lines: If you want to get rid of all files in the current directory that don't have exactly 42 lines, run this code at the command line (*be very careful with this one!*): for f in *.txt;do perl -ne 'END{unlink $ARGV unless $.==42}' ${f} ;done

    The previously mentioned Hitchhiker's Guide to Next-Generation Sequencing by Gabe Rudy at Golden Helix is now available in PDF format here. You can also find the related post describing all the various file formats used in NGS in PDF format here.

    The Washington Post ran an article about the Khan Academy (http://www.khanacademy.org/), which has thousands of free video lectures, mostly on math. There are also a few computer science lectures that teach Python programming. (Salman Khan also appeared on the Colbert Report a few months ago).

    Finally, I stumbled across this old question on BioStar with lots of answers about methods for short read mapping with next-generation sequencing data.

    ...

    And here are a few interesting papers I shared:

    Nature Biotechnology: Structural variation in two human genomes mapped at single-nucleotide resolution by whole genome de novo assembly

    PLoS Genetics: Gene-Based Tests of Association

    PLoS Genetics: Fine Mapping of Five Loci Associated with Low-Density Lipoprotein Cholesterol Detects Variants That Double the Explained Heritability

    Nature Reviews Genetics: Systems-biology approaches for predicting genomic evolution


    Genome Research: A comprehensively molecular haplotype-resolved genome of a European individual (paper about the importance of phase in genetic studies)

    Nature Reviews Microbiology: Unravelling the effects of the environment and host genotype on the gut microbiome.
    ...

    Tuesday, March 8, 2011

    Splitting a Dataset Revisited: Keeping Covariates Balanced Between Splits

    In my previous post I showed you how to randomly split up a dataset into training and testing datasets. (Thanks to all those who emailed me or left comments letting me know that this could be done using other means. As things go with R, it's sometimes easier to write a new function yourself than it is to hunt down the function or package that already exists.)

    What if you wanted to split a dataset into training/testing sets but ensure that there are no significant differences between a variable of interest across the two splits?

    For example, if we use the splitdf() function from last time to split up the iris dataset, setting the random seed to 44, it turns out the outcome variable of interest, Sepal.Length, differs significantly between the two splits.

    splitdf <- function(dataframe, seed=NULL) {
        if (!is.null(seed)) set.seed(seed)
        index <- 1:nrow(dataframe)
        trainindex <- sample(index, trunc(length(index)/2))
        trainset <- dataframe[trainindex, ]
        testset <- dataframe[-trainindex, ]
        list(trainset=trainset,testset=testset)
    }
    
    data(iris)
    s44 <- splitdf(iris, seed=44)
    train <- s44$trainset
    test <- s44$testset
    t.test(train$Sepal.Length, test$Sepal.Length)

    What if we wanted to ensure that the means of Sepal.Length, as well as the other continuous variables in the dataset, do not differ between the two splits?

    Again, this is probably something that's already available in an existing package, but I quickly wrote another function to do this. It's called splitdf.randomize(), and it depends on splitdf() from before. You give splitdf.randomize() the data frame you want to split and a character vector containing all the columns you want to keep balanced between the two splits. The function is a wrapper for splitdf(): it randomly makes a split and does a t-test on each column you specify. If the p-value on any of those t-tests is less than 0.5 (yes, 0.5, not 0.05), the loop restarts and splits the dataset again. (Currently this only works with continuous variables, but if you wanted to extend it to categorical variables, it wouldn't be hard to throw a Fisher's exact test into the while loop.)

    For each iteration, the function prints out the p-value for the t-test on each of the variable names you supply. As you can see in this example, it took four iterations to ensure that all of the continuous variables were evenly distributed among the training and testing sets. Here it is in action:
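The embedded code and output didn't survive in this copy; a sketch of the wrapper as described above (the argument names are my reconstruction) looks like this:

```r
# splitdf() from the previous post, repeated for completeness
splitdf <- function(dataframe, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  index <- 1:nrow(dataframe)
  trainindex <- sample(index, trunc(length(index) / 2))
  list(trainset = dataframe[trainindex, ], testset = dataframe[-trainindex, ])
}

# re-split until a t-test on every named column gives p >= 0.5,
# i.e. the two halves look well balanced on those variables
splitdf.randomize <- function(dataframe, ttestcolnames) {
  repeat {
    s <- splitdf(dataframe)
    p <- sapply(ttestcolnames, function(col)
      t.test(s$trainset[[col]], s$testset[[col]])$p.value)
    print(p)  # show the p-values at each iteration
    if (all(p >= 0.5)) return(s)
  }
}

data(iris)
s <- splitdf.randomize(iris, names(iris)[1:4])
```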

    Tuesday, February 22, 2011

    Get all your Questions Answered

    When I have a question I usually ask the internet before bugging my neighbor. Yet it seems like Google's search results have become increasingly irrelevant over the last few years, and this is especially true for searching anything related to R (and previously mentioned Rseek.org doesn't really do the job I would expect it to do either).

    The last few years have seen the development of several community-powered Q&A websites, and I'm not talking about Yahoo Answers. Here are a few that come to mind that I've used and found extremely helpful.

    Biostar (biostars.org) - a Q&A site for bioinformatics. The site's focus is bioinformatics, computational genomics and biological data analysis. A few of my favorite threads from this site are one on mapping SNPs to pathways, and another on mapping SNPs to genes using tools like the UCSC public MySQL server.

    CrossValidated (http://stats.stackexchange.com/) - a Q&A site for statisticians, data miners, and anyone else doing data analysis. This one's relatively new but already has many very talented and extremely helpful users. Last week I asked a question about R², about the difference between variance explained and variation explained, and how that relates to Random Forests. The question was answered merely a few hours later.

    Finally, there's Quora (http://www.quora.com/). Quora's a little different from the others, and you can ask just about anything you want here. Quora's also still young, but seems to have lots of science/tech geeks like us using it. I recently asked a question, requesting a lay explanation of how Random Forest works, and got a great answer. There was also a good thread about whether current customers found 23andMe to be worth buying.

    There's an FAQ on all of these sites that explains how to ask a good question. You might even try answering a few questions yourself and find it rewarding. It's a lot like playing a game, with rather odd goals. You get reputation points and "badges" for answering questions, having your answers voted on, commenting on others' answers, etc. You'll also find that as your own reputation increases by providing good answers to others' questions, your own questions will be answered more quickly. If none of these are quite what you're looking for, check out the stackexchange directory. You'll find Q&A sites that all use the same engine dedicated to topics from photography or cooking to programming and web development.

    *Edit 2011-02-22* Thanks to two commenters for pointing this out. There's also a good Q&A community for next generation sequencing, including a forum (http://seqanswers.com/) and a StackExchange site (http://i.seqanswers.com/)

    Monday, January 10, 2011

    R function for extracting F-test P-value from linear model object

    I thought it would be trivial to extract the p-value on the F-test of a linear regression model (testing the null hypothesis R²=0). If I fit the linear model fit <- lm(y ~ x1 + x2), I can't seem to find it in names(fit) or summary(fit). But summary(fit)$fstatistic does give you the F statistic and both degrees of freedom, so I wrote this function to quickly pull the p-value from this F-test on an lm object, and added it to my R profile. If there's a built-in R function to do this, please comment!
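The gist with the function didn't make it into this copy; reconstructed from the description, it was essentially the following:

```r
# extract the p-value of the overall F-test from an lm object
lmp <- function(modelobject) {
  if (!inherits(modelobject, "lm")) stop("Not an object of class 'lm'")
  f <- summary(modelobject)$fstatistic  # value, numdf, dendf
  p <- pf(f[1], f[2], f[3], lower.tail = FALSE)
  attributes(p) <- NULL  # drop the leftover "value" name
  p
}

fit <- lm(mpg ~ wt + hp, data = mtcars)
lmp(fit)  # the same p-value printed at the bottom of summary(fit)
```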

    Monday, December 6, 2010

    Using the "Divide by 4 Rule" to Interpret Logistic Regression Coefficients

    I was recently reading a bit about logistic regression in Gelman and Hill's book on hierarchical/multilevel modeling when I first learned about the "divide by 4 rule" for quickly interpreting coefficients in a logistic regression model in terms of the predicted probabilities of the outcome. The idea is pretty simple. The logistic curve (predicted probabilities) is steepest at the center, where a+βx=0 and logit⁻¹(a+βx)=0.5. See the plot below (or use the R code to plot it yourself).



    The slope of this curve (the first derivative of the logistic curve) is maximized at a+βx=0, where it takes on the value:

    βe⁰/(1+e⁰)²

    = β(1)/(1+1)²

    = β/4

    So you can take the logistic regression coefficients (not including the intercept) and divide them by 4 to get an upper bound on the predictive difference in the probability of the outcome y=1 per unit increase in x. This approximation is best near the midpoint of x, where predicted probabilities are close to 0.5, which is where most of the data will lie anyhow.

    So if your regression coefficient is 0.8, a rough approximation using the β/4 rule is that a one-unit increase in x results in at most about a 0.8/4 = 0.2 (20 percentage point) increase in the probability that y=1.
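The plotting code referenced above is missing from this copy; a minimal sketch that draws the curve and checks the β/4 slope numerically:

```r
# inverse logit: predicted probability as a function of the linear predictor
invlogit <- function(x) 1 / (1 + exp(-x))

# the curve is steepest at the center, where the predicted probability is 0.5
curve(invlogit(x), from = -6, to = 6, xlab = "a + bx", ylab = "P(y = 1)")
abline(v = 0, h = 0.5, lty = 2)

# numerical check: with a slope coefficient of 1, the derivative at the
# center is 1/4, matching the divide-by-4 rule
eps <- 1e-6
(invlogit(eps) - invlogit(-eps)) / (2 * eps)
```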

    Friday, June 11, 2010

    Seminar: T-Test on Fold Changes

    I've had friends in biochem "wet" labs who've asked me to do some simple statistics on some of their results. This looks like an interesting seminar to attend if you've ever thought about doing a t-test on fold changes in some outcome measure between treatment and control groups, a pretty common outcome in biochemical assays. If the speaker provides slides electronically I'll happily post them here after the seminar.

    Department of Biostatistics Seminar/Workshop Series:

    T-Test on Fold Changes

    Tatsuki Koyama, PhD
    Assistant Professor of Biostatistics, Cancer Biostatistics Center, Vanderbilt-Ingram Cancer Center
    Wednesday, June 16, 1:30-2:30pm, MRBIII Conference Room 1220

    Basic science experiments often use a separate control group for each treatment group. Typically, the treatment group outcomes are scaled by the average of the corresponding control group outcomes. Despite its overwhelming popularity, this "fold change" method has serious statistical problems resulting in reduced validity. When the implicit variability of the control group outcomes is ignored, a large type I error inflation can result. Likewise, this scaling induces correlation and can substantially inflate the type I error when this correlation is ignored. We present simulations showing that this inflation results in type I error rates as high as 50% in everyday settings. We propose some computational and analytical approaches for dealing with this problem, and we present some practical recommendations for experimental designs with small sample sizes. Intended audience: Clinical and basic science researchers and statisticians.

    Thursday, May 20, 2010

    Tutorial: Principal Components Analysis (PCA) in R

    Found this tutorial by Emily Mankin on how to do principal components analysis (PCA) using R. Has a nice example with R code and several good references. The example starts by doing the PCA manually, then uses R's built in prcomp() function to do the same PCA.

    Principal Components Analysis: A How-To Manual for R
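For the built-in route alone, a minimal sketch looks like this:

```r
# PCA on the four measurement columns of iris, centered and scaled
p <- prcomp(iris[, 1:4], scale. = TRUE)

summary(p)   # proportion of variance explained by each component
head(p$x)    # principal component scores for the first few flowers
p$rotation   # loadings of each original variable on each component
```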

    Thursday, May 13, 2010

    Using R, LaTeX, and Sweave for Reproducible Research: Handouts, Templates, & Other Resources

    Several readers emailed me or left a comment on my previous announcement of Frank Harrell's workshop on using Sweave for reproducible research asking if we could record the seminar. Unfortunately we couldn't record audio or video, but take a look at the Sweave/Latex page on the Biostatistics Dept Wiki. Here you can find Frank's slideshow from today and the handout from today (a PDF statistical report with all the LaTeX/R code necessary to produce it). While this was more of an advanced Sweave/LaTeX workshop, you can also find an introduction to LaTeX and an introduction to reproducible research using R, LaTeX, and Sweave, both by Theresa Scott.

    In addition to lots of other helpful tips, you'll also find the following resources to help you learn to use both Sweave and LaTeX:

    Tuesday, May 11, 2010

    R Package 'rms' for Regression Modeling

    If you attended Frank Harrell's Regression Modeling Strategies course a few weeks ago, you got a chance to see the rms package for R in action. Frank's rms package does regression modeling, testing, estimation, validation, graphics, prediction, and typesetting by storing enhanced model design attributes in the fit. rms is a re-written version of the Design package that has improved graphics and duplicates very little code in the survival package.

    First install the rms package:
    install.packages("rms")

    Next, load the package:
    library(rms)

    Finally, run this command to get very extensive documentation of many of the features in the rms package:
    ?rmsOverview

    You can also get a walk-through consisting of several example uses of functions in the rms package for modeling and graphics.
    example(rmsOverview)
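As a quick sketch of the package in use (assuming rms is installed; the model here is just an illustration):

```r
library(rms)

# ols() is the rms analogue of lm(); the fit object keeps enhanced
# design information for later testing, validation, and plotting
fit <- ols(mpg ~ wt + disp, data = mtcars)
fit          # printing shows coefficients along with model statistics
anova(fit)   # Wald tests for each predictor
```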

    Wednesday, April 21, 2010

    Checklist: Statistical Problems to Document and Avoid

    Update 2010-04-21: I forgot to post the link last time. That would have been helpful. Here you go:

    Vanderbilt Biostatistics: Statistical Problems to Document and to Avoid

    .....

    At the Regression Modeling Strategies course I attended a few weeks ago, Frank Harrell pointed out the checklist on the biostatistics department's website of statistical problems to document and avoid. It was recommended that authors of any paper employing statistical analysis should go through this checklist before writing and submitting a manuscript.  Some of the topics include:

    • Design and sample size issues
    • Inefficient use of continuous variables (don't categorize!)
    • Assumptions of parametric tests
    • Inappropriate analysis of repeated measures data
    • P-value interpretation
    • Filtering results
    • Missing data
    • Multiple testing concerns
    • Model building and specification
    • Use of stepwise variable selection (don't do it)
    • Overfitting
    Be sure to check this out before writing up results, and ideally before you even plan any experiments, especially if you are relatively new to quantitative analysis.

    Wednesday, April 14, 2010

    Cancer Biostatistics Workshop: Overfitting

    This month's cancer biostatistics workshop on overfitting will be given by Fei Ye and Zhiguo (Alex) Zhao, both in the Department of Biostatistics and the Cancer Biostatistics Center. This looks like a good one, especially after attending Frank Harrell's regression modeling strategies course a few weeks ago. See the link below for the full 2010 series.

    2010 Cancer Biostatistics Workshop Series

    Tuesday, April 13, 2010

    Efficient Mixed-Model Association in GWAS using R

    I recently did an analysis for the eMERGE network involving lots of individuals from a small town in central Wisconsin, where many of the subjects were related to one another. The subjects could not be treated as independent, but I could not use a family-based design either. I ended up using a mixed model approach with the previously mentioned GenABEL. You can read about the method here (PubMed).

    While researching which methods to use, I ran into what could be a potential problem. All of the methods that examine relatedness (including the method mentioned above), assume you have an ethnically homogeneous population. Yet all of the methods which look for population stratification (Eigenstrat, Structure, etc) assume samples are unrelated. So what do you do if you have both population stratification AND a high level of relatedness among your samples?

    A few weeks ago our graduate student association invited and hosted Dr. Elaine Ostrander here from the NIH to talk about her work with gene mapping in dogs. She mentioned a method she used called Efficient Mixed-Model Association (EMMA) for performing association mapping while simultaneously correcting for relatedness and population structure. Using multiple highly inbred dog breeds represents the extreme case of simultaneously having to deal with substructure, inbreeding, and relatedness. If this method works for association mapping combining several purebred dog breeds, it should work for a less problematic human dataset as well.

    EMMA is also implemented in R. You can download the necessary R package from the project's website below.

    Efficient Mixed-Model Association (EMMA) website

    PubMed: Efficient control of population structure in model organism association mapping.

    Abstract: Genomewide association mapping in model organisms such as inbred mouse strains is a promising approach for the identification of risk factors related to human diseases. However, genetic association studies in inbred model organisms are confronted by the problem of complex population structure among strains. This induces inflated false positive rates, which cannot be corrected using standard approaches applied in human association studies such as genomic control or structured association. Recent studies demonstrated that mixed models successfully correct for the genetic relatedness in association mapping in maize and Arabidopsis panel data sets. However, the currently available mixed-model methods suffer from computational inefficiency. In this article, we propose a new method, efficient mixed-model association (EMMA), which corrects for population structure and genetic relatedness in model organism association mapping. Our method takes advantage of the specific nature of the optimization problem in applying mixed models for association mapping, which allows us to substantially increase the computational speed and reliability of the results. We applied EMMA to in silico whole-genome association mapping of inbred mouse strains involving hundreds of thousands of SNPs, in addition to Arabidopsis and maize data sets. We also performed extensive simulation studies to estimate the statistical power of EMMA under various SNP effects, varying degrees of population structure, and differing numbers of multiple measurements per strain. Despite the limited power of inbred mouse association mapping due to the limited number of available inbred strains, we are able to identify significantly associated SNPs, which fall into known QTL or genes identified through previous studies while avoiding an inflation of false positives. An R package implementation and webserver of our EMMA method are publicly available.

    Tuesday, April 6, 2010

    ProbABEL - R package for GWAS data imputation

    I've been using GenABEL for some time now for GWAS analysis with related individuals. It has an excellent set of functions for estimating a kinship matrix from a dense marker panel and then using it in a linear mixed effects model to allow for related individuals in the analysis of a quantitative trait. GenABEL also has many other nice features for analysis and visualization of GWAS data that you can't find in PLINK; it's free, cross-platform, and implemented in R. I'll write another post about GenABEL later, but here I wanted to note that GenABEL's creator, Yurii Aulchenko, released another package called ProbABEL for genome-wide association analysis of imputed data. ProbABEL can analyze quantitative, binary, and survival outcomes while taking imputation uncertainty into account.

    BMC Bioinformatics - ProbABEL package for genome-wide association analysis of imputed data

    GenABEL homepage

    GenABEL tutorial and reference manual

    ProbABEL manual
    Creative Commons License
    Getting Genetics Done by Stephen Turner is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported License.