Getting Genetics Done: Stata

Showing posts with label Stata. Show all posts

Tuesday, February 23, 2010

Happy Birthday GGD! The 10 Most Popular Posts Since GGD's Launch

The first post on Getting Genetics Done was one year ago today. To celebrate, here are the top 10 most viewed posts since GGD launched last year. Incidentally, nine of the ten are tutorials on how to do something in R. Thanks to all the readers and all the commenters for sharing your thoughts!

ggplot2 Tutorial: Scatterplots in a Series of Small Multiples

GWAS Manhattan plots and QQ plots using ggplot2 in R

Hierarchical Clustering in R (by Will Bush)

Comparison of plots using Stata, R base, R lattice, and R ggplot2: Histograms

Merge data from different files using R

QQ plots of p-values in R using ggplot2

PDF tutorial from R course (Introduction to R)

Visualizing sample relatedness in a GWAS using PLINK and R

Split, apply, and combine in R using PLYR

Linux Command Line Cheat Sheet

Monday, January 18, 2010

Genome-wide Manhattan Plots in STATA

I have used this chunk of code on numerous occasions to plot GWAS data, so I thought I'd share!

twoway scatter logpval absposMB, msize(tiny) ysize(2) xsize(7) xaxis(1 2) yline(16.881438 2.9957323, lcolor(red))
xline(245.522847 488.541076 688.046816 879.458034 1060.3159 1231.291599 1389.919738 1536.194564 1674.623832 1810.03746 1944.489844 2076.939655 2191.082635 2297.45122 2397.790135 2486.617389 2565.392131 2641.509284 2705.320935 2767.756899 2814.701222 2864.255932, lpattern(dash) lcolor(gs12))
xlabel(122 "1" 367 "2" 588 "3" 783 "4" 969 "5" 1145 "6" 1310 "7" 1463 "8" 1605 "9" 1742 "10" 1877 "11" 2010 "12" 2134
"13" 2244 "14" 2347 "15" 2442 "16" 2526 "17" 2603 "18" 2673 "19" 2736 "20" 2791 "21" 2839 "22" 2941 "X", axis(2))
xlabel( 0(50)3000, axis(1) alternate) subtitle(,bcolor(white) ring(30) orientation(vertical) position(9) nobexpand) ytitle("-Log p-value") xtitle("Physical Distance (Mb)", axis(1))

The variables needed are a log p-value (or some other statistic) and the absolute genomic position of each SNP (distance from the beginning of chromosome 1). If you need the offsets to compute this absolute position, they are listed in MB in the xline(---) portion of the plot. It makes something like this:

Enjoy!

Monday, September 21, 2009

Comparison of plots using Stata, R base, R lattice, and R ggplot2, Part I: Histograms

One of the nicer things about many statistics packages is the extremely granular control you get over your graphical output. But I lack the patience to set dozens of command line flags in R, and I'd rather not power the computer by pumping the mouse trying to set all the clicky-box options in Stata's graphics editor. I want something that just looks nice, using the out-of-the-box defaults. Here's a little comparison of 4 different graphing systems (three using R, and one using Stata) and their default output for plotting a histogram of a continuous variable split over three levels of a categorical variable.

First I'll start with the three graphing systems in R: base, lattice, and ggplot2. If you don't have the last two packages installed, go ahead and download them:

install.packages("ggplot2")

install.packages("lattice")

Now load these two packages, and download this fake dataset I made up containing 100 samples each from three different genotypes ("geno") and a continuous outcome ("trait")

mydat=read.csv("http://people.vanderbilt.edu/~stephen.turner/ggd/2009-09-21-histodemo.csv",header=T)

library(ggplot2)

library(lattice)

Now let's get started...

R: base graphics

par(mfrow=c(3,1))

with(subset(mydat,geno=="aa"),hist(trait))

with(subset(mydat,geno=="Aa"),hist(trait))

with(subset(mydat,geno=="AA"),hist(trait))

R: lattice

histogram(~trait | factor(geno), data=mydat, layout=c(1,3))

R: ggplot2

qplot(trait,data=mydat,facets=geno~.)

# Update Tuesday, September 22, 2009
# A commenter mentioned that this code did not work.
# If the above code does not work, try explicitly
# stating that you want a histogram:
qplot(trait,geom="histogram",data=mydat,facets=geno~.)

Stata

insheet using "http://people.vanderbilt.edu/~stephen.turner/ggd/2009-09-21-histodemo.csv", comma clear
histogram trait, by(geno, col(1))

Commentary

In my opinion ggplot2 is the clear winner. Again I'll concede that all of the above graphing systems give you an incredible amount of control of every aspect of the graph, but I'm only looking for what gives me the best out-of-the-box default plot using the shortest command possible. R's base graphics give you a rather spartan plot, with very wide bins. It also requires 4 lines of code. (If you can shorten this, please comment). By default, the base graphics system gives you counts (frequency) on the vertical axis. The lattice package in R does a little better perhaps, but the default color scheme is visually less than stellar. Also, I'm not sure why the axis labels switch sides every other plot, and the ticks on top of the plot are probably unnecessary. I still think the bins are too wide. You lose some information especially on the bottom plot towards the right tail. The vertical axis is proportion of total. Stata's default plot looks very similar to lattice, but again uses a very unattractive color scheme. It uses density for the vertical axis, which may not mean much to non-statisticians. The default plot made by ggplot2 is just hands-down good-looking. There are no unnecessary lines delimiting the bins, and the binwidth is appropriate. The vertical axis represents counts. The black bars on the light-gray background have a good data-ink ratio. And it required the 2nd shortest command, only 3 characters longer than the Stata equivalent.

I'm ordering the ggplot2 book (Amazon, ~$50), so as I figure out how to do more with ggplot2 I'll post more comparisons like this. If you use SPSS, SAS, MATLAB, or something else, post the code in a comment here and send me a picture or link to the plot and I'll post it here.

Monday, July 27, 2009

A Primer on Logistic Regression (using Stata)

This handy introduction comes from UCLA's Statistical Computing webbook on logistic regression. If you're familiar with linear regression but you've never used logistic regression before, this is a good place to start. In addition to laying down some basic terminology and theory, this tutorial shows you how to access their example dataset, and gives you the Stata commands they use so you can follow along through their examples.

Introduction to Logistic Regression with Stata

If you're not sure when to use logistic regression, check out the two previous posts on choosing the right analysis for your data: part I and part II.

Monday, June 15, 2009

Side by side analyses in Stata, SPSS, SAS, and R

I've linked to UCLA's stat computing resources once before on a previous post about choosing the right analysis for the questions your asking and the data types you have. Here's another section of the same website that has code to run an identical analysis in all of these statistical packages, with examples to walk through (as they note here, just because they don't list code for a particular software for an analysis doesn't mean the software can't do it). They also have examples of power calculations for a smattering of statistics using previously mentioned G*power in addition to others.

UCLA Stat computing - data analysis examples with Stata, SPSS, SAS, R, and others

Monday, June 8, 2009

Make Pretty Regression Tables in Stata

The estout package for Stata is useful for quickly creating nicely formatted tables from a regression analysis for tables or papers. To install it, fire up Stata and type in this command:

ssc install estout, replace

Stata will automatically download and install the package. Run the regression as you normally would, then use the esttab command (part of the estout package) to create a table using those results. See their examples to see how the command works.

Click the thumbnail below to check out the difference. The top half is Stata's default output. The bottom is estout's formatted output.

Monday, April 6, 2009

What's the Right Analysis, Part II

Will recently posted a set of flowcharts made by Marylyn, Jason, and Tricia, to help choose which statistical analysis to use for nearly any situation. UCLA has a very similar chart with links to Stata code, examples, and annotated output for every method they mention. Also, check out their Stata help homepage and Stata learning modules. They're excellent resources for learning how to use Stata for data analysis!

What statistical analysis should I use? (UCLA Stat Computing)

This blog has moved!