Thursday, April 30, 2009

Merging data from different files using R

A reader asked yesterday how you would merge data from two different files. For example, let's say you have a ped file with genotypes for individuals, and another file that had some other data for some of the individuals in the pedfile, like clinical or environmental covariates. How would you go about automatically putting the clinical data from file 2 into the appropriate rows in file 1? Without using a database, the easiest way is probably using the "merge" function in R that will do the trick with a one-line command. Here's a short tutorial to get you started.

First, start up R on the cheeses simply by typing in the uppercase letter R and hit enter. Alternatively you could use R if it's installed on your computer. Now type in this command to view a small dataset I made up. It's not a real pedfile, but it will do the trick. Make sure to put it all on one line:


You should see the dataset displayed:

indiv_id SNP1 SNP2
1 1 1 1
2 2 1 1
3 3 0 0
4 4 1 0
5 5 1 1
6 6 1 0
7 7 1 1
8 8 0 1
9 9 1 1
10 10 1 1

We have 10 individuals numbered 1-10 in a column called "indiv_id" that have genotype information at 2 loci. Now let's view another dataset that has fake clinical covariates for a few of those individuals:


Again, you will see the dataset displayed.

indiv_id cov1 cov2
1 1 1.14 74.6
2 3 4.50 79.4
3 4 0.80 48.2
4 6 1.39 68.1
5 7 3.89 84.8
6 8 0.42 86.1

In this dataset, we have clinical information on the individuals in the ped file above, but only on some of the individuals. Now, let's merge what information we have here with the ped file above.

But first, since all we did above was view the datasets, let's run those commands again, but this time storing the tables in a "data frame" (R parlance for dataset). Type in these two commands. The " <- " is an assignment. It means create a data frame called ped from the following table from the internet, and the same with a data frame called meta.

ped <- read.table("")

meta <- read.table("")

Now if you just type in the name of the data frame, e.g. ped, and hit enter, you will see the data frame displayed. Now it's time for the real magic. Type the following command, and I'll explain what it does below:


The output should look like this:

indiv_id SNP1 SNP2 cov1 cov2
1 1 1 1 1.14 74.6
2 2 1 1 NA NA
3 3 0 0 4.50 79.4
4 4 1 0 0.80 48.2
5 5 1 1 NA NA
6 6 1 0 1.39 68.1
7 7 1 1 3.89 84.8
8 8 0 1 0.42 86.1
9 9 1 1 NA NA
10 10 1 1 NA NA

This command uses R's "merge" function to combine the data frames you created above, ped and meta, into a single data frame. It does this by joining the two data frames on a common variable called "indiv_id". The all=TRUE option tells R to join the two data frames, even if one has fewer entries than another. You'll see the NA's listed for a few individuals. NA is R's way of coding missing data. If you didn't include the all=TRUE, observations with any missing data for any variable would be deleted!

Finally, if you want to save this data to a file, re-run the command above, assigning it to a variable, then use the write.table command to write it to your home directory.

superpedfile <- merge(ped,meta,by="indiv_id",all=TRUE)


To exit R, type quit(). To do this on your own data, see the help on read.table by typing ?read.table at the R command line. If anybody knows of a better way to do this post it in the comments!

Wednesday, April 29, 2009

X-cess of variants in XLMR

From a News and Views article in Nature Genetics, following up the recently mentioned sequencing effort for X-linked mental retardation:

"This study echoes old lessons from mendelian genetics that should be heard by other groups currently engaged in such large-scale sequencing surveys for discovering genes associated with human diseases. It is clear from the current work that a major challenge for this era of resequencing studies is discerning causative variation, as these will be accompanied by many other changes that each look as if they might perturb gene function. In the absence of clear functional criteria, compelling evidence of association is required to make the case that alleles are causative. As it is unlikely that there will be any simple single solution to providing functional data for all the discovered variation, the process of assigning phenotypes to newly discovered genotypes will get harder before it gets easier."

Nature Genetics: X-cess of variants in XLMR

Tuesday, April 28, 2009

Use plain text when you need to just write

When you need to focus and get some serious writing done, it may be a good idea to ditch your word processor and go with plain text instead. Save all your formatting and spell-checking to do later in one step. I just tried this out on a review article I needed to write and found it much easier to concentrate without all of MS Word's squiggly underlining, autocorrecting, autoformatting, and fancy toolbar buttons begging to be clicked. If you're using windows, Notepad works fine, but try out a tiny free program called Q10. It's a full screen text editor that by default uses easy-on-the-eyes light text on a black background. You can also run it directly from a USB stick. When you're done, just open the .txt file using Word, and from there out save it as .doc. By default it comes enabled with some silly typewriter sound when you type, but it's easy enough to turn that off.

Q10 - Full screen distraction-free text editor

Friday, April 24, 2009


I saw a demonstration of this tool at the workshop on network analysis I announced last week. Genes2Networks draws from a large background network consisting of several experimentally verified mammalian protein interaction databases. It will take a list of genes you provide as seed genes and identify all interacting genes that fall on paths through the background network between them. It then calculates a z-score for each intermediate node based on the number of links to the seed nodes and the number of total links it has in the background network, and color-codes the significant links.

It's free, and it took about 30 seconds to paste in and analyze a list of a few dozen genes I submitted. It looks like a great way to identify potential interaction partners for some genes you've already found. You can also export the networks you find to a text file that can be loaded with other software developed by this lab for more extensive statistical network analysis and visualization.


Publication in BMC Bioinformatics

Avi Ma'ayan's Lab Website

UPDATE 2009-04-27: I have slides from three one-hour lectures that Avi gave on network analysis following his seminar. Email me if you want them.

Wednesday, April 22, 2009

Sequencing is Not (Yet) the Silver Bullet

Amidst the fallout of an academic discussion over the worth of GWA studies followed by several gloomy and scathing articles in the popular press, came this paper in Nature Genetics. In summary, the investigators sequenced all the coding DNA on the X-chromosome in families affected with an evidently X-linked mental retardation phenotype. They found nearly 1000 changes that would either alter the amino acid sequence, introduce a frameshift or stop codon, or interfere with splicing, all illustrating the huge heterogeneity problem and the analytical conundrum to face when there are too many potential susceptibility/causal changes. What's more, nearly all of the protein-truncating variants they found were unique to individual families, and many of these were found in both affected and unaffected males! I believe this highlights the fact that an argument pushing for whole-genome sequencing should not omit a discussion of the analytical and interpretation issues we will have to deal with when the time comes.

Nature Genetics: A systematic, large-scale resequencing screen of X-chromosome coding exons in mental retardation

Abstract: Large-scale systematic resequencing has been proposed as the key future strategy for the discovery of rare, disease-causing sequence variants across the spectrum of human complex disease. We have sequenced the coding exons of the X chromosome in 208 families with X-linked mental retardation (XLMR), the largest direct screen for constitutional disease-causing mutations thus far reported. The screen has discovered nine genes implicated in XLMR, including SYP, ZNF711 and CASK reported here, confirming the power of this strategy. The study has, however, also highlighted issues confronting whole-genome sequencing screens, including the observation that loss of function of 1% or more of X-chromosome genes is compatible with apparently normal existence.

Tuesday, April 21, 2009

Save yourself a few keystrokes the next time you visit GGD. Going to will now automatically redirect you to the blogspot address. If you've bookmarked the site or subscribed to the site's feed with RSS, everything will still work.

Monday, April 20, 2009

Network analysis in systems biology workshop

Avi Ma'ayan, Ph.D., Assistant Professor in the Department of Pharmacology and Systems Therapeutics of Mount Sinai School of Medicine, will present "Network Analysis in Systems Biology", at 10:00 am on Wednesday, April 22 in Room 206 PRB.

Following the lecture, Dr. Ma’ayan will hold a series of one-hour interactive workshops on "Computational Methods for Analyzing Lists of Genes/Proteins and Building Networks in Systems Biology," which will highlight the software tools and databases he has helped to develop in the Systems Biology Center of New York. These computational tools will focus on modeling cellular pathways and using large-scale data in these models.

The workshop is being offered over 10 different time slots Wednesday and Thursday this week. See this link for more details about the schedule and information about registration.

Thursday, April 16, 2009

Linux Command Du Jour: time

I briefly mentioned "time" in the previously posted Linux command line cheat sheet, but I can't overstate its utility especially for ACCRE/Vampire users. The time command does exactly what it sounds like: it times exactly how long it takes to run anything at the command line. And it couldn't be easier to use. Just type the word "time" preceding what you would normally type in at the command line.

For example, instead of:

> ./sMDR run.cfg

Type this in instead:

> time ./sMDR run.cfg

The MDR analysis (or whatever other command) runs just as before, but now there will be a couple lines displayed telling you how long the command took to execute. (Depending on whether you're doing this on the local cheeses or on Vampire the output will look slightly different, but you want to look at the "real" time). If you don't have anything to run right now, fire up a terminal window and just try "time ls". Obviously it won't take long to run, but you can see how the time command works.

Where this becomes very useful is if you are preparing to run analyses on Vampire, where you must estimate the time it takes to execute a command or analysis. You can time smaller versions of a large analysis for a better estimate of how long something will take. For more information, type "man time" at the command line.

Monday, April 13, 2009

I can has power calculations?

What's your power to detect a recessive effect with an odds ratio of 1.2 for a disease with 4.2% prevalence using 1200 cases and 2900 controls? What if the allele is rare? Is it worth it, in terms of power gain, to genotype 1000 more individuals? How small of an effect can you detect with 80% power using the data you have? These questions and others can be answered by power and sample size calculations. While any respectable statistical computing software can do these, it's often far simpler and faster to use "canned" software for power analysis. While there are tons of these on the web, here are a few that I or others around here commonly use.

CaTS Power: This software from the Abecasis lab at Michigan couldn't be simpler to use. Nice GUI with clicky-boxes and sliders, made specifically for power calculations in genetic association studies using case-control study design. Has some nice extra features related to multi-stage designs, or designing penetrance tables for simulation.

PS Power: Made here at Vanderbilt by Bill Dupont in Biostatistics. Not as pretty as CaTS, but much more flexible, allowing for studies with dichotomous, continuous, or survival response measures, as well as matched designs.

G*Power: Probably the most flexible of the three mentioned here, allowing for more specific questions to be addressed, but at the cost of a slightly steeper learning curve.

Thursday, April 9, 2009

Epistasis Blog

If you're interested in gene-gene and gene-environment interaction (and who wouldn't be?), then you should check out the Epistasis Blog. Our friend and colleague Jason Moore at Dartmouth Medical School has maintained since 2005 writing about epistasis, computational genetics, and related topics. I've personally stumbled upon several interesting papers mentioned here that I may have otherwise missed.

Also, be sure to check out other posts in GGD's Noteworthy Blogs series!

Jason Moore's Epistasis Blog (

Monday, April 6, 2009

What's the Right Analysis, Part II

Will recently posted a set of flowcharts made by Marylyn, Jason, and Tricia, to help choose which statistical analysis to use for nearly any situation. UCLA has a very similar chart with links to Stata code, examples, and annotated output for every method they mention. Also, check out their Stata help homepage and Stata learning modules. They're excellent resources for learning how to use Stata for data analysis!

What statistical analysis should I use? (UCLA Stat Computing)

Friday, April 3, 2009

Gene Prospector

This abstract in BMC Bioinformatics was presented in our Computational Genetics Journal Club a few weeks back: "Gene Prospector: An evidence gateway for evaluating potential susceptibility genes and interacting risk factors for human diseases."

As described by the authors at CDC, Gene Prospector is a bioinformatics tool designed to sort, rank, and display information about genes in relation to human diseases, risk factors and other phenotypes.

While there seem to be several tools like this out there, this one was truly pleasant to use. A quick search for "Alzheimer" turned up some very familiar results. Genes are ranked by the evidence strength that were calculated based on the volume of different types of published literature in human genome epidemiology. The results of a query provide tons of helpful links, including links to information about each candidate gene found, exportable lists of journal articles where the results of association studies, GWASs, and meta-analyses were published, information about SNPs in these genes, and more.

Overall this seems like a great place to start when you want to compare your results to others, or when you just want more information about particular genotype-phenotype associations.

Gene Prospector
Creative Commons License
Getting Genetics Done by Stephen Turner is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported License.