Tuesday, March 31, 2009

Noteworthy Blog: Chun Li's Course Blog

For the second installment in our Noteworthy Blog series, take a look at Chun Li's biostatistics course blog. Several years ago, CHGR faculty member Chun Li taught a class in biostatistics, maintaining this blog over the duration of the course. Although it hasn't been updated since the course ended, it's filled with masterful guidance on data analysis issues and insightful commentary on important topics and common misconceptions about statistics.

Chun's Course Blog

Thursday, March 26, 2009

Ditch Excel for AWK!

How often have you needed to extract a certain column from a file, or strip out only a few individuals from a ped file? This is often done with Excel, but that requires loading the data, manipulating it, then saving it back into a format that is acceptable, which may require converting tabs to spaces or other aggravating things. In many circumstances, a powerful linux command, AWK, can be used to accomplish the same thing with far fewer steps (at least once you get the hang of it).

AWK is almost like its own language, and in fact portions of PERL are based on AWK. Let's say you have a file called text.txt and you want to find all lines of that file that contain the word "the":

> awk '/The/' text.txt

Or you'd like to see all lines that start with "rs":

> awk '/^rs/' text.txt

Or perhaps most usefully, you want to strip the top 5 lines out of a file:

> awk 'NR > 5' text.txt

This just scratches the surface of course... for a good tutorial with examples, check out this site:

I'll also look into setting up a post with AWK snippets for common useful procedures...


Tuesday, March 24, 2009

Write Your First Perl Script

And it will do way more than display "Hello, world!" to the screen. An anonymous commenter on one of our Linux posts recently asked how to write scripts that will automate the same analysis on multiple files. While there are potentially several ways to do this, perl will almost always get the job done. Here I'll pose a situation you may run into where perl may help, and we'll break down what this simple perl script does, line by line.

Let's say you want to run something like plink or MDR, or any other program that you have to call at the command line with a configuration or batch file that contains the details of the analysis. For this simple case, let's pretend we're running MDR on a dataset we've stratified 3 ways by race. We have three MDR config files: afr-am.cfg, caucsian.cfg, and hispanic.cfg that will each run MDR on the respective dataset. Now, we could manually run MDR three separate times, and in fact, in this scenario that may be easier. But when you need to run dozens, or maybe hundreds of analyses, you'll need a way to automate things. Check out this perl script, and I'll explain what it's doing below. Fire up something like nano or pico, copy/paste this, and save the file as "runMDR.pl"

foreach $current_cfg (@ARGV)
# This will run sMDR
`./sMDR $current_cfg`;
# Hooray, we're done!

Now, if you call this script from the command line like this, giving the config files you want to run as arguments to the script, it will run sMDR on all three datasets, one after the other:

> perl runMDR.pl afr-am.cfg caucasian.cfg hispanic.cfg

You could also use the askerisk to pass everything that ends with ".cfg" as an argument to the script:

> perl runMDR.pl *.cfg

Okay, let's break this down, step by step.
  1. First, some syntax. Perl ignores everything on a line after the # sign, so this way you can comment your code, so you can remember what it does later. The little ` things on the 4th line are backticks. Those are usually above your tab key on your keyboard. And that semicolon is important.
  2. @ARGV is an array that contains the arguments that you pass to the program (the MDR config files here), and $current_config is a variable that assumes each element in @ARGV, one at a time.
  3. Each time $current_config assumes a new identity, perl will execute the code between the curly braces. So the first time, perl will execute `./sMDR afr-am.cfg`; The stuff between the backticks is executed exactly as if you were typing it into the shell yourself. Here, I'm assuming you have sMDR and afr-am.cfg in the current directory.
  4. Once perl executes the block of code between the braces for each element of @ARGV, it quits, and now you'll have results for all three analyses.
A few final thoughts... If the stuff you're automating is going to take a while to complete, you may consider checking out Greg's previous tutorial on screen. Next, if whatever program you're running over and over again displays output to the screen, you'll have to add an extra line to see that output yourself, or write that output to a file. Also, don't forget your comments! Perl can be quick to write but difficult to understand later on, so comment your scripts well. Finally, if you need more help and you can't find it here or here, many of the folks on this hall have used perl for some time, so ask around!

Monday, March 23, 2009

Screen How-to

`screen` is one of the most useful system tools I've ever used. It allows you to begin a process and then detach from the process to let it continue to run in the background. You can later re-attach to the screen to check progress, even from a remote location.

To install screen on a Debian-based system, you would do:

apt-get install screen

or for a RedHat-based system:

yum install screen

Some usage examples:

To start a job inside a screen, a Ruby script for example, you would issue a command something like this:

> screen my_script.rb

The job begins to run in exactly the same manner if you had not used a screen.

To detach from the screen you would use these keystrokes:

> Ctrl + a, Ctrl + d.

After this the Ruby script is still running.

To reconnect to the running screen you type:

> screen -dr

The 'd' means detach and the 'r' means re-attach to the current terminal.

If you have lots of screens running, you'll be prompted to specify which screen you want to re-attach to your terminal. For example:

> screen -dr
There are several suitable screens on:
5190.pts-0.pluto (03/18/2009 01:36:22 PM) (Detached)
5134.pts-0.pluto (03/18/2009 01:14:08 PM) (Detached)

Type "screen [-d] -r [pid.]tty.host" to resume one of them.

At this point you have to specify like this:

screen -dr 5190.pts-0.pluto


screen -dr 5134.pts-0.pluto

I find screen most useful for starting long-running jobs on remote servers. I can start the job, then log out and let it run without any worries of what my local system is doing. I can reboot or log off without any issues. Later I can re-attach the screen to my terminal to check progress as required. When the job is done, so is the screen, they are self-cleaning. :)

More info can be found here:


Noteworthy Blog: The Spittoon

For those of you who attend our computational genetics journal club every other week, you've all heard about this. Say what you will about the "consumer genetics" enterprise, 23andMe maintains an excellent blog. In their "SNPwatch" category, The Spittoon surveys and summarizes the latest findings in human genetics research before they hit the press. About 50% of their content comes from Nature Genetics advance online, and the rest from a smattering of other journals. They usually offer a one-page summary of the research findings detailing the associated SNP's rs-number, risk allele, odds ratio estimate, and the sample size used.

The Spitoon

Monday, March 16, 2009

What's the Right Analysis for this Data?

If you've ever had trouble getting started doing a data analysis, you are certainly not alone. Should I run an ANOVA, MANOVA, ANCOVA, or MANCOVA? Should that have been a McNemar's test, a Kruskal-Wallis, or a Mann-Whitney U?

To at least pin down the statistical test you should run, consult these excellent flowcharts by Marylyn Ritchie, Jason Moore, and Tricia Thornton-Wells. They have been stuck to my wall for years, and whenever I have a new type of data to look at, I consult them to get headed in the right direction!


UPDATE 2009-04-06: Be sure to check out the follow up to this post, What's the Right Analysis part II, for a link to example applications of these methods with the appropriate Stata code.

Sunday, March 15, 2009

SNPper: quickly get info about a list of SNPs

Many of you have used this before, but for those who haven't, SNPper is a convenient little web application for quickly annotating results. So you've done your association analysis and have a list of rs-numbers you'd like to quickly get more information about. You could always look these up on the UCSC genome browser or in dbSNP, but if you have hundreds and you want some quick info such as what genes they're in, what genomic position they occupy, whether they're exonic, nonsynonymous, in a promoter region, etc., try SNPper. You can register for free, or use their guest access which doesn't require registration. Paste a list of rs-numbers into their box, and click find. It also gives you some nice options for exporting the information to a file. They also offer other bioinformatics tools, such as gene ontology browsing and transcription factor binding site identification.


Tuesday, March 10, 2009

Linux Tutorial

Last week I posted a one-page reference guide that gives a short description of the Linux commands I most commonly use. To accompany this, here is a detailed walkthrough filled with examples that will introduce any beginner to the basics of using Linux in just a few hours. Many thanks to Eric Torstenson in the Ritchie lab for putting this together.

Linux Tutorial (PDF)

Friday, March 6, 2009

Linux Command Line Cheat Sheet

Whenever we have new students rotating through our lab who've never used Linux I always end up scrounging around the world wide series-of-tubes only to find some command line reference that's not really useful for students. So I made my own one-page reference guide that gives a basic description of 99% of the commands most of you will use. Feel free to distribute.

Linux / Unix Command Line Cheat Sheet (PDF)

Monday, March 2, 2009

Pubmed Searches as an RSS feed

As Stephen nicely posted earlier, RSS feeds are a very powerful way to keep up with the literature -- they "push" the information to you. In addition to subscribing to individual journals, you can subscribe to a PubMed search! This will let you keep up with ALL PubMed indexed journals.

To subscribe to a PubMed search, first go to www.pubmed.org and enter your search terms. Once you retrieve a search listing, you'll see a bar that says

Display Summary Show 20 Sort By Send to

The SEND TO drop down box will allow you to select an RSS Feed. Once you select this, you'll be taken to a page with a button that says "Create Feed". When you click this, you'll get a new page with a little orange XML button. Click it and your browser will give you the option to subscribe to the feed. Once you subscribe, there are lots of ways to read RSS Feeds, which we'll probably get to in another post.

Creative Commons License
Getting Genetics Done by Stephen Turner is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported License.