Friday, August 28, 2009

Convert PLINK output to CSV

I tip my hat to Will for showing me this little command line trick. PLINK's output looks nice when you print it to the screen, but it can be a pain to load into Excel or a MySQL database because the fields are separated by a variable number of spaces. This trick converts a variable-space-delimited PLINK output file into a comma-delimited file.

You need to be on a Linux/Unix machine to do this. Here's the command. I'm looking at results from Mendelian errors here. Replace "mendel" with the results file you want to reformat, and put this all on one line.

cat mendel.txt | sed -r 's/^\s+//g' | sed -r 's/\s+/,/g' > mendel.csv

You'll have created a new file called mendel.csv that you can now open directly in Excel or load into a database more easily than you could with the default output.

Before:

turnersd@provolone~: cat mendel.txt
FID PAT MAT CHLD N
1089 16223073 16223062 1 149
1116 16233564 16233589 1 114
123 16230767 16230725 2 189
12 16221778 16221803 1 116
12 16221805 16221822 1 98
12 16230486 16230496 1 76
12 16231205 16232111 2 180
134 16222939 16222945 2 140
1758 16230755 16231121 2 206

After:

turnersd@provolone~: cat mendel.csv
FID,PAT,MAT,CHLD,N
1089,16223073,16223062,1,149
1116,16233564,16233589,1,114
123,16230767,16230725,2,189
12,16221778,16221803,1,116
12,16221805,16221822,1,98
12,16230486,16230496,1,76
12,16231205,16232111,2,180
134,16222939,16222945,2,140
1758,16230755,16231121,2,206
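
If your sed does not accept the -r option or \s (both are extensions rather than standard sed features), the same conversion can be sketched with awk instead. This is an alternative technique, not the command from this post, and to_csv is just a made-up helper name for illustration.

```shell
# to_csv is a hypothetical helper name used only for this sketch.
to_csv() {
  # Assigning $1 to itself makes awk rebuild the line with OFS (a comma)
  # between fields; leading and trailing whitespace is dropped in the process.
  awk -v OFS=',' '{ $1 = $1; print }' "$@"
}

# Example usage, assuming mendel.txt is in the current directory:
#   to_csv mendel.txt > mendel.csv
```

Unlike the sed version, this also drops whitespace at the end of each line, so a line with trailing spaces does not end in a comma.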


If you're interested in the details of what this is doing here you go:

First, you cat the contents of the file and pipe it to a command called sed. The thing between the single quotes is a sed substitution command built around a regular expression; it works much like find-and-replace in MS Word. It searches for the pattern between the first and second slashes and replaces each match with whatever sits between the second and third slashes. The "s" before the first slash means substitute, the "g" after the last slash means replace every match on the line rather than just the first, and the -r option turns on extended regular expressions so that + works without a backslash.

/^\s+// is the first regular expression. \s is special: it matches a single whitespace character, and \s+ matches one or more of them in a row. The ^ means only look for it at the beginning of the line. Notice there is nothing between the second and third slashes, so the leading whitespace is replaced with nothing. This trims whitespace from the beginning of the line, which is important because in the next part we're turning any remaining whitespace into a comma, so we don't want the line to start with a comma.

/\s+/,/ is the second regular expression. Again we're searching for a variable amount of whitespace but this time replacing it with a comma.
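
The two steps can be tried one at a time on a couple of made-up lines. This is only a sketch: the sample values are invented, and it assumes GNU sed, as in the command above.

```shell
# Sample input with leading and variable-width whitespace (made-up values).
sample='  FID     PAT   N
   12   16221778  116'

# Step 1 alone: strip leading whitespace from each line.
step1=$(printf '%s\n' "$sample" | sed -r 's/^\s+//')

# Step 2 on the result: turn each remaining run of whitespace into one comma.
step2=$(printf '%s\n' "$step1" | sed -r 's/\s+/,/g')

printf '%s\n' "$step2"
```

Running step 2 without step 1 would leave a comma at the start of every line that began with spaces, which is why the order matters.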

Tuesday, August 25, 2009

Great Quote from Our Own Doug Mortlock

http://www.harvardscience.harvard.edu/blog/how-dachshund-lost-its-legs

'It’s stunning to see a genetic modification like this,' developmental geneticist Douglas Mortlock of Vanderbilt University in Nashville, Tenn., says of the new study, published online July 16 in Science. 'This is the gene that makes wiener dogs short-legged.'

Thursday, August 13, 2009

Beautiful Info-graphic

While not directly related to genetics, this is an excellent example of well-designed data representation. The New York Times reports the results of a survey of average time spent on various activities through the day by different groups of people.



The graphic is essentially a stacked density plot with time (24 hours) on the X-axis. Clicking on a different group of individuals provides a very smooth transition to the new density distribution, allowing an animated visual comparison. In some ways, this animated version provides an easier comparison than showing multiple versions of the same figure. Furthermore, there is just something compelling about this figure that begs you to examine it more closely...

http://www.nytimes.com/interactive/2009/07/31/business/20080801-metrics-graphic.html?scp=3&sq=infographic&st=cse

Next-Gen Sequencing

Logan recently emailed me an article in the New York Times about single-molecule DNA sequencing, and I realized I knew next to nothing about the new and emerging technology that will change the way we do association studies (that is, if we're still even trying to find genetic associations in the first place). The Wellcome Trust posted a news feature a few weeks back giving brief explanations and short videos on DNA sequencing, starting with the old Sanger method, then the second-generation 454 and Illumina (Solexa) technologies. They also give a quick overview of, and link to, some of the third-generation technologies in the pipeline, including Pac Bio, Oxford Nanopore, and Complete Genomics.

Wellcome Trust feature on next-gen sequencing

Wednesday, August 12, 2009

Systems Biology Graphical Notation

The Systems Biology Graphical Notation (SBGN) project is an effort to standardize the graphical notation used in diagrams of pathways, biochemical processes, and cellular processes studied in systems biology.

SBGN defines a comprehensive set of symbols with precise semantics, together with detailed syntactic rules defining their use and how diagrams are to be interpreted. By standardizing the visual notation, SBGN can serve as a bridge between different communities in research, education, publishing, and more. The real payoff will come when researchers are as familiar with the notation as electronics engineers are familiar with the notation of circuit schematics. If researchers are saved the time and effort required to familiarize themselves with different notations, they can spend more time thinking about the biology being depicted.

You can view the project info here.

Tuesday, August 11, 2009

ggplot2 workshop at Vanderbilt

Hadley Wickham, creator of the previously mentioned R plotting system ggplot2 and author of a forthcoming book from Springer, is teaching a workshop in data visualization using R, ggplot2, and GGobi. Unfortunately this workshop conflicts with IGES and ASHG this year, but he mentioned the possibility of holding a workshop here at Vanderbilt if there is enough interest. Leave a comment or email me if you'd be interested in attending this workshop if it is held at Vanderbilt.

http://lookingatdata.com/

Saturday, August 8, 2009

PCG Journal Club Articles, 7/31

Here are citations for the articles discussed at our most recent meeting (July 31). Our next meeting is scheduled for August 14.
~Julia

Bogdanowicz W, Allen M, Branicki W, Lembring M, Gajewska M, Kupiec T. Genetic identification of putative remains of the famous astronomer Nicolaus Copernicus. Proc Natl Acad Sci U S A, 2009 Jul 7 [Epub ahead of print].

Green RC, Roberts JS, Cupples LA, Relkin NR, Whitehouse PJ, Brown T, Eckert SL, Butson M, Sadovnick AD, Quaid KA, Chen C, Cook-Deegan R, Farrer LA; REVEAL Study Group. Disclosure of APOE genotype for risk of Alzheimer's disease. N Engl J Med, 2009 Jul 16; 361(3):245-54.

International Schizophrenia Consortium, Purcell SM, Wray NR, Stone JL, Visscher PM, O'Donovan MC, Sullivan PF, Sklar P. Common polygenic variation contributes to risk of schizophrenia and bipolar disorder. Nature, 2009 Aug 6; 460(7256):748-52.

Pevzner P, Shamir R. Computing has changed biology--biology education must catch up. Science, 2009 Jul 31; 325(5940):541-2.

Pomerantz MM, Ahmadiyeh N, Jia L, Herman P, Verzi MP, Doddapaneni H, Beckwith CA, Chan JA, Hills A, Davis M, Yao K, Kehoe SM, Lenz HJ, Haiman CA, Yan C, Henderson BE, Frenkel B, Barretina J, Bass A, Tabernero J, Baselga J, Regan MM, Manak JR, Shivdasani R, Coetzee GA, Freedman ML. The 8q24 cancer risk variant rs6983267 shows long-range interaction with MYC in colorectal cancer. Nat Genet, 2009 Aug; 41(8):882-4.

Rosser ZH, Balaresque P, Jobling MA. Gene conversion between the X chromosome and the male-specific region of the Y chromosome at a translocation hotspot. Am J Hum Genet, 2009 Jul; 85(1):130-4.

Tuupanen S, Turunen M, Lehtonen R, Hallikas O, Vanharanta S, Kivioja T, Björklund M, Wei G, Yan J, Niittymäki I, Mecklin JP, Järvinen H, Ristimäki A, Di-Bernardo M, East P, Carvajal-Carmona L, Houlston RS, Tomlinson I, Palin K, Ukkonen E, Karhu A, Taipale J, Aaltonen LA. The common colorectal cancer predisposition SNP rs6983267 at chromosome 8q24 confers potential to enhanced Wnt signaling. Nat Genet, 2009 Aug; 41(8):885-90. Epub 2009 Jun 28.

Wain LV, Armour JA, Tobin MD. Genomic copy number variation, human health, and disease. Lancet, 2009 Jul 25; 374(9686):340-50. Review.

Thursday, August 6, 2009

For Today’s Graduate, Just One Word: Statistics

That's the title of a good article published yesterday in the New York Times about the huge and growing demand for statisticians in the job market; statistics is becoming "the sexy job in the next 10 years," as Google's chief economist puts it. Now I just need to find one of these "don't drink and derive" t-shirts...

For Today’s Graduate, Just One Word: Statistics

Wednesday, August 5, 2009

Pubget = Pubmed on Steroids

I've used this a little bit recently. Pubget indexes essentially everything that PubMed does, except you get the PDF you're looking for right away. It has lots of other useful tools as well. I sent one email to the Pubget team and CC'd the biomedical library, and a few days later they had worked it out so that Pubget recognizes Vanderbilt's subscriptions. If you're at Vanderbilt, go to http://vanderbilt.pubget.com/; otherwise just use http://pubget.com/ and select your institution from the dropdown list, or email them if it's not there.

The one thing I've found is that they don't index things as quickly as PubMed, so you might have a hard time finding Advance Online Publications using Pubget.
Getting Genetics Done by Stephen Turner is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported License.