
Tuesday, October 14, 2014

Operate on the body of a file but not the header

Sometimes you need to run some UNIX command on a file but only want to operate on the body of the file, not the header. Create a file called body somewhere in your $PATH, make it executable, and add this to it:
#!/bin/bash
# Read the first line (the header) verbatim and print it unchanged,
# then run the given command on the rest of the input (the body).
IFS= read -r header
printf '%s\n' "$header"
eval "$@"
Now, when you need to run something but ignore the header, use the body command first. For example, we can create a simple data set with a header row and some numbers:
$ echo -e "header\n1\n5\n4\n7\n3"
header
1
5
4
7
3
We can pipe the whole thing to sort:
$ echo -e "header\n1\n5\n4\n7\n3" | sort
1
3
4
5
7
header
Oops, we don’t want the header to be included in the sort. Let’s use the body command to operate only on the body, skipping the header:
$ echo -e "header\n1\n5\n4\n7\n3" | body sort
header
1
3
4
5
7
Sure, there are other ways to solve the problem with sort, but body will solve many more problems. If you have multiple header rows, just call body multiple times.
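For instance, with two made-up header rows:
$ echo -e "head1\nhead2\n2\n3\n1" | body body sort
head1
head2
1
2
3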
Inspired by this post on Stack Exchange.

Wednesday, September 10, 2014

GNU datamash

GNU datamash is a command-line utility that offers simple calculations (e.g. count, sum, min, max, mean, stdev, string coalescing) as well as a rich set of statistical functions, to quickly assess information in textual input files or from a UNIX pipe. Here are a few examples:
For example, let's use the seq command to generate some data, then use datamash sum to add up all the values in the first column:
$ seq 5
1
2
3
4
5
$ seq 5 | datamash sum 1
15
What else? Let's calculate the mean, 1st quartile, median, 3rd quartile, IQR, sample standard deviation, and the p-value of the Jarque-Bera test for normality, using some data in file.txt:
$ cat file.txt | datamash -H mean 1 q1 1 median 1 q3 1 iqr 1 sstdev 1 jarque 1
mean(x)   q1(x)  median(x)  q3(x)   iqr(x)  sstdev(x)  jarque(x)
45.32     23     37         61.5    38.5    30.4487    8.0113e-09
Go take a look at more examples, specifically the examples using datamash to parse a GTF file. Datamash can very quickly do things like find the number of isoforms per gene by grouping (collapsing) over gene IDs and counting the number of transcripts, find the number of genes with more than 5 isoforms, find genes transcribed from multiple chromosomes, examine variability in exon counts per gene, etc.
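For a flavor of how those GTF examples work, here's a minimal sketch of the grouping idea. It assumes you've already pulled the gene and transcript IDs out of the GTF into a two-column, tab-delimited file I'm calling genes.txt (gene ID in column 1, transcript ID in column 2 -- the file name and layout are mine, not datamash's):

$ datamash -s -g 1 countunique 2 < genes.txt
$ datamash -s -g 1 countunique 2 < genes.txt | awk '$2 > 5' | wc -l

The first command reports the number of distinct transcripts (isoforms) per gene; the second counts how many genes have more than 5.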

Wednesday, March 12, 2014

Software Carpentry at UVA, Redux


Software Carpentry is an international collaboration backed by Mozilla and the Sloan Foundation comprising a team of volunteers that teach computational competence and basic programming skills to scientists. In addition to a suite of online lessons, Software Carpentry also runs two-day on-site bootcamps to teach researchers skills such as using the Unix shell, programming in Python or R, using Git and GitHub for version control, managing data with SQL, and general programming best practices.

It was just over a year ago when I organized UVA's first bootcamp. Last year we reached our 50-person registration limit and had nearly 100 people on the wait list in less than two days. With support from the Center for Public Health Genomics, the Health Sciences Library, and the Library's Research Data Services, we were able to host another two-day bootcamp earlier this week (we maxed out our registration limit this year as well). A few months ago I started Software Carpentry's training program, which teaches scientists how to teach other scientists how to program. It was my pleasure to be an instructor at this year's bootcamp along with Erik Bray and Mike Hansen.



Erik kicked off day one with a short introduction to what Software Carpentry is all about as well as setting the stage for the rest of the bootcamp -- as more fields of research become increasingly data rich, computational skills become ever more critical.

I started the morning's lessons on using the Unix shell to get more stuff done in less time. Although there were still a few setup hiccups, things went a lot smoother this year because we provided a virtual machine with all of the necessary tools pre-installed.

We spent the rest of the morning and early afternoon going over version control with Git and collaboration using GitHub. I started out with the very basics -- the hows and whys of using version control, staging, committing, branching, merging, and conflict resolution. After lunch Erik and I did a live demonstration of two different modes of collaboration using GitHub. In the first, I pushed to a repo on GitHub and gave Erik full permissions to access and push to this repo. Here, we pushed and pulled to and from the same repo, and demonstrated what to do in case of a merge conflict. In the second demonstration we used the fork and pull model of collaboration: I created a new repo, Erik forked this, made some edits (using GitHub's web-based editor for simplicity), and submitted a pull request. After the demo, we had participants go through the same exercise -- creating their own repos with feedback about the course so far, and submitting pull requests to each other.

With the remaining hours in the afternoon, Erik introduced Python using the IPython notebook. Since most people were using the virtual machine we provided (or had already installed Anaconda), we had very minimal Python/IPython/numpy version and setup issues that may have otherwise plagued the entire bootcamp (most participants were using Windows laptops). By the end of the introductory python session, participants were using Python and NumPy to simulate logistic population growth with intermittent catastrophic population crashes, and using matplotlib to visualize the results.



Next, Mike introduced the pandas data analysis library for Python, also using an IPython notebook for teaching. In this session, participants used pandas to import and analyze a year's worth of weather data from Weather Underground. Participants imported a CSV file, cleaned up the data, parsed dates written as text to create python datetime objects, used the apply function to perform bulk operations on the data, learned how to handle missing values, and synthesized many of the individual components taught in this and the previous session to partition out and perform summary operations on subsets of the data that matched particular criteria of interest (e.g., "how many days did it rain in November when the minimum temperature ranged from 20 to 32 degrees?").

Erik wrapped up the bootcamp with a session on testing code. Erik introduced the concept of testing by demonstrating the behavior of a function without revealing the source code behind it. Participants were asked to figure out what the function did by writing various tests with different input. Finally, participants worked in pairs to implement the function such that all the previously written tests would not raise any assertion errors.

Overall, our second Software Carpentry bootcamp was a qualitative success. The fact that we maxed out registration and filled a wait list within hours two years in a row demonstrates the overwhelming need for such a curriculum for scientists. Science across nearly every discipline is becoming ever more quantitative; researchers are realizing that to be successful, not only do you need to be a good scientist, a great writer, an eloquent speaker, a skilled graphic designer, a clever marketer, an efficient project manager, etc., but you'll also need to know some programming and statistics. This week represented the largest Software Carpentry event ever, with simultaneous bootcamps at the University of Virginia, Purdue, New York University, UC Berkeley, and the University of Washington. I can only imagine this trend will continue for the foreseeable future.

Thursday, January 30, 2014

GNU Screen

This is one of those things I picked up years ago while in graduate school that I just assumed everyone else already knew about. GNU screen is a great utility built in to most Linux installations for remote session management. Typing 'screen' at the command line starts a new screen session. Once launched, you can start processes in the screen session, detach the session with Ctrl-a then d, then reattach at a later point and resume where you left off. I also made a short screencast demonstrating this.
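The basic workflow looks something like this (the session name and script are just placeholders):

$ screen -S myjob              # start a named session
$ ./long_running_job.sh        # kick something off inside it
# press Ctrl-a then d to detach; the job keeps running
$ screen -r myjob              # later, reattach and pick up where you left off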

Monday, January 13, 2014

How To Install BioPerl Without Root Privileges

I've seen this question asked and partially answered all around the web. As with anything related to Perl, I'm sure there is more than one way to do it. Here's how I do it with Perl 5.10.1 on CentOS 6.4.

First, install local::lib using the bootstrapping method as described here.
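Roughly, the bootstrap looks like this (a sketch; the tarball name depends on whichever local::lib version you download from CPAN, and everything installs into ~/perl5 by default):

tar -xzf local-lib-x.xx.tar.gz      # whichever version you downloaded
cd local-lib-x.xx
perl Makefile.PL --bootstrap
make test && make install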





Next, put this in your .bashrc so that it's executed every time you log in:
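That line is local::lib's standard shell setup (adjust the path if you bootstrapped somewhere other than the default ~/perl5):

eval "$(perl -I$HOME/perl5/lib/perl5 -Mlocal::lib)"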



Log out then log back in, then download and install BioPerl, answering "yes" to any question asking you to download and install dependencies when necessary:
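With local::lib active, CPAN installs now land in your home directory rather than the system Perl. One way to kick off the install (a sketch; the BioPerl docs describe the same approach via the interactive CPAN shell):

perl -MCPAN -e 'install Bio::Perl'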



Monday, October 21, 2013

Useful Unix/Linux One-Liners for Bioinformatics

Much of the work that bioinformaticians do is munging and wrangling around massive amounts of text. While there are some "standardized" file formats (FASTQ, SAM, VCF, etc.) and some tools for manipulating them (fastx toolkit, samtools, vcftools, etc.), there are still times where knowing a little bit of Unix/Linux is extremely helpful, namely awk, sed, cut, grep, GNU parallel, and others.

This is by no means an exhaustive catalog, but I've put together a short list of examples using various Unix/Linux utilities for text manipulation, from the very basic (e.g., sum a column) to the very advanced (munge a FASTQ file and print the total number of reads, total number of unique reads, percentage of unique reads, most abundant sequence, and its frequency). Most of these examples (with the exception of the SeqTK examples) use built-in utilities installed on nearly every Linux system. These examples are a combination of tactics I use every day and examples culled from other sources listed at the top of the page.
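For a taste of the basic end of that spectrum, summing a column takes a single awk call (the file name and column are just for illustration):

awk '{sum+=$1} END {print sum}' myfile.txt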



The list is available as a README in this GitHub repo. This list is a start - I would love suggestions for other things to include. To make a suggestion, leave a comment here, or better - open an issue, or even better still - send me a pull request.

Useful one-liners for bioinformatics: https://github.com/stephenturner/oneliners

Alternatively, download a PDF here.

Friday, March 9, 2012

find | xargs ... Like a Boss

*Edit March 12* Be sure to look at the comments, especially the commentary on Hacker News - you can supercharge the find|xargs idea by using find|parallel instead.
---

Do you ever discover a trick to do something better, faster, or easier, and wish you could reclaim all the wasted time and effort before your discovery? I've had this happen many times - learning how to use vim, learning about querying relational databases, switching from EndNote to Mendeley, etc.

Now I wish I could reclaim all the time I spent writing perl code to do something that find|xargs can do much more easily. I'll lead you through an example using a problem I ran into and solved easily with find|xargs.

The problem:

I'm doing an RNA-seq experiment where I have three replicates {1,2,3} for two conditions {C,M}. I've run Tophat to align the raw reads to a reference genome, and saved the output of each run to a separate directory. Within those directories I have an alignment (accepted_hits.bam) and some other stuff.



Now, I want to assemble transcripts separately for each alignment. All the alignments are 1 directory deep in the tophat output directory. Furthermore, I want to submit a separate job for each assembly onto our cluster using qsub. In a former life I would have wrapped a perl script around this to write shell scripts that the scheduler would run, and then write a second perl script that would submit all the jobs to the scheduler.

Here's a command that will do it all in one go:

PROCS=8; find `pwd` -name "accepted_hits.bam" | xargs -i echo qsub -l ncpus=$PROCS -- `which cufflinks` -p $PROCS -o {}-cufflinks -g genes.gtf {} | sh

So let's break this down one piece at a time to see what's going on here.

First the find command. If I want to find all the files in the current directory recursively through any subdirectory, I can use the find command with the -name argument. In the example below, find searches "." (the current directory) and returns the path (relative to ".") for each file it finds.
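For example (the directory names are invented to match the experiment described above):

$ find . -name "accepted_hits.bam"
./C1_tophat_out/accepted_hits.bam
./C2_tophat_out/accepted_hits.bam
./C3_tophat_out/accepted_hits.bam
./M1_tophat_out/accepted_hits.bam
./M2_tophat_out/accepted_hits.bam
./M3_tophat_out/accepted_hits.bam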



But if I'm submitting jobs to a scheduler, I want to use the full absolute path. You can modify find's output by telling it to search in "." but give it the full path. Even better, tell find to search in `pwd` (those are backticks, usually above the tab key). The shell will run the pwd command and insert that output into the find command. See below.
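Same idea, but now the results come back as absolute paths (the leading directory is illustrative):

$ find `pwd` -name "accepted_hits.bam"
/home/turnersd/rnaseq/C1_tophat_out/accepted_hits.bam
/home/turnersd/rnaseq/C2_tophat_out/accepted_hits.bam
(...and so on for the other alignments)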



Now, here's where xargs comes in. xargs lets you build new commands based on the standard input. In other words, you can build a command to run on each line that gets piped to xargs. Use the -i option to create a command for each individual line on the pipe, and use {} as a placeholder. The example below should help.
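Something like this, with a deliberately fake command (paths illustrative; one echoed line per alignment found):

$ find `pwd` -name "accepted_hits.bam" | xargs -i echo somecommand {}
somecommand /home/turnersd/rnaseq/C1_tophat_out/accepted_hits.bam
somecommand /home/turnersd/rnaseq/C2_tophat_out/accepted_hits.bam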



So here's what's going on. In the first step, I'm piping the output of find into xargs, and xargs is creating a new command that will echo "somecommand {}", where {} is a placeholder for whatever gets piped into xargs. So you can replace somecommand with anycommand -any -options -you -want, and echo all that back out to STDOUT.

The second part simply pipes whatever was echoed to STDOUT into the shell ( | sh). The shell will then run each command it receives on the pipe. Since somecommand doesn't exist, I'm getting an error.

Below, I'm building the command to run cufflinks on each alignment, and dump that output back out to a new directory based on the pattern of the alignment (bam) file name. But since in the next step I want to parallelize this on the cluster, and the cluster won't know that cufflinks is in my path, I need to tell it where to find cufflinks. I could give it the path, but I would rather use the backtick trick I showed you above to let the shell tell the shell where cufflinks resides by using `which cufflinks`.
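That dry run (still just echoing, not executing) looks roughly like this; the cufflinks path and alignment paths are illustrative:

$ find `pwd` -name "accepted_hits.bam" | xargs -i echo `which cufflinks` -o {}-cufflinks -g genes.gtf {}
/usr/local/bin/cufflinks -o /home/turnersd/rnaseq/C1_tophat_out/accepted_hits.bam-cufflinks -g genes.gtf /home/turnersd/rnaseq/C1_tophat_out/accepted_hits.bam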



In the final piece, I'm adding a few more components.



First, I'm setting a shell variable called PROCS. I can access this variable later by using $PROCS. This is how many CPUs I want to allocate to each assembly job. The ";" separates two commands. Instead of using xargs to build a cufflinks command to run at the shell directly, I'm using xargs to build a qsub command that will qsub these jobs to the cluster. To qsub something from the command line, the syntax is

qsub [qsub options] -- myprog -o -p -t arg1 arg2

Where [qsub options] are the PBS directives to qsub (like the number of processors I want, the amount of RAM I require, etc.), myprog is the program you're running, -o, -p, and -t are options to myprog, and arg1 and arg2 are the arguments to myprog. The two hyphens are required between the qsub options and the command you actually want to run.

So in the command above, I'm using xargs to build up a qsub command, substituting $PROCS (8 here) for the number of CPUs I require, calling cufflinks from the full path using the backtick trick, telling cufflinks that I want $PROCS (8) processors, use genes.gtf for reference annotation based transcript (RABT) assembly, and run all that cufflinks stuff on the supplied alignment.

In summary, this one-liner will submit six 8-processor jobs. It sure beats writing perl scripts. There are a zillion other uses for piping together find with xargs. If you have a useful one-liner, please share it in the comments. Happy piping!

Thursday, April 21, 2011

How To Get Allele Frequencies and Create a PED file from 1000 Genomes Data

I recently analyzed some next-generation sequencing data, and I first wanted to compare the frequencies in my samples to those in the 1000 Genomes Project. It turns out this is much easier than I thought, as long as you're a little comfortable with the Linux command line.

First, you'll need a Linux system, and two utilities: tabix and vcftools.

I'm virtualizing an Ubuntu Linux system in VirtualBox on my Windows 7 machine. I had a little trouble compiling vcftools on my Ubuntu system out of the box. Before trying to compile tabix and vcftools I'd recommend installing the GNU C++ compiler and the development version of the zlib compression library, zlib1g-dev. This is easy in Ubuntu. Just enter these commands at the terminal:

sudo apt-get install g++
sudo apt-get install zlib1g-dev

First, download tabix. I'm giving you the direct link to the most current version as of this writing, but you might go to the respective sourceforge pages to get the most recent version yourself. Use tar to unpack the download, go into the unzipped directory, and type "make" to compile the executable.

wget http://sourceforge.net/projects/samtools/files/tabix/tabix-0.2.3.tar.bz2
tar -jxvf tabix-0.2.3.tar.bz2
cd tabix-0.2.3/
make

Now do the same thing for vcftools:

wget http://sourceforge.net/projects/vcftools/files/vcftools_v0.1.4a.tar.gz
tar -zxvf vcftools_v0.1.4a.tar.gz 
cd vcftools_0.1.4a/
make

The vcftools binary will be in the cpp directory. Copy both the tabix and vcftools executables to wherever you want to run your analysis.

Let's say that you wanted to pull all the 1000 genomes data from the CETP gene on chromosome 16, compute allele frequencies, and drop a linkage format PED file so you can look at linkage disequilibrium using Haploview.

First, use tabix to hit the 1000 Genomes FTP site, pulling data from the 20100804 release for the CETP region (chr16:56,995,835-57,017,756), and save that output to a file called genotypes.vcf. Because tabix doesn't download the entire 1000 Genomes data set and pulls only the sections you need, this is extremely fast. This should take around a minute, depending on your web connection and CPU speeds.

./tabix -fh ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20100804/ALL.2of4intersection.20100804.genotypes.vcf.gz 16:56995835-57017756 > genotypes.vcf

Not too difficult, right? Now use vcftools (which works a lot like plink) to compute allele frequencies. This should take less than one second.

./vcftools --vcf genotypes.vcf --freq --out allelefrequencies

Finally, use vcftools to create a linkage format PED and MAP file that you can use in PLINK or Haploview. This took about 30 seconds for me.

./vcftools --vcf genotypes.vcf --plink --out plinkformat

That's it. It looks like you can also dig around in the supporting directory on the FTP site and pull out genotypes for specific ethnic groups as well (EUR, AFR, and ASN HapMap populations).

Monday, April 11, 2011

Monday Links: 23andMe, RStudio, PacBio+Galaxy, Data Science One-Liners, Post-Linkage RFA, SSH

Lately I haven't written as many full length posts as usual, but here's a quick roundup of a few links I've shared on Twitter (@genetics_blog) over the last week:

First, 23andMe is having a big DNA Day Sale ($108) for the kit + 1 year of their personal genome subscription service https://www.23andme.com/.

Previously mentioned R IDE RStudio released a new beta (version 0.93) that includes several new features and bugfixes. Among the new features, my favorites include the ability to customize the pane layout, and the included {manipulate} package that facilitates interactive plotting: http://bit.ly/gR9ir8.

The Pacific Biosciences open-source Analysis suite is out: http://j.mp/f9QP0F (genomeweb subscribers only), and you can watch a video tutorial (free) on Galaxy-PacBio integration here: http://j.mp/i1FzEm.

If you already have existing genotype/phenotype data on families you've collected, you might check out this new RFA from NHLBI and NHGRI: “Life After Linkage: the Future of Family Studies” http://1.usa.gov/dR9gHb: This FOA issued by NHLBI and NHGRI solicits applications which integrate novel molecular data with existing genotype and phenotype data in families to identify and characterize genes influencing complex disorders. Applications for this program should have previously examined families with some genotyping and phenotypes available. Proposals may utilize focused ascertainment from families, for example, using risk profile, extreme phenotypes or families contributing heavily to genomic signals. The program may support the addition of targeted or whole genome sequencing, exon sequencing, copy number variants (CNVs), expression profiling, metabolomics, epigenomics, and/or novel biomarkers. This may require additional sample collection and/or reconsent. Applications should include a strong analytic component focused on bioinformatics and integration of multiple data types.

Data Science Hand Tools http://oreil.ly/hyhb6Z. This piece includes nice demonstrations of using grep, awk, cut, colrm, sort, uniq, wc, find, xargs, and other simple and flexible UNIX one-liners to do some serious data munging.

A new paper from Lucia Hindorff, Elizabeth Gillanders, and Teri Manolio at NHGRI: Genetic architecture of cancer & complex diseases: Lessons learned & future directions http://1.usa.gov/h28J2A

A Bioinformatics special issue includes a compilation of all Bioinformatics papers on Next-Generation Sequencing http://bit.ly/DcfAm

And finally, simple instructions for setting up password-free SSH login http://linuxproblem.org/art_9.html

Wednesday, April 21, 2010

Unix and Perl for Biologists

This looks like a must-read for anyone starting out in computational biology without extensive experience at the command line.  The 135-page document linked at the bottom of the Google Group page looks like an excellent primer with lots of examples that could probably be completed in a day or two, and provides a great start for working in a Linux/Unix environment and programming with Perl. This started out as a graduate student course at UC Davis, and is now freely available for anyone who wants to learn Unix and Perl. Also, don't forget about the printable linux command line cheat sheet I posted here long ago.

Google Groups: Unix and Perl for Biologists

Monday, December 7, 2009

Use PuTTY and XMing to see Linux graphics via SSH on your Windows computer

Do you use SSH to connect to a remote Linux machine from your local Windows computer? Ever needed to run a program on that Linux machine that displays graphical output, or uses a GUI? I was in this position last week trying to make figures using ggplot2 in R of results from an analysis of GWAS data, which required a 64-bit Linux machine with more RAM than my 32-bit Windows machine can see.

If you try plotting something in R on a Linux machine in an SSH session, you'll get this nasty error message:

Error in function (display = "", width, height, pointsize, gamma, bg,: 
X11 I/O error while opening X11 connection to 'localhost:10.0'

Turns out there's a very easy way to see graphical output over your SSH terminal. First, if you're not already using PuTTY for SSH, download putty.exe from here. Next, download, install, and run Xming. While Xming is running in your system tray, log into the Linux server as you normally would using PuTTY. Then type this command at the terminal to log into the Linux server of your choice (here, pepperjack), with -X (uppercase) to enable X11 forwarding.

ssh -X pepperjack.mc.vanderbilt.edu

If all goes well you should now be able to use programs with graphical output or interfaces, which run on the remote Linux machine rather than on your local Windows computer.




Xming - PC X Server

Xming download link on SourceForge

Wednesday, September 9, 2009

Sync your home directories on ACCRE and the local Linux servers (a.k.a. "the cheeses")

Vanderbilt ACCRE users with PCs only...

If you use ACCRE to run multi-processor jobs you'll be glad to know that they now allow you to map your home directory to your local desktop using Samba (so you can access your files through My Computer as you normally would with local files).  Just submit a help request on their website and they'll get you set up.

Now if you have both your ACCRE home and your home on the cheeses mapped, you can easily sync the files between the two.  Download Microsoft's free SyncToy to do the job.  It's pretty dead simple to set up, and one click will synchronize files between the two servers.


I didn't want to synchronize everything, so I set it up to only sync directories that contain perl scripts and other programs that I commonly use on both machines. SyncToy also seems pretty useful for backing up your files.

Microsoft SyncToy

Ask ACCRE to let you map your home

Tuesday, September 8, 2009

Get the full path to a file in Linux / Unix

In the last post I showed you how to point to a file in Windows and get the full path copied to your clipboard. I wanted to come up with something similar for a Linux environment. This is helpful on Vampire/ACCRE because you have to fully qualify the path to every file you use when you submit jobs with PBS scripts. So I wrote a little perl script:

#!/usr/bin/perl
# Print the full path to each file given as an argument,
# or just the current working directory if there are no arguments.
chomp($pwd=`pwd`);
print "$pwd\n" if @ARGV==0;
foreach (@ARGV) {print "$pwd/$_\n";}

You can copy this from me; just put it in your bin directory, like this:

cp /home/turnersd/bin/path ~/bin

Make it executable, like this:

chmod +x ~/bin/path

Here it is in action. Let's say I wanted to print out the full path to all the .txt files in the current directory. Call the program with the files you want paths for as arguments:


[turnersd@vmps21 replicated]$ ls
parseitnow.pbs
parsing_program.pl
replic.txt
tophits.txt
 
[turnersd@vmps21 replicated]$ path *.txt
/projects/HDL/epistasis/replicated/replic.txt
/projects/HDL/epistasis/replicated/tophits.txt


Sure, it's only a little bit quicker than typing pwd, copying that, then spelling out the filenames.  But if you have long filenames or lots of filenames you want to copy, this should get things done faster.  Enjoy.

Friday, August 28, 2009

Convert PLINK output to CSV

I tip my hat to Will for showing me this little command line trick. PLINK's output looks nice when you print it to the screen, but it can be a pain to load the output into Excel or a MySQL database because all the fields are separated by a variable number of spaces. This one-liner will convert a variable-space-delimited PLINK output file to a comma-delimited file.

You need to be on a Linux/Unix machine to do this. Here's the command. I'm looking at results from Mendelian errors here. Replace "mendel" with the results file you want to reformat, and put this all on one line.

cat mendel.txt | sed -r 's/^\s+//g' | sed -r 's/\s+/,/g' > mendel.csv

You'll have created a new file called mendel.csv that you can now open directly in Excel or load into a database more easily than you could with the default output.

Before:

turnersd@provolone~: cat mendel.txt
FID PAT MAT CHLD N
1089 16223073 16223062 1 149
1116 16233564 16233589 1 114
123 16230767 16230725 2 189
12 16221778 16221803 1 116
12 16221805 16221822 1 98
12 16230486 16230496 1 76
12 16231205 16232111 2 180
134 16222939 16222945 2 140
1758 16230755 16231121 2 206

After:

turnersd@provolone~: cat mendel.csv
FID,PAT,MAT,CHLD,N
1089,16223073,16223062,1,149
1116,16233564,16233589,1,114
123,16230767,16230725,2,189
12,16221778,16221803,1,116
12,16221805,16221822,1,98
12,16230486,16230496,1,76
12,16231205,16232111,2,180
134,16222939,16222945,2,140
1758,16230755,16231121,2,206


If you're interested in the details of what this is doing, here you go:

First, you cat the contents of the file and pipe it to a command called sed. The thing between the single quotes in the sed command is a regular expression, which is similar to doing a find-and-replace in MS Word. What this does is search for the thing between the first pair of slashes and replace it with the thing between the next two slashes. You need the -r option, the "s" before the first slash, and the "g" after the last slash to make it work right.

/^\s+// is the first regular expression. \s is special: it means search for whitespace. \s+ means search for any amount of whitespace. The ^ means only look for it at the beginning of the line. Notice there is nothing between the second and third slashes, so it will replace any whitespace with nothing. This part will trim any whitespace from the beginning of the line, which is important because in the next part we're turning any remaining whitespace into a comma, so we don't want the line to start with a comma.

/\s+/,/ is the second regular expression. Again we're searching for a variable amount of whitespace but this time replacing it with a comma.

Monday, May 18, 2009

Linux tip: history and !

Ever find yourself trying to remember a series of steps you recently did in Linux? Try typing the command "history" at the command line (without the quotes). You'll see a long list of your most recently used commands. It looks like this:

1018 ls
1019 cd
1020 cd /scratch/turnersd/
1021 ls
1022 cd
1023 grep -P "\s(10|9|8)\s" /scratch/turnersd/alz/parsed.txt | awk '{print $1"\n"$2}' | sort | uniq | perl -pi -e 's/RS/rs/g'
1024 history


Which brings me to my second tip, ! commands. Notice that when you type history the commands are numbered. 1023 in particular was a long string of commands I wouldn't want to retype over and over. Fortunately Linux lets me repeat that command any time I want just by typing an exclamation point followed by the number, like this — !1023 — at the command line, which does the same thing as typing it in the long way.

Thursday, April 16, 2009

Linux Command Du Jour: time

I briefly mentioned "time" in the previously posted Linux command line cheat sheet, but I can't overstate its utility, especially for ACCRE/Vampire users. The time command does exactly what it sounds like: it times how long it takes to run anything at the command line. And it couldn't be easier to use. Just type the word "time" before whatever you would normally type in at the command line.

For example, instead of:

> ./sMDR run.cfg

Type this in instead:

> time ./sMDR run.cfg

The MDR analysis (or whatever other command) runs just as before, but now there will be a couple lines displayed telling you how long the command took to execute. (Depending on whether you're doing this on the local cheeses or on Vampire the output will look slightly different, but you want to look at the "real" time). If you don't have anything to run right now, fire up a terminal window and just try "time ls". Obviously it won't take long to run, but you can see how the time command works.

Where this becomes very useful is if you are preparing to run analyses on Vampire, where you must estimate the time it takes to execute a command or analysis. You can time smaller versions of a large analysis for a better estimate of how long something will take. For more information, type "man time" at the command line.

Thursday, March 26, 2009

Ditch Excel for AWK!

How often have you needed to extract a certain column from a file, or strip out only a few individuals from a ped file? This is often done with Excel, but that requires loading the data, manipulating it, then saving it back into an acceptable format, which may mean converting tabs to spaces or other aggravating things. In many circumstances, a powerful Linux command, AWK, can accomplish the same thing with far fewer steps (at least once you get the hang of it).

AWK is practically a language of its own, and in fact portions of Perl are based on it. Let's say you have a file called text.txt and you want to find all lines of that file that contain the word "The":

> awk '/The/' text.txt

Or you'd like to see all lines that start with "rs":

> awk '/^rs/' text.txt

Or perhaps most usefully, you want to strip the top 5 lines out of a file:

> awk 'NR > 5' text.txt
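
And for the column-extraction chore mentioned at the top, pulling out (say) the third whitespace-delimited column is just as short:

> awk '{print $3}' text.txt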

This just scratches the surface of course... for a good tutorial with examples, check out this site:
http://www.vectorsite.net/tsawk_1.html#m2

I'll also look into setting up a post with AWK snippets for common useful procedures...

Will

Tuesday, March 24, 2009

Write Your First Perl Script

And it will do way more than display "Hello, world!" to the screen. An anonymous commenter on one of our Linux posts recently asked how to write scripts that will automate the same analysis on multiple files. While there are potentially several ways to do this, perl will almost always get the job done. Here I'll pose a situation you may run into where perl may help, and we'll break down what this simple perl script does, line by line.

Let's say you want to run something like plink or MDR, or any other program that you have to call at the command line with a configuration or batch file that contains the details of the analysis. For this simple case, let's pretend we're running MDR on a dataset we've stratified 3 ways by race. We have three MDR config files: afr-am.cfg, caucasian.cfg, and hispanic.cfg that will each run MDR on the respective dataset. Now, we could manually run MDR three separate times, and in fact, in this scenario that may be easier. But when you need to run dozens, or maybe hundreds of analyses, you'll need a way to automate things. Check out this perl script, and I'll explain what it's doing below. Fire up something like nano or pico, copy/paste this, and save the file as "runMDR.pl"

foreach $current_cfg (@ARGV)
{
# This will run sMDR
`./sMDR $current_cfg`;
}
# Hooray, we're done!

Now, if you call this script from the command line like this, giving the config files you want to run as arguments to the script, it will run sMDR on all three datasets, one after the other:

> perl runMDR.pl afr-am.cfg caucasian.cfg hispanic.cfg

You could also use the asterisk to pass everything that ends with ".cfg" as an argument to the script:

> perl runMDR.pl *.cfg

Okay, let's break this down, step by step.
  1. First, some syntax. Perl ignores everything on a line after the # sign, so you can comment your code to remind yourself later what it does. The little ` things on the 4th line are backticks. Those are usually above your tab key on your keyboard. And that semicolon is important.
  2. @ARGV is an array that contains the arguments that you pass to the program (the MDR config files here), and $current_cfg is a variable that assumes each element in @ARGV, one at a time.
  3. Each time $current_cfg assumes a new identity, perl will execute the code between the curly braces. So the first time, perl will execute `./sMDR afr-am.cfg`; The stuff between the backticks is executed exactly as if you were typing it into the shell yourself. Here, I'm assuming you have sMDR and afr-am.cfg in the current directory.
  4. Once perl executes the block of code between the braces for each element of @ARGV, it quits, and now you'll have results for all three analyses.
A few final thoughts... If the stuff you're automating is going to take a while to complete, you may consider checking out Greg's previous tutorial on screen. Next, if whatever program you're running over and over again displays output to the screen, you'll have to add an extra line to see that output yourself, or write that output to a file. Also, don't forget your comments! Perl can be quick to write but difficult to understand later on, so comment your scripts well. Finally, if you need more help and you can't find it here or here, many of the folks on this hall have used perl for some time, so ask around!
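
For instance, if you wrap those backticks in a print statement so the captured output actually goes somewhere, you can then redirect the whole run to a log file when you launch it (the log file name is just an example):

> perl runMDR.pl *.cfg > mdr_runs.log 2>&1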

Monday, March 23, 2009

Screen How-to

`screen` is one of the most useful system tools I've ever used. It allows you to begin a process and then detach from the process to let it continue to run in the background. You can later re-attach to the screen to check progress, even from a remote location.

To install screen on a Debian-based system, you would do:

apt-get install screen

or for a RedHat-based system:

yum install screen

Some usage examples:

To start a job inside a screen, a Ruby script for example, you would issue a command something like this:

> screen my_script.rb

The job begins to run in exactly the same manner as if you had not used a screen.

To detach from the screen you would use these keystrokes:

> Ctrl + a, Ctrl + d.

After this the Ruby script is still running.

To reconnect to the running screen you type:

> screen -dr

The 'd' means detach and the 'r' means re-attach to the current terminal.

If you have lots of screens running, you'll be prompted to specify which screen you want to re-attach to your terminal. For example:

> screen -dr
There are several suitable screens on:
5190.pts-0.pluto (03/18/2009 01:36:22 PM) (Detached)
5134.pts-0.pluto (03/18/2009 01:14:08 PM) (Detached)

Type "screen [-d] -r [pid.]tty.host" to resume one of them.


At this point you have to specify like this:

screen -dr 5190.pts-0.pluto

or

screen -dr 5134.pts-0.pluto

I find screen most useful for starting long-running jobs on remote servers. I can start the job, then log out and let it run without any worries about what my local system is doing. I can reboot or log off without any issues. Later I can re-attach the screen to my terminal to check progress as required. When the job is done, so is the screen; they are self-cleaning. :)

More info can be found here:

http://www.linuxmanpages.com/man1/screen.1.php