Friday, March 9, 2012

find | xargs ... Like a Boss

*Edit March 12* Be sure to look at the comments, especially the commentary on Hacker News - you can supercharge the find|xargs idea by using find|parallel instead.
---

Do you ever discover a trick to do something better, faster, or easier, and wish you could reclaim all the wasted time and effort before your discovery? I've had this happen many times - learning how to use vim, learning about querying relational databases, switching from EndNote to Mendeley, etc.

Now I wish I could reclaim all the time I spent writing perl code to do something that find|xargs can do much more easily. I'll lead you through an example using a problem I ran into and solved easily with find|xargs.

The problem:

I'm doing an RNA-seq experiment where I have three replicates {1,2,3} for two conditions {C,M}. I've run Tophat to align the raw reads to a reference genome, and saved the output of each run to a separate directory. Within those directories I have an alignment (accepted_hits.bam) and some other stuff.



Now, I want to assemble transcripts separately for each alignment. All the alignments are 1 directory deep in the tophat output directory. Furthermore, I want to submit a separate job for each assembly onto our cluster using qsub. In a former life I would have wrapped a perl script around this to write shell scripts that the scheduler would run, and then write a second perl script that would submit all the jobs to the scheduler.

Here's a command that will do it all in one go:

PROCS=8; find `pwd` -name "accepted_hits.bam" | xargs -i echo qsub -l ncpus=$PROCS -- `which cufflinks` -p $PROCS -o {}-cufflinks -g genes.gtf {} | sh

So let's break this down one piece at a time to see what's going on here.

First the find command. If I want to find all the files in the current directory recursively through any subdirectory, I can use the find command with the -name argument. You can see that find is searching "." (the current directory), and is returning the path (relative to ".") for each file it finds.



But if I'm submitting jobs to a scheduler, I want to use the full absolute path. You can modify find's output by telling it to search in "." but give it the full path. Even better, tell find to search in `pwd` (those are backticks, usually above the tab key. The shell will run the pwd command, and insert that output into the find command. See below.



Now, here's where xargs comes in. xargs lets you build new commands based on the standard input. In other words, you can build a command to run on each line that gets piped to xargs. Use the -i option to create a command for each individual line on the pipe, and use {} as a placeholder. The example below should help.



So here's whats going on. In the first step, I'm piping the output of find into xargs, and xargs is creating a new command that will echo "somecommand {}", where {} is a placeholder for what gets piped into xargs. So you can replace somecommand with anycommand -any -options -you -want, and echo all that back out to the STDOUT.

The second part simply pipes whatever is echo'd to the STDIN to the bash shell ( | sh). So bash will run each command it receives on the pipe. Since somecommand doesn't exist, I'm getting an error.

Below, I'm building the command to run cufflinks on each alignment, and dump that output back out to a new directory based on the pattern of the alignment (bam) file name. But since in the next step I want to parallelize this on the cluster, and the cluster won't know that cufflinks is in my path, I need to tell it where to find cufflinks. I could give it the path, but I would rather use the backtick trick I showed you above to let the shell tell the shell where cufflinks resides by using `which cufflinks`.



In the final piece, I'm adding a few more components.



First, I'm setting a shell variable called PROCS. I can access this variable later by using $PROCS. This is how many CPUs I want to allocate to each assembly job. The ";" separates two commands. Instead of using xargs to build a cufflinks command to run at the shell directly, I'm using xargs to build a qsub command that will qsub these jobs to the cluster. To qsub something from the command line, the syntax is

qsub -- myprog -o -p -t arg1 arg2

Where are the PBS directives to qsub (like the number of processors I want, the amount of RAM I require, etc). myprog is the program you're running. -o, -p, -t are options to myprog, and arg1 and arg2 are the arguments to myprog. The two hyphens are required between the qsub options and the commands you actually want to run.

So in the command above, I'm using xargs to build up a qsub command, substituting $PROCS (8 here) for the number of CPUs I require, calling cufflinks from the full path using the backtick trick, telling cufflinks that I want $PROCS (8) processors, use genes.gtf for reference annotation based transcript (RABT) assembly, and run all that cufflinks stuff on the supplied alignment.

In summary, this one-liner will submit six 8-processor jobs. It sure beats writing perl scripts. There are a zillion other uses for piping together find with xargs. If you have a useful one-liner, please share it in the comments. Happy piping!

9 comments:

  1. If you like the combination of find and xargs take a look at replacing xargs with GNU parallel: http://www.gnu.org/software/parallel/man.html

    ReplyDelete
  2. Excellent post and I'm sure I will use xargs (and/or parallel).

    A while back I wrote a post on how data managers can use Unix tools for basic data processing. it covered cat, head, tail, wc, cut, paste, grep, sort, sed, curl, xmllint.

    http://mazamascience.com/WorkingWithData/?p=213

    It's quite amazing what you can do in a single line with standard Unix tools.

    ReplyDelete
  3. HN comments: http://news.ycombinator.com/item?id=3690781

    ReplyDelete
  4. Thanks for the comments everyone, especially those from Hacker News. GNU parallel is excellent, and also let's you do a dry run, e.g.

    find . -name "pattern" | parallel --dry-run command option {} {}.out

    ReplyDelete
  5. Your original case will also, I think, admit of a solution using the -exec option to find instead of xargs.

    ReplyDelete
  6. Thanks Stephen, extremely useful and simple !

    ReplyDelete
  7. I do this sort of thing a lot lately, just yesterday using tophat, in fact. For this task I would use brace expansion and a for loop instead of xargs and find. This makes the command history easier to decipher, if need be.


    for n in $(pwd)/cond{C,M}_rep{1..3}-tophat/accepted_hits.bam; do
    qsub -l ncpus=8 -- $(which cufflinks) -p 8 -o ${n}-cufflinks -g genes.gtf $n
    done

    Or, another cool thing about parallel is that you can give it sets of arguments, and it will create all combinations:

    parallel qsub -l ncpus=8 -- $(which cufflinks) -p 8 -o $(pwd)/cond{1}_rep{2}-tophat/accepted_hits.bam ::: C M ::: 1 2 3

    ReplyDelete
  8. I was wondering if we can use this method for tophat as well. For pair end reads, how can we give it two inputs then?

    ReplyDelete

Creative Commons License
Getting Genetics Done by Stephen Turner is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported License.