Friday, August 5, 2011

Friday Links: R, OpenHelix Bioinformatics Tips, 23andMe, Perl, Python, Next-Gen Sequencing

I haven't posted much here recently, but here is a roundup of a few of the links I've shared on Twitter (@genetics_blog) over the last two weeks.

Here is a nice tutorial on accessing high-throughput public data (from NCBI) using R and Bioconductor., a startup that allows you to run high-performance computing (HPC) applications in the cloud, now supports the previously mentioned R IDE, RStudio.

23andMe announced a project to enroll 10,000 African-Americans for research by giving participants their personal genome service for free. You can read about it here at 23andMe or here at Genetic Future.

Speaking of 23andMe, they emailed me a coupon code (8WR9U9) for getting $50 off their personal genome service, making it $49 instead of $99. Not sure how long it will last.

I previously took a poll which showed that most of you use Mendeley to manage your references. Mendeley recently released version 1.0, which includes some nice features like duplicate detection, better library organization (subfolders!), and a better file organization tool. You can download it here.

An interesting blog post by Michael Barton on how training and experience in bioinformatics leads to a wide set of transferable skills.

Dienekes releases a free DIY admixture program to analyze genomic ancestry.

A few tips from OpenHelix: the new SIB Bioinformatics Resource Portal, and testing correlation between SNPs and gene expression using SNPexp.

A nice animation describing a Circos plot from PacBio's E. coli paper in NEJM.

The Court of Appeals for the Federal Circuit reversed the lower court's invalidation of Myriad Genetics' patents on BRCA1/2, reinstating most of the claims in full force. Thoughtful analysis from Dan Vorhaus here.

Using the Linux shell and perl to delete files in the current directory that don't contain the right number of lines: If you want to get rid of all files in the current directory that don't have exactly 42 lines, run this code at the command line (*be very careful with this one!*): for f in *.txt;do perl -ne 'END{unlink $ARGV unless $.==42}' ${f} ;done

The previously mentioned Hitchhiker's Guide to Next-Generation Sequencing by Gabe Rudy at Golden Helix is now available in PDF format here. You can also find the related post describing all the various file formats used in NGS in PDF format here.

The Washington Post ran an article about the Khan Academy (, which has thousands of free video lectures, mostly on math. There are also a few computer science lectures that teach Python programming. (Salman Khan also appeared on the Colbert Report a few months ago).

Finally, I stumbled across this old question on BioStar with lots of answers about methods for short read mapping with next-generation sequencing data.


And here are a few interesting papers I shared:

Nature Biotechnology: Structural variation in two human genomes mapped at single-nucleotide resolution by whole genome de novo assembly

PLoS Genetics: Gene-Based Tests of Association

PLoS Genetics: Fine Mapping of Five Loci Associated with Low-Density Lipoprotein Cholesterol Detects Variants That Double the Explained Heritability

Nature Reviews Genetics: Systems-biology approaches for predicting genomic evolution

Genome Research: A comprehensively molecular haplotype-resolved genome of a European individual (paper about the importance of phase in genetic studies)

Nature Reviews Microbiology: Unravelling the effects of the environment and host genotype on the gut microbiome.


  1. That shell plus perl snippet has to be about the scariest code for unintended consequences I seen in a long time!

    Make a mistake or typo with the file globbing in the shell 'for' statement or the filter condition in the perl 'unless' modifier and you have just blown away (permanently!) a lot of your data. The unix command line has no undelete functionality.

    I would recommend making such a one-liner script move files to a waste-basket directory rather than directly unlink them. That way mistakes are recoverable.

    Perl and unix definitely allow you to shoot yourself in the foot, and that one-liner is more like a bazooka than a pistol. It will clear a massive amount of cruft with a minimum of effort, but without a safety catch might also do grievous bodily harm to a large genomics data set.

  2. Thank you so much for the SNPesp!


Note: Only a member of this blog may post a comment.

Creative Commons License
Getting Genetics Done by Stephen Turner is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported License.