Tuesday, December 31, 2013

Jeff Leek's non-comprehensive list of awesome things other people did in 2013

Jeff Leek, biostats professor at Johns Hopkins and instructor of the Coursera Data Analysis course, recently posted on Simply Statistics this list of awesome things other people accomplished in 2013 in genomics, statistics, and data science.

At the risk of sounding too meta, I'll say that this list itself is one of the awesome things put together in 2013. You should go browse the entire post for yourself, but I'll highlight a few that I saved to my reading list:


This is only a sample of what's posted on Jeff's blog. Go read the full post below.

Simply Statistics: A non-comprehensive list of awesome things other people did this year.

Wednesday, December 18, 2013

Curoverse raises $1.5M to develop & support an open-source bioinformatics data analysis platform



Boston-based startup Curoverse has announced $1.5 million in funding to develop and support the open-source Arvados platform for cloud-based bioinformatics & genomics data analysis.

The Arvados platform was developed in George Church's lab by scientists and engineers led by Alexander Wait Zaranek, now scientific director at Curoverse. According to the Arvados wiki:

Arvados is a platform for storing, organizing, processing, and sharing genomic and other biomedical big data. The platform is designed to make it easier for bioinformaticians to develop analyses, developers to create genomic web applications and IT administrators to manage large-scale compute and storage genomic resources. The platform is designed to run on top of "cloud operating systems" such as Amazon Web Services and OpenStack. Currently, there are implementations that work on AWS and Xen+Debian/Ubuntu. ... A set of relatively low-level compute and data management functions are consistent across a wide range of analysis pipelines and applications that are being built for genomic data. Unfortunately, every organization working with these data has been forced to build its own custom systems for these low level functions. At the same time, there are proprietary platforms emerging that seek to solve these same problems. Arvados was created to provide a common solution across a wide range of applications that would be free and open source.

A few questions should be apparent: What value does Arvados provide over other more widely used platforms (namely, Galaxy) that also aim to enable reproducibility, transparency, sharing, collaboration, and data/workflow management with biological big data? And how does Curoverse distinguish itself from other cloud-based bioinformatics services like Seven Bridges, DNA Nexus, and the next implement-GATK-on-Amazon-and-sell-it-back-to-me service provider that pops up? I understand that there are real costs with free software, but will the service that Curoverse provides be valuable and cost-effective enough to overcome the activation energy and make up for the "switching costs" that the average bioinformatician faces on adopting a new way of doing things? While the platform and the support model sound potentially very useful, these are all questions that the Curoverse team will need to carefully consider in attracting new users.

Arvados open-source bioinformatics analysis platform: https://arvados.org/

Curoverse: https://curoverse.com/

Press Release: http://www.prweb.com/releases/2013/12/prweb11424292.htm

Monday, December 9, 2013

Biostar Tutorial: Cheat sheet for one-based vs zero-based coordinate systems

Obi Griffith over at Biostar put together this excellent cheat sheet for dealing with one-based and zero-based genomic coordinate systems. The cheat sheet visually explains the difference between zero and one-based coordinate systems, as well as how to indicate a position, SNP, range, or indel using both coordinate systems.

Biostar Tutorial: Cheat sheet for one-based vs zero-based coordinate systems


Wednesday, November 20, 2013

Using Database Joins to Compare Results Sets

One of the most powerful tools you can learn to use in genomics research is a relational database system, such as MySQL.  These systems are fairly easy to set up and use, and give users the ability to organize and manipulate data and statistical results with simple commands.  As a graduate student (during the height of GWAS), I found that this single skill quickly turned me into an “expert”.  By storing the SNP lists for common GWAS platforms and some simple annotations from the UCSC and ENSEMBL databases, I could quickly provide lists of SNPs within a gene or collection of genes, or pull a list of SNPs that overlap two genotyping platforms.  We even developed database modules (called LD-Spline) that allowed us to easily define LD blocks within a database query.

Once you learn the basics of defining tables and loading data, you can start to join tables together, matching them on a common field.  This is where the true power of a database system lies.  Suppose you have two sets of results from a PLINK analysis, one from a discovery dataset and another from a replication.  Rather than clumsily matching two sets of results within a spreadsheet application, a few simple queries within MySQL will tell you which SNPs are in common between the two sets, which were not found in the replication set, which SNPs were significant in the first set but not the second, etc.  

The concept that makes these operations work is the idea of a primary key.  A primary key is some field of a dataset that uniquely identifies each row of the table/dataset.  In the above example of PLINK results, a good primary key might be the RS number of the SNP.  You can also uniquely identify rows based on two columns, a concept known as a composite key – for example, the chromosome AND position of a SNP.  Establishing a primary key allows MySQL to keep data stored in a sorted order and allows the matching operations for table joins to be performed much faster. 

Having this sorted order from a primary key prevents MySQL from having to scan an entire table to find a specific value.  Much like the index of a book, a primary key lets MySQL find a value within a table very quickly.  If a table is small, having a primary key is not as critical; the computer can quickly scan the entire contents of the table for any query.  If the table is large, however, a full scan of the entire table could be a costly operation, and the number of table scans required increases when doing a join.  For example, if we join tables for our discovery and replication results sets, the database system will take the RS number for each entry from the discovery table and attempt to find a matching RS number in the replication table.  If the replication table has the RS number as a primary key, the database system can very quickly find this entry. There is a fantastic post on the various types of database joins here.

Let's start by creating our database tables.  A typical PLINK association output contains 12 columns (CHR, SNP, BP, A1, TEST, NMISS, OR, SE, L95, U95, STAT, P).  In these tables, we've established the SNP column as the primary key.  Recall that the primary key must uniquely identify each row of the table, but PLINK will sometimes report multiple TEST rows per SNP.  If this is the case, we need to either establish a composite key using PRIMARY KEY (`snp`,`test`), or simply eliminate the extra rows from the data file using an AWK command.
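For instance, a minimal sketch of that filter (using PLINK's standard .assoc.logistic column order listed above, where TEST is the fifth column, and a hypothetical output file name) keeps the header plus only the additive-model rows:

# Keep the header line (NR==1) and only the ADD test rows
awk 'NR == 1 || $5 == "ADD"' input-file.assoc.logistic > input-file.add-only.assoc.logistic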

CREATE TABLE `discovery` (
  `chr` varchar(2),
  `snp` varchar(32),
  `bp` int,
  `a1` varchar(1),
  `test` varchar(3),
  `nmiss` int,
  `or` float,
  `se` float,
  `l95` float,
  `u95` float,
  `stat` float,
  `p` float,
  PRIMARY KEY (`snp`)
);

CREATE TABLE `replication` (
  `chr` varchar(2),
  `snp` varchar(32),
  `bp` int,
  `a1` varchar(1),
  `test` varchar(3),
  `nmiss` int,
  `or` float,
  `se` float,
  `l95` float,
  `u95` float,
  `stat` float,
  `p` float,
  PRIMARY KEY (`snp`)
);

Before loading our data into these tables, a little pre-processing is helpful.  To ensure that results are easy to read on the screen, PLINK developers used leading spaces in the column format for many PLINK outputs.  These make loading the results into a database difficult.  We can resolve this by running a simple SED command:
sed -r -e 's/\s+/\t/g' -e 's/^\t//g' input-file.assoc.logistic > discovery.load
This will convert each run of whitespace to a single tab, remove the leading tab, and write the results to a new file, discovery.load.  Now let's load this file into our table, and repeat the procedure for our replication file.

LOAD DATA LOCAL INFILE '{PathToFile}/discovery.load' INTO TABLE 
discovery FIELDS TERMINATED BY '\t' IGNORE 1 LINES;
Now we should have two MySQL database tables with the discovery and replication results sets loaded into them.  We can view their contents with a simple select statement.  Then, finally, we can join these two tables to easily compare the results from the discovery and replication analyses.


SELECT * FROM discovery INNER JOIN replication ON 
discovery.snp = replication.snp;
The syntax is simple: select a set of fields -- in this case all of them (represented by the *) -- from the first table (discovery), matching each row from this table to a row in the second table (replication) where the discovery SNP equals the replication SNP.  MySQL also supports a table alias which can make these queries a bit easier to write.  An alias is simply a label specified after a table name which can be used in the rest of the query in place of the full table name.  For example, in the query below, we use a for the discovery table and b for the replication table.

SELECT * FROM discovery a INNER JOIN replication b ON 
a.snp = b.snp;
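To script the comparisons described at the beginning of this post, the same joins can be run non-interactively with the mysql client. This is only a sketch: the database name, user, and p-value thresholds below are arbitrary placeholders, and the tables are the ones defined above. The first query previews the loaded data, the second uses a LEFT JOIN to find discovery SNPs with no match in the replication set, and the third pulls SNPs significant in the discovery analysis but not in the replication.

mysql -u username -p gwas <<'SQL'
-- Quick sanity check on the loaded data
SELECT * FROM discovery LIMIT 5;

-- Discovery SNPs with no matching row in the replication table
SELECT a.snp, a.p FROM discovery a
LEFT JOIN replication b ON a.snp = b.snp
WHERE b.snp IS NULL;

-- SNPs significant in discovery but not replicated (thresholds are arbitrary)
SELECT a.snp, a.p AS discovery_p, b.p AS replication_p
FROM discovery a INNER JOIN replication b ON a.snp = b.snp
WHERE a.p < 5e-8 AND b.p >= 0.05;
SQL

The LEFT JOIN keeps every discovery row and fills the replication columns with NULL when there is no match, which is what makes the IS NULL filter work.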
With practice and additional data, join operations can be used to annotate results by gene or region, and to match these to results from other studies, such as the NHGRI GWAS catalog.


Wednesday, November 6, 2013

A Mitochondrial Manhattan Plot

Manhattan plots have become the standard way to visualize results for genetic association studies, allowing the viewer to instantly see significant results in the rough context of their genomic position.  Manhattan plots are typically shown on a linear X-axis (although the circos package can be used for radial plots), and this is consistent with the linear representation of the genome in online genome browsers.  Genetic studies, however, often overlook the other genome within all our cells, the mitochondrial genome. This circular molecule has been shown to be associated (albeit inconsistently) with many disease traits, and functional variants from this genome are now included in most genotyping platforms.

Thanks to the clever work of several graduate students in the Vanderbilt Center for Human Genetics Research (most notably Ben Grady), mitochondrial genetic associations can be visualized in the context of key regions of the mitochondrial genome using a radial plot in the R package ggplot2.

To make this plot, download the code and run the script (alternatively open the script in R and run interactively):

On the command line:

git clone git@github.com:stephenturner/solarplot.git
cd solarplot
R CMD BATCH solarplot.R


GitHub: Mitochondrial solar plot code and data

Monday, November 4, 2013

Archival and analysis of #GI2013 Tweets

I archived and analyzed all Tweets containing #GI2013 from the recent Cold Spring Harbor Genome Informatics meeting, using my previously described code.

Friday was the most Tweeted day. Perhaps this was due to Lior Pachter's excellent keynote, "Stories from the Supplement."

There was clearly a bimodal distribution in the Tweets by hour, corresponding to the morning and evening sessions:

Top hashtags used other than #GI2013:

Most prolific users:

And of course, the much-maligned word cloud:

Archive of all #GI2013 Tweets

R code for analysis

Thursday, October 31, 2013

Real-time streaming differential RNA-seq analysis with eXpress

RNA-seq has been performed routinely for at least five years, yet there is still no consensus on the best methodology for analyzing these data. For example, Eduardo Eyras's group recently posted a pre-print on methods to study splicing from RNA-seq, where this great figure was shown:



This illustrates the problem clearly: ignoring alternative workflows that simply count reads mapping to genes or exons and doing a negative binomial-based test, there are thousands of potential paths through an isoform-resolution RNA-seq analysis.

While the gene/exon-count based approach is simpler and arguably more powerful and well-characterized, there are numerous potential problems with this approach, as outlined in the recent cuffdiff 2 paper:

Reproduced from the cuffdiff2 paper under fair use.

I'm not going to go into the merits of feature-count methods versus transcript deconvolution methods - that discussion is best settled by others.

But if you have the coverage needed for isoform reconstruction, perhaps the most commonly-trodden path through the transcript-resolution differential expression analysis is using TopHat for alignment, Cufflinks for assembly, and Cuffdiff for quantitation and differential expression analysis, as described in this recent protocol.

However, if you've ever used cufflinks with lots of samples or lots of reads, you've noted the exponential increase in computational resources necessary to run the analysis. This problem, as well as the performance of an alternative approach, is illustrated in Fig. 2b in the recent publication about a new tool from Lior Pachter's lab, the un-Google-ably named eXpress, for streaming real-time fragment assignment of RNA-seq data:

Reproduced from the eXpress paper under Fair Use.

I won't attempt to explain here how eXpress works other than to tell you it's EM-based, and to direct you to the paper for a fuller explanation. As you can see from the figure above, the resource requirement with more samples and higher coverage increases linearly, consuming only slightly more RAM than the UNIX wc (word count) command, all the while maintaining accuracy comparable to or slightly better than existing state-of-the-art methods like Cufflinks or RSEM.
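If you want to try it, the basic invocation (as I read the eXpress documentation; file names are placeholders and the bowtie2 options here are deliberately minimal, so check the manual for the recommended alignment settings) is to align reads against the transcriptome rather than the genome and hand the alignments to express:

# Align to the transcriptome, reporting multiple alignments per read (-a), then quantify
bowtie2 -a -x transcripts_index -1 reads_1.fastq -2 reads_2.fastq -S hits.sam
express transcripts.fasta hits.sam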

So, what's the hold-up? Why isn't everyone using eXpress for differential gene/transcript expression with RNA-seq data? Personal preferences and allegiances aside, part of the reason might be that eXpress's estimated fragment abundances are not true counts, and are not optimally treated as such. Strictly speaking, you can't simply take the transcript abundances you get out of eXpress, throw them into a count-based test like those implemented in edgeR or DESeq, and expect the results to be accurate and optimal. What's lacking is a mathematical framework (and a user-friendly R or other software package) for statistical analysis of differential transcript abundance across multiple samples, using the abundances estimated by eXpress.

I ran into Lior at the CSHL Genome Informatics meeting this morning, and pressed him on when we might see an R package for statistically analyzing eXpress-estimated isoform abundances, and I was told we would see something within a month. I'm going to hold you to that, Lior, so keep an eye out on Bioconductor and Lior's blog for the much needed and long-awaited statistical framework and R package to do this analysis.

#GI2013 folks, I'll see you at the poster session and reception. And to everyone else, as always, keep calm and sequence on.

Monday, October 28, 2013

Analysis of #ASHG2013 Tweets

I archived and analyzed all Tweets with the hashtag #ASHG2013 using my previously mentioned code.

Number of Tweets by date shows Wednesday was the most Tweeted day:

The top used hashtags other than #ASHG2013:

The most prolific users:

And what Twitter analysis would be complete without the widely loved, and more widely hated word cloud:

Edit 8:24am: I have gotten notes that some Tweets were not captured in this archive. This year's ASHG was very actively Tweeted. Unfortunately there are API limits restricting how many results I can return using the t Twitter command line client. A search during a particularly active time of day might have truncated some search results. Please feel free to send me a pull request if you think there's something I can do to improve the automated search code!

Tuesday, October 22, 2013

PubMed Commons: One post-publication peer review forum to rule them all?


Several post-publication peer review forums, such as Faculty of 1000 and PubPeer, already exist to facilitate discussion of papers after they have been published. F1000 only allows a small number of "faculty" to comment on articles, and access to read commentary requires a paid subscription. PubPeer and similar startup services lack the critical mass of participants needed to make such a community truly useful. And while the Twitter-/blogosphere space is great for discussing scientific research, commentary is fragmented, scattered across many blogs, Google+, and Twitter feeds (not to mention, discussion content is owned by Twitter, Google, and other dot-com services with no guarantee of permanence).

Enter PubMed Commons.

PubMed Commons is a system that enables researchers to share their opinions about scientific publications. Researchers can comment on any publication indexed by PubMed, and read the comments of others. PubMed Commons is a forum for open and constructive criticism and discussion of scientific issues. It will thrive with high quality interchange from the scientific community. PubMed Commons is currently in a closed pilot testing phase, which means that only invited participants can add and view comments in PubMed.

PubMed Commons is currently in an invitation-only pilot phase. But if this catches on, it could easily become the post-publication peer review forum that solves many of the problems mentioned above. PubMed Commons would be open to anyone who registers with a MyNCBI ID, meaning discussion would not be limited to a select few "faculty." Discussion would be collated and hosted by NCBI/NLM, which I like to think has a longer half-life than Google's latest foray into social networking or other dot-com ventures. Most importantly, PubMed use is ubiquitous. Whether you use PubMed's search directly or you land on a PubMed abstract from Google Scholar, you'll almost always link out to a paper from a PubMed abstract. This means that the momentum and critical mass needed to make a forum like this actually useful already exist.

The platform for publishing comments looks pretty simple (and even supports Markdown syntax!):


Another critical feature is the ability to invite the authors of the paper to join the discussion:


Right now PubMed Commons is invitation-only, but I'm optimistic about the public launch and hope to see this catch on.

PubMed Commons: http://www.ncbi.nlm.nih.gov/pubmedcommons/

Monday, October 21, 2013

Useful Unix/Linux One-Liners for Bioinformatics

Much of the work that bioinformaticians do is munging and wrangling around massive amounts of text. While there are some "standardized" file formats (FASTQ, SAM, VCF, etc.) and some tools for manipulating them (fastx toolkit, samtools, vcftools, etc.), there are still times where knowing a little bit of Unix/Linux is extremely helpful, namely awk, sed, cut, grep, GNU parallel, and others.

This is by no means an exhaustive catalog, but I've put together a short list of examples using various Unix/Linux utilities for text manipulation, from the very basic (e.g., sum a column) to the very advanced (munge a FASTQ file and print the total number of reads, total number of unique reads, percentage of unique reads, most abundant sequence, and its frequency). Most of these examples (with the exception of the SeqTK examples) use built-in utilities installed on nearly every Linux system. These examples are a combination of tactics I use every day and examples culled from other sources listed at the top of the page.
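To give a flavor of what's in the list, here are two examples in the same spirit (file names are placeholders): the basic column sum mentioned above, and a FASTQ read counter that reports the total and unique read counts.

# Sum the values in the first column of a file
awk '{sum += $1} END {print sum}' values.txt

# Total reads, unique reads, and percent unique in a FASTQ file
awk 'NR % 4 == 2' reads.fastq | sort | uniq -c | \
  awk '{total += $1; unique += 1} END {print total, unique, 100*unique/total "%"}'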



The list is available as a README in this GitHub repo. This list is a start - I would love suggestions for other things to include. To make a suggestion, leave a comment here, or better - open an issue, or even better still - send me a pull request.

Useful one-liners for bioinformatics: https://github.com/stephenturner/oneliners

Alternatively, download a PDF here.

Thursday, October 10, 2013

De Novo Transcriptome Assembly with Trinity: Protocol and Videos

One of the clearest advantages RNA-seq has over array-based technology for studying gene expression is not needing a reference genome or a pre-existing oligo array. De novo transcriptome assembly allows you to study non-model organisms, cancer cells, or environmental metatranscriptomes. One of the challenges with de novo transcriptome assembly, above and beyond all the challenges associated with genome assembly, is the highly varying abundance (and thus uneven sequencing depth) of different transcripts in a cell.

Several tools have been developed for de novo transcriptome assembly. One of the most widely used is Trinity, developed at the Broad Institute. Trinity is free and open-source, and a recent Nature Protocols article walks through using Trinity for de novo RNA-seq:

Haas, Brian J., et al. "De novo transcript sequence reconstruction from RNA-seq using the Trinity platform for reference generation and analysis." Nature protocols 8.8 (2013): 1494-1512.

In addition, Trinity's creator, Brian Haas, has published four videos on YouTube on de novo transcriptome assembly using Trinity (via RNA-Seq Blog):

Introduction to De Novo RNA-Seq Assembly using Trinity



The General Approach to De novo RNA-Seq Assembly Using De Bruijn Graphs



Trinity - How it works



Strand-specific RNA-Seq is Preferred



Finally, if you're at UVA, we'll be hosting a transcriptome assembly workshop here in November, and registration will be opening soon.

Wednesday, August 21, 2013

Utility script for launching bare JAR files

Torsten Seemann compiled a list of minimum standards for bioinformatics command line tools, things like printing help when no commands are specified, including version info, avoiding hardcoded paths, etc. These should be obvious to any seasoned software engineer, but many of these standards are not followed in bioinformatics.

#8 on the list was "Don't distribute bare JAR files." This is particularly annoying, requiring a user to invoke the software using something like: java -Xmx1000m -jar /path/on/my/system/to/software.jar . There are a few notable offenders in bioinformatics out there (I'm looking at you, Trimmomatic, snpEff, GATK...).

A very simple solution to the bare JAR file problem is distributing your java tool with a shell script wrapper that makes it easier for your users to invoke. E.g., if I have GATK installed in ~/bin/ngs/gatk/GenomeAnalysisTK.jar, I can create this shell script at ~/bin/ngs/gatk/gatk (replace GenomeAnalysisTK.jar with someOtherBioTool.jar):
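A minimal sketch of such a wrapper (the memory setting is an example; adjust the path and Xmx value to your own install) looks something like this:

#!/bin/bash
# Pass all command-line arguments straight through to the jar
exec java -Xmx2g -jar ~/bin/ngs/gatk/GenomeAnalysisTK.jar "$@"

Using exec means the wrapper process is replaced by the JVM, so exit codes and signals behave as if you had run java directly.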



Once I make that script executable and include that directory in my path, calling GATK is much simpler:
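For example (the --help call is just a placeholder for whatever GATK arguments you would normally pass):

# Make the wrapper executable and put its directory on the PATH
chmod +x ~/bin/ngs/gatk/gatk
export PATH=$PATH:~/bin/ngs/gatk

# Now any GATK invocation drops the java -Xmx... -jar boilerplate
gatk --help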


Yes, I'm fully aware that making my own JAR launcher utility scripts for existing software will make my code less reproducible, but for quick testing and development I don't think it matters. This tip works best when developers distribute their JAR files along with utility scripts for invoking them.

See the post below for more standards that should be followed in bioinformatics software development.

Torsten Seemann: Minimum standards for bioinformatics command line tools

Monday, August 12, 2013

Understanding the ENSEMBL Schema

ENSEMBL is a frequently used resource for various genomics and transcriptomics tasks.  The ENSEMBL website and MART tools provide easy access to their rich database, but ENSEMBL also provides flat-file downloads of their entire database and a public MySQL portal.  You can access the public MySQL server using MySQL Workbench with the following credentials:
Host:     useastdb.ensembl.org
User:     anonymous
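If you prefer the command-line client to Workbench, a connection sketch looks like the following; the database name is just an example of a human core build that was current around this time, so list the available databases first if you're unsure (and add -P 5306 if the default port doesn't respond):

# List the available human core databases, then connect to one
mysql -h useastdb.ensembl.org -u anonymous -e 'SHOW DATABASES LIKE "homo_sapiens_core%";'
mysql -h useastdb.ensembl.org -u anonymous homo_sapiens_core_73_37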
Once inside, you can get a sense for what the ENSEMBL schema (or data model) is like.  First, it's important to understand the ENSEMBL ID system.  Most of the primary entities in the ENSEMBL database (genes, exons, transcripts, proteins) have a formal, stable identifier (beginning with ENSG, ENSE, ENST, and ENSP respectively) that does not change from build to build.  These identifiers can be found in the *_stable_id tables (gene_stable_id, transcript_stable_id, and so on).  All of these entities also have an internal identifier (an integer).  Once you have an internal ID for the entity of interest, details of the entity can be found in the gene, exon, transcript, and translation (protein) tables. For example, the following query will retrieve a list of all transcripts and their exons for every gene (restrict it with a WHERE clause on a.stable_id to pull a gene of interest).
SELECT * FROM gene_stable_id a
inner join gene b on a.gene_id = b.gene_id
inner join transcript c on b.gene_id = c.gene_id
inner join exon_transcript d on c.transcript_id = d.transcript_id
inner join exon e on d.exon_id = e.exon_id
inner join transcript_stable_id f on c.transcript_id = f.transcript_id
inner join exon_stable_id g on e.exon_id = g.exon_id;
The exon_transcript table contains a mapping of each exon to any transcripts containing it, and also contains a rank to indicate which exon it is relative to a given transcript.  To retrieve exons for a list of genes by their ENSEMBL IDs, these could be loaded into a table and joined to the gene_stable_id table in the query above.  To pull the build 37 chromosome and coordinates for an exon, use the following:
Select a.exon_id, b.name, a.seq_region_start, a.seq_region_end from exon a
inner join seq_region b on a.seq_region_id = b.seq_region_id
inner join coord_system c on b.coord_system_id = c.coord_system_id
where c.version = "GRCh37";
In this query, the seq_region table contains a field called name that identifies the contig to which the coordinates refer, in this case the chromosome number. 

There are also extensive cross-references in the ENSEMBL database.  To retrieve alternate identifiers for a set of transcripts, execute the following: 
select * from transcript_stable_id a
inner join transcript b on a.transcript_id = b.transcript_id
inner join object_xref c on b.transcript_id = c.ensembl_id
inner join xref d on c.xref_id = d.xref_id
inner join external_db e on d.external_db_id = e.external_db_id
where ensembl_object_type = "Transcript"
limit 20;
ENSEMBL organizes cross-references (xrefs) for all entity types into a single table object_xref.  This table contains an ensembl_object_type field that is a “Transcript”, “Gene”, or “Translation”, and an ensembl_id that matches either a gene_id, transcript_id, or a translation_id.  Replace “transcript” in the above query with “gene” or “translation” to retrieve gene or protein cross-references.  A list of all external cross-reference sources can be found by querying: 
Select db_name from external_db;
There is a great deal of information within the ENSEMBL database that can be accessed using SQL, which for some types of operations is easier than using the MART or web interface.  Full details of the ENSEMBL schema can be found here (http://useast.ensembl.org/info/docs/api/core/core_schema.html)


Monday, August 5, 2013

Google Developers R Programming Video Lectures

Google Developers recognized that most developers learn R in bits and pieces, which can leave significant knowledge gaps. To help fill these gaps, they created a series of introductory R programming videos. These videos provide a solid foundation for programming tools, data manipulation, and functions in the R language and software. The series of short videos is organized into four subsections: intro to R; loading data and more data formats; data processing; and writing functions. Start watching the YouTube playlist here, or watch an individual lecture below:

1.1 - Initial Setup and Navigation
1.2 - Calculations and Variables
1.3 - Create and Work With Vectors
1.4 - Character and Boolean Vectors
1.5 - Vector Arithmetic
1.6 - Building and Subsetting Matrices
1.7 - Section 1 Review and Help Files
2.1 - Loading Data and Working With Data Frames
2.2 - Loading Data, Object Summaries, and Dates
2.3 - if() Statements, Logical Operators, and the which() Function
2.4 - for() Loops and Handling Missing Observations
2.5 - Lists
3.1 - Managing the Workspace and Variable Casting
3.2 - The apply() Family of Functions
3.3 - Access or Create Columns in Data Frames, or Simplify a Data Frame using aggregate()
4.1 - Basic Structure of a Function
4.2 - Returning a List and Providing Default Arguments
4.3 - Add a Warning or Stop the Function Execution
4.4 - Passing Additional Arguments Using an Ellipsis
4.5 - Make a Returned Result Invisible and Build Recursive Functions
4.6 - Custom Functions With apply()

Wednesday, July 24, 2013

Archival, Analysis, and Visualization of #ISMBECCB 2013 Tweets

As the 2013 ISMB/ECCB meeting is winding down, I archived and analyzed the 2000+ tweets from the meeting using a set of bash and R scripts I previously blogged about.

The archive of all the tweets tagged #ISMBECCB from July 19-24, 2013 is and will forever remain here on Github. You'll find some R code to parse through this text and run the analyses below in the same repository, explained in more detail in my previous blog post.

Number of tweets by date:


Number of tweets by hour:


Most popular hashtags, other than #ismbeccb. With separate hashtags for each session, this really shows which other SIGs and sessions were well-attended. It also shows the popularity of the unofficial ISMB BINGO card.


Most prolific users. I'm not sure who or what kind of account @sciencstream is - seems like spam to me.


And the obligatory word cloud:


Friday, July 12, 2013

Course Materials from useR! 2013 R/Bioconductor for Analyzing High-Throughput Genomic Data

At last week's 2013 useR! conference in Albacete, Spain, Martin Morgan and Marc Carlson led a course on using R/Bioconductor for analyzing next-gen sequencing data, covering alignment, RNA-seq, ChIP-seq, and sequence annotation using R. The course materials are online here, including R code for running the examples, the PDF vignette tutorial, and the course material itself as a package.



Course Materials from useR! 2013 R/Bioconductor for Analyzing High-Throughput Genomic Data

Tuesday, July 2, 2013

Customize your .Rprofile and Keep Your Workspace Clean

Like your .bashrc, .vimrc, or many other dotfiles you may have in your home directory, your .Rprofile is sourced every time you start an R session. On Mac and Linux, this file is usually located in ~/.Rprofile. On Windows it's buried somewhere in the R program files. Over the years I've grown and pruned my .Rprofile to set various options and define various "utility" functions I use frequently at the interactive prompt.

One of the dangers of defining too many functions in your .Rprofile is that your code becomes less portable, and less reproducible. For example, if I were to define adf() as a shortcut to as.data.frame(), code that I send to other folks using adf() would return errors that the adf object doesn't exist. This is a risk that I'm fully aware of in regards to setting the option stringsAsFactors=FALSE,  but it's a tradeoff I'm willing to accept for convenience. Most of the functions I define here are useful for exploring interactively. In particular, the n() function below is handy for getting a numbered list of all the columns in a data frame; lsp() and lsa() list all functions in a package, and list all objects and classes in the environment, respectively (and were taken from Karthik Ram's .Rprofile); and the o() function opens the current working directory in a new Finder window on my Mac. In addition to a few other functions that are self-explanatory, I also turn off those significance stars, set a default CRAN mirror so it doesn't ask me all the time, and source in the biocLite() function for installing Bioconductor packages (note: this makes R require web access, which might slow down your R initialization).

Finally, you'll notice that I'm creating a new hidden environment, and defining all the functions here as objects in this hidden environment. This allows me to keep my workspace clean, and remove all objects from that workspace without nuking any of these utility functions.

My .Rprofile is below; feel free to steal or adapt however you'd like. I used to keep my .Rprofile synced across multiple installations using Dropbox, but now I keep all my dotfiles in a single git-versioned directory, symlinked where they need to go (usually ~/).
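Setting up that git-versioned dotfiles arrangement only takes a couple of commands; the repository name and layout here are hypothetical:

# Keep dotfiles under version control and symlink them into $HOME
git clone https://github.com/yourusername/dotfiles.git ~/dotfiles
ln -sf ~/dotfiles/.Rprofile ~/.Rprofile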

Friday, June 7, 2013

ENCODE ChIP-Seq Significance Tool: Which TFs Regulate my Genes?

I collaborate with several investigators on gene expression projects using both microarray and RNA-seq. After I show a collaborator which genes are dysregulated in a particular condition or tissue, the most common question I get is "what are the transcription factors regulating these genes?"

This isn't the easiest question to answer. You could look at transcription factor binding site position weight matrices like those from TRANSFAC and come up with a list of all factors that potentially hit that site, then perform some kind of enrichment analysis on that. But this involves some programming, and is based solely on sequence motifs, not experimental data.

The ENCODE consortium spent over $100M and generated hundreds of ChIP-seq experiments for different transcription factors and histone modifications across many cell types (if you don't know much about ENCODE, go read the main ENCODE paper, and Sean Eddy's very fair commentary). Regardless of what you might consider "biologically functional", the ENCODE project generated a ton of data, and much of this data is publicly available. But that still doesn't help answer our question, because genes are often bound by multiple TFs, and TFs can bind many regions. We need to perform an enrichment (read: hypergeometric) test to assess an over-representation of experimentally bound transcription factors around our gene targets of interest ("around" also implies that some spatial boundary must be specified). To date, I haven't found a good tool to do this easily.

Raymond Auerbach and Bin Chen in Atul Butte's lab recently developed a resource to address this very common need, called the ENCODE ChIP-Seq Significance Tool.

The paper: Auerbach et al. Relating Genes to Function: Identifying Enriched Transcription Factors using the ENCODE ChIP-Seq Significance Tool. Bioinformatics (2013): 10.1093/bioinformatics/btt316.

The software: ENCODE ChIP-Seq Significance Tool (http://encodeqt.stanford.edu/).

This tool takes a list of "interesting" (significant, dysregulated, etc.) genes as input, and identifies ENCODE transcription factors from this list. Head over to http://encodeqt.stanford.edu/, select the ID type you're using (Ensembl, Symbol, etc), and paste in your list of genes. You can also specify your background set (this has big implications for the significance testing using the hypergeometric distribution). Scroll down some more to tell the tool how far up and downstream you want to look from the transcription start/end site or whole gene, select an ENCODE cell line (or ALL), and hit submit. 

You're then presented with a list of transcription factors that are most likely regulating your input genes (based on overrepresentation of ENCODE ChIP-seq binding sites). Your results can then be saved to CSV or PDF. You can also click on a number in the results table and get a list of genes that are regulated by a particular factor (the numbers do not appear as hyperlinks in my browser, but clicking the number still worked).

At the very bottom of the page, you can load example data that they used in the supplement of their paper, and run through the analysis presented therein. The lead author, Raymond Auerbach, even made a very informative screencast on how to use the tool:


Now, if I could only find a way to do something like this with mouse gene expression data.

Thursday, May 30, 2013

PLATO, an Alternative to PLINK

Since the near beginning of genome-wide association studies, the PLINK software package (developed by Shaun Purcell’s group at the Broad Institute and MGH) has been the standard for manipulating the large-scale data produced by these studies.  Over the course of its development, numerous features and options were added to enhance its capabilities, but it is best known for the core functionality of performing quality control and standard association tests. 

Nearly 10 years ago (around the time PLINK was just getting started), the CHGR Computational Genomics Core (CGC) at Vanderbilt University started work on a similar framework for implementing genotype QC and association tests.  This project, called PLATO, has stayed active primarily to provide functionality and control that (for one reason or another) is unavailable in PLINK.  We have found it especially useful for processing ADME and DMET panel data – it supports QC and association tests of multi-allelic variants.    

PLATO runs via a command-line interface, but accepts a batch file that allows users to specify an order of operations for QC filtering steps.  When running multiple QC steps in a single run of PLINK, the order of application is hard-coded and not well documented.  As a result, users wanting this level of control must run a sequence of PLINK commands, generating new data files at each step, leading to longer compute times and greater disk usage.  PLATO also has a variety of data reformatting options for other genetic analysis programs, making it easy to run EIGENSTRAT, for example.

The detail of QC output from each of the filtering steps is much greater in PLATO, allowing output per group (founders only, parents only, etc), and giving more details on why samples fail sex checks, Hardy-Weinberg checks, and Mendelian inconsistencies to facilitate deeper investigation of these errors.  And with family data, disabling samples due to poor genotype quality retains pedigree information useful for phasing and transmission tests. Full documentation and download links can be found here (https://chgr.mc.vanderbilt.edu/plato).  Special thanks to Yuki Bradford in the CGC for her thoughts on this post.  

Wednesday, May 15, 2013

Automated Archival and Visual Analysis of Tweets Mentioning #bog13, Bioinformatics, #rstats, and Others

Automatically Archiving Twitter Results

Ever since Twitter gamed its own API and killed off great services like IFTTT triggers, I've been looking for a way to automatically archive tweets containing certain search terms of interest to me. Twitter's built-in search is limited, and I wanted to archive interesting tweets for future reference and to start playing around with some basic text / trend analysis.

Enter t - the twitter command-line interface. t is a command-line power tool for doing all sorts of powerful Twitter queries using the command line. See t's documentation for examples.

I wrote this script that uses the t utility to search Twitter separately for a set of specified keywords, and append those results to a file. The comments at the end of the script also show you how to commit changes to a git repository, push to GitHub, and automate the entire process to run twice a day with a cron job. Here's the code as of May 14, 2013:
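The heart of that script is just a loop over search terms. Here's a stripped-down sketch; the t options and file layout are my assumptions from memory rather than the script itself, so check t's documentation before using it:

#!/bin/bash
# Append new results for each keyword to its own archive file
cd ~/twitterchive || exit 1
mkdir -p searches
for query in bioinformatics metagenomics "#rstats" "rna-seq" "#bog13"; do
    t search all "$query" -n 200 >> "searches/$(echo "$query" | tr -d '#').txt"
done
# Commit and push the updated archive
git add searches/ && git commit -m "automated search update" && git push
# Example crontab entry to run twice a day:
# 0 8,20 * * * /path/to/twitterchive.sh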



That script, and results for searching for "bioinformatics", "metagenomics", "#rstats", "rna-seq", and "#bog13" (the Biology of Genomes 2013 meeting) are all in the GitHub repository below. (Please note that these results update dynamically, and searching Twitter at any point could possibly result in returning some unsavory Tweets.)

https://github.com/stephenturner/twitterchive

Analyzing Tweets using R

You'll also find an analysis subdirectory, containing some R code to produce barplots showing the number of tweets per day over the last month, frequency of tweets by hour of the day, the most used hashtags within a search, the most prolific tweeters, and a ubiquitous word cloud. Much of this code is inspired by Neil Saunders's analysis of Tweets from ISMB 2012. Here's the code as of May 14, 2013:



Also in that analysis directory you'll see periodically updated plots for the results of the queries above.

Analyzing Tweets mentioning "bioinformatics"

Using the bioinformatics query, here are the number of tweets per day over the last month:


Here is the frequency of "bioinformatics" tweets by hour:

Here are the most used hashtags (other than #bioinformatics):

Here are the most prolific bioinformatics Tweeps:

Here's a wordcloud for all the bioinformatics Tweets since March:

Analyzing Tweets mentioning "#bog13"

The 2013 CSHL Biology of Genomes Meeting took place May 7-11, 2013. I searched and archived Tweets mentioning #bog13 from May 1 through May 14 using this script. You'll notice in the code above that I'm no longer archiving this hashtag. I probably need a better way to temporarily add keywords to the search, but I haven't gotten there yet.

Here are the number of Tweets per day during that period. Tweets clearly peaked a couple days into the meeting, with follow-up commentary trailing off quickly after the meeting ended.


Here is the frequency of Tweets by hour, clearly bimodal:

Top hashtags (other than #bog13). Interestingly #bog14 was the most highly used hashtag, so I'm guessing lots of folks are looking forward to next year's meeting. Also, #ashg12 got lots of mentions, presumably because someone presented updated work from last year's ASHG meeting.

Here were the most prolific Tweeps - many of the usual suspects here, as well as a few new ones (new to me at least):

And finally, the requisite wordcloud:


More analysis

If you look in the analysis directory of the repo you'll find plots like these for other keywords (#rstats, metagenomics, rna-seq, and others to come). I would also like to do some sentiment analysis as Neil did in the ISMB post referenced above, but the sentiment package has since been removed from CRAN. I hear there are other packages for polarity analysis, but I haven't yet figured out how to use them. I've given you the code to do the mundane stuff (parsing the fixed-width files from t, for starters). I'd love to see someone take a stab at some further text mining / polarity / sentiment analysis!

twitterchive - archive and analyze results from a Twitter search

Monday, May 6, 2013

Three Metagenomics Papers for You

A handful of good metagenomics papers have come out over the last few months. Below I've linked to and copied my evaluation of each of these articles from F1000.

...

1. Willner, Dana, and Philip Hugenholtz. "From deep sequencing to viral tagging: Recent advances in viral metagenomics." BioEssays (2013). 

My evaluation: This review lays out some of the challenges and recent advances in viral metagenomic sequencing. There is a good discussion of library preparation and how that affects downstream sequencing. Alarmingly, they reference another paper that showed that different amplification methods resulted in detection of a completely different set of viruses (dsDNA viruses with LASL, ssDNA with MDA). The review also discusses many of the data management, analysis, and bioinformatics challenges associated with viral metagenomics.

...

2. Loman, Nicholas J., et al. "A Culture-Independent Sequence-Based Metagenomics Approach to the Investigation of an Outbreak of Shiga-Toxigenic Escherichia coli O104:H4." JAMA 309.14 (2013): 1502-1510.

My evaluation: This paper is a groundbreaking exploration of the use of metagenomics to investigate and determine the causal organism of an infectious disease outbreak. The authors retrospectively collected fecal samples from symptomatic patients from the 2011 Escherichia coli O104:H4 outbreak in Germany and performed high-throughput shotgun sequencing, followed by a sophisticated analysis to determine the outbreak's causal organism. The analysis included comparing genetic markers from many symptomatic patients' metagenomes with those of healthy controls, followed by de novo assembly of the outbreak strain from the shotgun metagenomic data. This illustrates both the power and the real limitations of using metagenomic approaches for clinical diagnostics. Also see David Relman's synopsis of the study in the same JAMA issue.

...

3. Shakya, Migun, et al. "Comparative metagenomic and rRNA microbial diversity characterization using archaeal and bacterial synthetic communities." Environmental microbiology (2013).

My evaluation: This study set out to compare shotgun metagenomic sequencing to 16S rRNA amplicon sequencing to determine the taxonomic and abundance profiles of mixed community metagenomic samples. Thus far, benchmarking metagenomic methodology has been difficult due to the lack of datasets where the underlying ground truth is known. In this study, the researchers constructed synthetic metagenomic communities consisting of 64 laboratory mixed genome DNAs of known sequence and polymerase chain reaction (PCR)-validated abundance. The researchers then compared metagenomic and 16S amplicon sequencing, using both 454 and Illumina technology, and found that metagenomic sequencing outperformed 16S sequencing in quantifying community composition. The synthetic metagenomes constructed here are publicly available (Gene Expression Omnibus [GEO] accession numbers are given in the manuscript), which represent a great asset to other researchers developing methods for amplicon-based or metagenomic approaches to sequence classification, diversity analysis, and abundance estimation.

Thursday, April 4, 2013

List of Bioinformatics Workshops and Training Resources

I frequently get asked to recommend workshops or online learning resources for bioinformatics, genomics, statistics, and programming. I compiled a list of both online learning resources and in-person workshops (preferentially highlighting those where workshop materials are freely available online):

List of Bioinformatics Workshops and Training Resources

I hope to keep the page above as up-to-date as possible. Below is a snapshot of what I have listed as of today. Please leave a comment if you're aware of any egregious omissions, and I'll update the page above as appropriate.

From http://stephenturner.us/p/edu, April 4, 2013

In-Person Workshops:

Cold Spring Harbor Courses: meetings.cshl.edu/courses.html

Cold Spring Harbor has been offering advanced workshops and short courses in the life sciences for years. Relevant workshops include Advanced Sequencing Technologies & Applications, Computational & Comparative Genomics, Programming for Biology, Statistical Methods for Functional Genomics, the Genome Access Course, and others. Unlike most of the others below, you won't find material from past years' CSHL courses available online.

Canadian Bioinformatics Workshops: bioinformatics.ca/workshops
Bioinformatics.ca, through its Canadian Bioinformatics Workshops (CBW) series, began offering one- and two-week short courses in bioinformatics, genomics and proteomics in 1999. The more recent workshops focus on training researchers who use advanced high-throughput technologies on the latest computational biology approaches for dealing with the new data. Course material from past workshops is freely available online, including both audio/video lectures and slideshows. Topics include microarray analysis, RNA-seq analysis, genome rearrangements, copy number alteration, network/pathway analysis, genome visualization, gene function prediction, functional annotation, data analysis using R, statistics for metabolomics, and much more.

UC Davis Bioinformatics Training Program: training.bioinformatics.ucdavis.edu
The UC Davis Bioinformatics Training program offers several intensive short bootcamp workshops on RNA-seq, data analysis and visualization, and cloud computing with a focus on Amazon's computing resources. They also offer a week-long Bioinformatics Short Course, covering in-depth the practical theory and application of cutting-edge next-generation sequencing techniques. Every course's documentation is freely available online, even if you didn't take the course.

MSU NGS Summer Course: bioinformatics.msu.edu/ngs-summer-course-2013
This intensive two week summer course will introduce attendees with a strong biology background to the practice of analyzing short-read sequencing data from Illumina and other next-gen platforms. The first week will introduce students to computational thinking and large-scale data analysis on UNIX platforms. The second week will focus on mapping, assembly, and analysis of short-read data for resequencing, ChIP-seq, and RNAseq. Materials from previous courses are freely available online under a CC-by-SA license.

Genetic Analysis of Complex Human Diseases: hihg.med.miami.edu/edu...
The Genetic Analysis of Complex Human Diseases is a comprehensive four-day course directed toward physician-scientists and other medical researchers. The course will introduce state-of-the-art approaches for the mapping and characterization of human inherited disorders with an emphasis on the mapping of genes involved in common and genetically complex disease phenotypes. The primary goal of this course is to provide participants with an overview of approaches to identifying genes involved in complex human diseases. At the end of the course, participants should be able to identify the key components of a study team, and communicate effectively with specialists in various areas to design and execute a study. The course is in Miami Beach, FL. (Full Disclosure: I teach a section in this course.) Most of the course material from previous years is not available online, but my RNA-seq & methylation lectures are on Figshare.

UAB Short Course on Statistical Genetics and Genomics: soph.uab.edu/ssg/...
Focusing on the state-of-the-art methodology to analyze complex traits, this five-day course will offer an interactive program to enhance researchers' ability to understand & use statistical genetic methods, as well as implement & interpret sophisticated genetic analyses. Topics include GWAS Design/Analysis/Imputation/Interpretation; Non-Mendelian Disorders Analysis; Pharmacogenetics/Pharmacogenomics; ELSI; Rare Variants & Exome Sequencing; Whole Genome Prediction; Analysis of DNA Methylation Microarray Data; Variant Calling from NGS Data; RNAseq: Experimental Design and Data Analysis; Analysis of ChIP-seq Data; Statistical Methods for NGS Data; Discovering new drugs & diagnostics from 300 billion points of data. Video recordings from the 2012 course are available online.

MBL Molecular Evolution Workshop: hermes.mbl.edu/education/...
One of the longest-running courses listed here (est. 1988), the Workshop on Molecular Evolution at Woods Hole presents a series of lectures, discussions, and bioinformatic exercises that span contemporary topics in molecular evolution. The course addresses phylogenetic analysis, population genetics, database and sequence matching, molecular evolution and development, and comparative genomics, using software packages including AWTY, BEAST, BEST, Clustal W/X, FASTA, FigTree, GARLI, MIGRATE, LAMARC, MAFFT, MP-EST, MrBayes, PAML, PAUP*, PHYLIP, STEM, STEM-hy, and SeaView. Some of the course materials can be found by digging around the course wiki.


Online Material:


Canadian Bioinformatics Workshops: bioinformatics.ca/workshops
(In person workshop described above). Course material from past workshops is freely available online, including both audio/video lectures and slideshows. Topics include microarray analysis, RNA-seq analysis, genome rearrangements, copy number alteration, network/pathway analysis, genome visualization, gene function prediction, functional annotation, data analysis using R, statistics for metabolomics, and much more.

UC Davis Bioinformatics Training Program: training.bioinformatics.ucdavis.edu
(In person workshop described above). Every course's documentation is freely available online, even if you didn't take the course. Past topics include Galaxy, Bioinformatics for NGS, cloud computing, and RNA-seq.

MSU NGS Summer Course: bioinformatics.msu.edu/ngs-summer-course-2013
(In person workshop described above). Materials from previous courses are freely available online under a CC-by-SA license, which cover mapping, assembly, and analysis of short-read data for resequencing, ChIP-seq, and RNAseq.

EMBL-EBI Train Online: www.ebi.ac.uk/training/online
Train online provides free courses on Europe's most widely used data resources, created by experts at EMBL-EBI and collaborating institutes. Topics include Genes and Genomes, Gene Expression, Interactions, Pathways, and Networks, and others. Of particular interest may be the Practical Course on Analysis of High-Throughput Sequencing Data, which covers Bioconductor packages for short read analysis, ChIP-Seq, RNA-seq, and allele-specific expression & eQTLs.

UC Riverside Bioinformatics Manuals: manuals.bioinformatics.ucr.edu
This is an excellent collection of manuals and code snippets. Topics include Programming in R, R+Bioconductor, Sequence Analysis with R and Bioconductor, NGS analysis with Galaxy and IGV, basic Linux skills, and others.

Software Carpentry: software-carpentry.org
Software Carpentry helps researchers be more productive by teaching them basic computing skills. We recently ran a 2-day Software Carpentry Bootcamp here at UVA. Check out the online lectures for some introductory material on Unix, Python, Version Control, Databases, Automation, and many other topics.

Coursera: coursera.org/courses
Coursera partners with top universities to offer courses online for anyone to take, for free. Courses are usually 4-6 weeks, and consist of video lectures, quizzes, assignments, and exams. Joining a course gives you access to the course's forum where you can interact with the instructor and other participants. Relevant courses include Data Analysis, Computing for Data Analysis using R, and Bioinformatics Algorithms, among others. You can also view all of Jeff Leek's Data Analysis lectures on YouTube.

Rosalind: http://rosalind.info
Quite different from the others listed here, Rosalind is a platform for learning bioinformatics through gaming-like problem solving. Visit the Python Village to learn the basics of Python. Arm yourself at the Bioinformatics Armory, equipping yourself with existing ready-to-use bioinformatics software tools. Or storm the Bioinformatics Stronghold, implementing your own algorithms for computational mass spectrometry, alignment, dynamic programming, genome assembly, genome rearrangements, phylogeny, probability, string algorithms and others.


Other Resources:


  • Titus Brown's list of bioinformatics courses: Includes a few others not listed here (also see the comments).
  • GMOD Training and Outreach: GMOD is the Generic Model Organism Database project, a collection of open source software tools for creating and managing genome-scale biological databases. This page links out to tutorials on GMOD Components such as Apollo, BioMart, Galaxy, GBrowse, MAKER, and others.
  • Seqanswers.com: A discussion forum for anything related to Bioinformatics, including Q&A, paper discussions, new software announcements, protocols, and more.
  • Biostars.org: Similar to SEQanswers, but more strictly a Q&A site.
  • BioConductor Mailing list: A very active mailing list for getting help with Bioconductor packages. Make sure you do some Google searching yourself first before posting to this list.
  • Bioconductor Events: List of upcoming and prior Bioconductor training and events worldwide.
  • Learn Galaxy: Screencasts and tutorials for learning to use Galaxy.
  • Galaxy Event Horizon: Worldwide Galaxy-related events (workshops, training, user meetings) are listed here.
  • Galaxy RNA-Seq Exercise: Run through a small RNA-seq study from start to finish using Galaxy.
  • Rafael Irizarry's Youtube Channel: Several statistics and bioinformatics video lectures.
  • PLoS Comp Bio Online Bioinformatics Curriculum: A perspective paper by David B Searls outlining a series of free online learning initiatives for beginning to advanced training in biology, biochemistry, genetics, computational biology, genomics, math, statistics, computer science, programming, web development, databases, parallel computing, image processing, AI, NLP, and more.
  • Getting Genetics Done: Shameless plug – I write a blog highlighting literature of interest, new tools, and occasionally tutorials in genetics, statistics, and bioinformatics. I recently wrote this post about how to stay current in bioinformatics & genomics.
Creative Commons License
Getting Genetics Done by Stephen Turner is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported License.