Thursday, May 30, 2013

PLATO, an Alternative to PLINK

Since the near beginning of genome-wide association studies, the PLINK software package (developed by Shaun Purcell’s group at the Broad Institute and MGH) has been the standard for manipulating the large-scale data produced by these studies.  Over the course of its development, numerous features and options were added to enhance its capabilities, but it is best known for the core functionality of performing quality control and standard association tests. 

Nearly 10 years ago (around the time PLINK was just getting started), the CHGR Computational Genomics Core (CGC) at Vanderbilt University started work on a similar framework for implementing genotype QC and association tests.  This project, called PLATO, has stayed active primarily to provide functionality and control that (for one reason or another) is unavailable in PLINK.  We have found it especially useful for processing ADME and DMET panel data – it supports QC and association tests of multi-allelic variants.    

PLATO runs via command line interface, but accepts a batch file that allows users to specify an order of operations for QC filtering steps.  When running multiple QC steps in a single run of PLINK, the order of application is hard-coded and not well documented.  As a result, users wanting this level of control must run a sequence of PLINK commands, generating new data files at each step leading to longer compute times and disk usage.  PLATO also has a variety of data reformatting options for other genetic analysis programs, making it easy to run EIGENSTRAT, for example.

The detail of QC output from each of the filtering steps is much greater in PLATO, allowing output per group (founders only, parents only, etc), and giving more details on why samples fail sex checks, Hardy-Weinberg checks, and Mendelian inconsistencies to facilitate deeper investigation of these errors.  And with family data, disabling samples due to poor genotype quality retains pedigree information useful for phasing and transmission tests. Full documentation and download links can be found here (https://chgr.mc.vanderbilt.edu/plato).  Special thanks to Yuki Bradford in the CGC for her thoughts on this post.  

Wednesday, May 15, 2013

Automated Archival and Visual Analysis of Tweets Mentioning #bog13, Bioinformatics, #rstats, and Others

Automatically Archiving Twitter Results

Ever since Twitter gamed its own API and killed off great services like IFTTT triggers, I've been looking for a way to automatically archive tweets containing certain search terms of interest to me. Twitter's built-in search is limited, and I wanted to archive interesting tweets for future reference and to start playing around with some basic text / trend analysis.

Enter t - the twitter command-line interface. t is a command-line power tool for doing all sorts of powerful Twitter queries using the command line. See t's documentation for examples.

I wrote this script that uses the t utility to search Twitter separately for a set of specified keywords, and append those results to a file. The comments at the end of the script also show you how to commit changes to a git repository, push to GitHub, and automate the entire process to run twice a day with a cron job. Here's the code as of May 14, 2013:

#!/bin/sh
## twitterchive.sh
## Stephen Turner (stephenturner.us)
##
## Script uses the t command line client (https://github.com/sferik/t)
## to search twitter for keywords stored in the arr variable below.
##
## Must first install the t gem and authenticate with OAuth.
##
## Twitter enforces some API limits to how many tweets you can search for
## in one query, and how many queries you can execute in a given period.
##
## I'm not sure what these limitations are, but I've hit them a few times.
## To be safe, I would limit the number of queries to ~5, $n to ~200, and
## run no more than a couple times per day.
## declare an array variable containing all your search terms.
## prefix any hashtags with a \
declare -a arr=(bioinformatics metagenomics rna-seq \#rstats)
## How many results would you like for each query?
n=250
## cd into where the script is being executed from.
DIR="$(dirname "$(readlink $0)")"
cd $DIR
echo $DIR
echo $(pwd)
echo
## now loop through the above array
for query in ${arr[@]}
do
## if your query contains a hashtag, remove the "#" from the filename
filename=$DIR/${query/\#/}.txt
echo "Query:\t$query"
echo "File:\t$filename"
## create the file for storing tweets if it doesn't already exist.
if [ ! -f $filename ]
then
touch $filename
fi
## use t (https://github.com/sferik/t) to search the last $n tweets in the query,
## concatenating that output with the existing file, sort and uniq that, then
## write the results to a tmp file.
search_cmd="t search all -ldn $n '$query' | cat - $filename | sort | uniq | grep -v ^ID > $DIR/tmp"
echo "Search:\t$search_cmd"
eval $search_cmd
## rename the tmp file to the original filename
rename_cmd="mv $DIR/tmp $filename"
echo "Rename:\t$rename_cmd"
eval $rename_cmd
echo
done
## push changes to github.
## errors running git push via cron necessitated authenticating over ssh instead of https
# git commit -a -m "Update search results: $(date)"
# git push origin master
## Run with a cronjob: 00 12 * * * cd /path/to/twitterchive/ && ./twitterchive.sh
view raw twitterchive.sh hosted with ❤ by GitHub


That script, and results for searching for "bioinformatics", "metagenomics", "#rstats", "rna-seq", and "#bog13" (the Biology of Genomes 2013 meeting) are all in the GitHub repository below. (Please note that these results update dynamically, and searching Twitter at any point could possibly result in returning some unsavory Tweets.)

https://github.com/stephenturner/twitterchive

Analyzing Tweets using R

You'll also find an analysis subdirectory, containing some R code to produce barplots showing the number of tweets per day over the last month, frequency of tweets by hour of the day, the most used hashtags within a search, the most prolific tweeters, and a ubiquitous word cloud. Much of this code is inspired by Neil Saunders's analysis of Tweets from ISMB 2012. Here's the code as of May 14, 2013:

## Most of this code was adapted near-verbatim from Neil's post about ISMB 2012.
## http://nsaunders.wordpress.com/2012/08/16/twitter-coverage-of-the-ismb-2012-meeting-some-statistics/
## Modify this. This is where I keep this repo.
repoDir <- ("~/code/twitterchive/")
## Go to the analysis directory
setwd(paste(repoDir, "analysis", sep=""))
## Function needs better documentation
twitterchivePlots <- function (filename=NULL) {
## Load required packages
require(tm)
require(wordcloud)
require(RColorBrewer)
if (class(filename)!="character") stop("filename must be character")
if (!file.exists(filename)) stop(paste("File does not exist:", filename))
searchTerm <- sub("\\.txt", "", basename(filename))
message(paste("Filename:", filename))
message(paste("Search Term: ", searchTerm))
## Read in the data and munge around the dates.
## I can't promise the fixed widths will always work out for you.
message("Reading in data.")
trim.whitespace <- function(x) gsub("^\\s+|\\s+$", "", x) # Function to trim leading and trailing whitespace from character vectors.
d <- read.fwf(filename, widths=c(18, 14, 18, 1000), stringsAsFactors=FALSE, comment.char="")
d <- as.data.frame(sapply(d, trim.whitespace))
names(d) <- c("id", "datetime", "user", "text")
d$user <- sub("@", "", d$user)
d$datetime <- as.POSIXlt(d$datetime, format="%b %d %H:%M")
d$date <- as.Date(d$datetime)
d$hour <- d$datetime$hour
d <- na.omit(d) # CRs cause a problem. explain this later.
head(d)
## Number of tweets by date for the last n days
recentDays <- 30
message(paste("Plotting number of tweets by date in the last", recentDays, "days."))
recent <- subset(d, date>=(max(date)-recentDays))
byDate <- as.data.frame(table(recent$date))
names(byDate) <- c("date", "tweets")
png(paste(searchTerm, "barplot-tweets-by-date.png", sep="--"), w=1000, h=700)
par(mar=c(8.5,4,4,1))
with(byDate, barplot(tweets, names=date, col="black", las=2, cex.names=1.2, cex.axis=1.2, mar=c(10,4,4,1), main=paste("Number of Tweets by Date", paste("Term:", searchTerm), sep="\n")))
dev.off()
# ggplot(byDate) + geom_bar(aes(date, tweets), stat="identity", fill="black") + theme_bw() + ggtitle("Number of Tweets by Date") + theme(axis.text.x=element_text(angle=90, hjust=1))
## Number of tweets by hour
message("Plotting number of tweets by hour.")
byHour <- as.data.frame(table(d$hour))
names(byHour) <- c("hour", "tweets")
png(paste(searchTerm, "barplot-tweets-by-hour.png", sep="--"), w=1000, h=700)
with(byHour, barplot(tweets, names.arg=hour, col="black", las=1, cex.names=1.2, cex.axis=1.2, main=paste("Number of Tweets by Hour", paste("Term:", searchTerm), paste("Date:", Sys.Date()), sep="\n")))
dev.off()
# ggplot(byHour) + geom_bar(aes(hour, tweets), stat="identity", fill="black") + theme_bw() + ggtitle("Number of Tweets by Hour")
## Barplot of top 20 hashtags
message("Plotting top 20 hashtags.")
words <- unlist(strsplit(d$text, " "))
head(table(words))
ht <- words[grep("^#", words)]
ht <- tolower(ht)
ht <- gsub("[^A-Za-z0-9]", "", ht) # remove anything not starting with a letter or number
ht <- as.data.frame(table(ht))
ht <- subset(ht, ht!="") # remove blanks
ht <- ht[sort.list(ht$Freq, decreasing=TRUE), ]
ht <- ht[-1, ] # remove the term you're searching for? it usually dominates the results.
ht <- head(ht, 20)
head(ht)
png(paste(searchTerm, "barplot-top-hashtags.png", sep="--"), w=1000, h=700)
par(mar=c(5,10,4,2))
with(ht[order(ht$Freq), ], barplot(Freq, names=ht, horiz=T, col="black", las=1, cex.names=1.2, cex.axis=1.2, main=paste("Number of Tweets by Hour", paste("Term:", searchTerm), paste("Date:", Sys.Date()), sep="\n")))
dev.off()
# ggplot(ht) + geom_bar(aes(ht, Freq), fill = "black", stat="identity") + coord_flip() + theme_bw() + ggtitle("Top hashtags")
## Top Users
message("Plotting most prolific users.")
users <- as.data.frame(table(d$user))
colnames(users) <- c("user", "tweets")
users <- users[order(users$tweets, decreasing=T), ]
users <- subset(users, user!=searchTerm)
users <- head(users, 20)
head(users)
png(paste(searchTerm, "barplot-top-users.png", sep="--"), w=1000, h=700)
par(mar=c(5,10,4,2))
with(users[order(users$tweets), ], barplot(tweets, names=user, horiz=T, col="black", las=1, cex.names=1.2, cex.axis=1.2, main=paste("Most prolific users", paste("Term:", searchTerm), paste("Date:", Sys.Date()), sep="\n")))
dev.off()
## Word clouds
message("Plotting a wordcloud.")
words <- unlist(strsplit(d$text, " "))
words <- grep("^[A-Za-z0-9]+$", words, value=T)
words <- tolower(words)
words <- words[-grep("^[rm]t$", words)] # remove "RT"
words <- words[!(words %in% stopwords("en"))] # remove stop words
words <- words[!(words %in% c("mt", "rt", "via", "using", 1:9))] # remove RTs, MTs, via, and single digits.
wordstable <- as.data.frame(table(words))
wordstable <- wordstable[order(wordstable$Freq, decreasing=T), ]
wordstable <- wordstable[-1, ] # remove the hashtag you're searching for? need to functionalize this.
head(wordstable)
png(paste(searchTerm, "wordcloud.png", sep="--"), w=800, h=800)
wordcloud(wordstable$words, wordstable$Freq, scale = c(8, .2), min.freq = 3, max.words = 200, random.order = FALSE, rot.per = .15, colors = brewer.pal(8, "Dark2"))
#mtext(paste(paste("Term:", searchTerm), paste("Date:", Sys.Date()), sep=";"), cex=1.5)
dev.off()
message(paste(searchTerm, ": All done!\n"))
}
filelist <- list("../bioinformatics.txt", "../metagenomics.txt", "../rstats.txt", "../rna-seq.txt")
lapply(filelist, twitterchivePlots)
view raw twitterchive.r hosted with ❤ by GitHub


Also in that analysis directory you'll see periodically updated plots for the results of the queries above.

Analyzing Tweets mentioning "bioinformatics"

Using the bioinformatics query, here are the number of tweets per day over the last month:


Here is the frequency of "bioinformatics" tweets by hour:

Here are the most used hashtags (other than #bioinformatics):

Here are the most prolific bioinformatics Tweeps:

Here's a wordcloud for all the bioinformatics Tweets since March:

Analyzing Tweets mentioning "#bog13"

The 2013 CSHL Biology of Genomes Meeting took place May 7-11, 2013. I searched and archived Tweets mentioning #bog13 from May 1 through May 14 using this script. You'll notice in the code above that I'm no longer archiving this hashtag. I probably need a better way to temporarily add keywords to the search, but I haven't gotten there yet.

Here are the number of Tweets per day during that period. Tweets clearly peaked a couple days into the meeting, with follow-up commentary trailing off quickly after the meeting ended.


Here is the frequency frequency of Tweets by hour, clearly bimodal:

Top hashtags (other than #bog13). Interestingly #bog14 was the most highly used hashtag, so I'm guessing lots of folks are looking forward to next years' meeting. Also, #ashg12 got lots of mentions, presumably because someone presented updated work from last years' ASHG meeting.

Here were the most prolific Tweeps - many of the usual suspects here, as well as a few new ones (new to me at least):

And finally, the requisite wordcloud:


More analysis

If you look in the analysis directory of the repo you'll find plots like these for other keywords (#rstats, metagenomics, rna-seq, and others to come). I would also like to do some sentiment analysis as Neil did in the ISMB post referenced above, but the sentiment package has since been removed from CRAN. I hear there are other packages for polarity analysis, but I haven't yet figured out how to use them. I've given you the code to do the mundane stuff (parsing the fixed-width files from t, for starters). I'd love to see someone take a stab at some further text mining / polarity / sentiment analysis!

twitterchive - archive and analyze results from a Twitter search

Monday, May 6, 2013

Three Metagenomics Papers for You

A handful of good metagenomics papers have come out over the last few months. Below I've linked to and copied my evaluation of each of these articles from F1000.

...

1. Willner, Dana, and Philip Hugenholtz. "From deep sequencing to viral tagging: Recent advances in viral metagenomics." BioEssays (2013). 

My evaluation: This review lays out some of the challenges and recent advances in viral metagenomic sequencing. There is a good discussion of library preparation and how that affects downstream sequencing. Alarmingly, they reference another paper that showed that different amplification methods resulted in detection of a completely different set of viruses (dsDNA viruses with LASL, ssDNA with MDA). The review also discusses many of the data management, analysis, and bioinformatics challenges associated with viral metagenomics.

...

2. Loman, Nicholas J., et al. "A Culture-Independent Sequence-Based Metagenomics Approach to the Investigation of an Outbreak of Shiga-Toxigenic Escherichia coli O104: H4Outbreak of Shiga-toxigenic Escherichia coli." JAMA 309.14 (2013): 1502-1510.

My evaluation: This paper is a groundbreaking exploration of the use of metagenomics to investigate and determine the causal organism of an infectious disease outbreak. The authors retrospectively collected fecal samples from symptomatic patients from the 2011 Escherichia coli O104:H4 outbreak in Germany and performed high-throughput shotgun sequencing, followed by a sophisticated analysis to determine the outbreak's causal organism. The analysis included comparing genetic markers from many symptomatic patients' metagenomes with those of healthy controls, followed by de novo assembly of the outbreak strain from the shotgun metagenomic data. This illustrates both the power, but the real limitations, of using metagenomic approaches for clinical diagnostics. Also see David Relman's synopsis of the study in the same JAMA issue

...

3. Shakya, Migun, et al. "Comparative metagenomic and rRNA microbial diversity characterization using archaeal and bacterial synthetic communities." Environmental microbiology (2013).

My evaluation: This study set out to compare shotgun metagenomic sequencing to 16S rRNA amplicon sequencing to determine the taxonomic and abundance profiles of mixed community metagenomic samples. Thus far, benchmarking metagenomic methodology has been difficult due to the lack of datasets where the underlying ground truth is known. In this study, the researchers constructed synthetic metagenomic communities consisting of 64 laboratory mixed genome DNAs of known sequence and polymerase chain reaction (PCR)-validated abundance. The researchers then compared metagenomic and 16S amplicon sequencing, using both 454 and Illumina technology, and found that metagenomic sequencing outperformed 16S sequencing in quantifying community composition. The synthetic metagenomes constructed here are publicly available (Gene Expression Omnibus [GEO] accession numbers are given in the manuscript), which represent a great asset to other researchers developing methods for amplicon-based or metagenomic approaches to sequence classification, diversity analysis, and abundance estimation.
Creative Commons License
Getting Genetics Done by Stephen Turner is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported License.