Tuesday, July 17, 2012

Plotting the Frequency of Twitter Hashtag Usage Over Time with R and ggplot2

The 20th annual ISMB meeting was held over the last week in Long Beach, CA. It was an incredible meeting with lots of interesting and relevant talks, and lots of folks were tweeting the conference, usually with at least a few people in each concurrent session. I wrote the code below that uses the twitteR package to pull all the tweets about the meeting under the #ISMB hashtag. You can download that raw data here. I then use ggplot2 to plot the frequency of tweets about #ISMB over time in two hour windows for each day of the last week.

The code below also tabulates the total number of tweets by username, and plots the 40 most prolific. Interestingly several of the folks in this list weren't even at the meeting.


I'll update the plots above at the conclusion of the meeting.

Here's the code below. There's a limitation with this code - you can only retrieve a maximum of 1500 tweets per query without authenticating via OAuth before you receive a 403 error. The twitteR package had a good vignette about how to use the ROAuth package to do this, but I was never able to get it to work properly. The version on CRAN (0.9.1) has known issues, but even when rolling back to 0.9.0 or upgrading to 0.9.2 from the author's homepage, I still received the 403 signal. So my hackjob workaround was to write a loop to fetch all the tweets one day at a time and then flatten this into a single list before converting to a data frame. You still run into the limitation of only being able to retrieve the first 1500 for each day, but #ISMB never had more than 1500 any one day. If you can solve my ROAuth problem, please leave a comment or fork the code on GitHub.

library(twitteR)
library(ggplot2)
tweets <- list()
dates <- paste("2012-07-",11:18,sep="") # need to go to 18th to catch tweets from 17th
for (i in 2:length(dates)) {
print(paste(dates[i-1], dates[i]))
tweets <- c(tweets, searchTwitter("#ISMB", since=dates[i-1], until=dates[i], n=1500))
}
# Convert the list to a data frame
tweets <- twListToDF(tweets)
tweets <- unique(tweets)
# To ensure accuracy, make sure that there were no more than 1500 tweets in a single day.
# If there are 1500 on any single day, then you're truncating that day's tweets, and you'll
# need to try to get ROAuth (below) working.
tweets$date <- format(tweets$created, format="%Y-%m-%d")
table(tweets$date)
# @sciencestream is a spambot that's RT'ing everything on the #ISMB tag. Get rid of those.
tweets <- tweets[which(tweets$screenName!="sciencestream"), ]
# Make a table of the number of tweets per user
d <- as.data.frame(table(tweets$screenName))
d <- d[order(d$Freq, decreasing=T), ]
names(d) <- c("User","Tweets")
head(d)
# Plot the table above for the top 40
png("ismb-users.png", w=700, h=1000)
par(mar=c(5,10,2,2))
with(d[rev(1:40), ], barplot(Tweets, names=User, horiz=T, las=1, main="Top 40: Tweets per User", col=1))
dev.off()
# Plot the frequency of tweets over time in two hour windows
# Modified from http://michaelbommarito.com/2011/03/12/a-quick-look-at-march11-saudi-tweets/
minutes <- 120
ggplot(data=tweets, aes(x=created)) +
geom_bar(aes(fill=..count..), binwidth=60*minutes) +
scale_x_datetime("Date") +
scale_y_continuous("Frequency") +
opts(title="#ISMB Tweet Frequency July 11-17", legend.position='none')
ggsave(file='ismb-frequency.png', width=7, height=7, dpi=100)
# Use imagemagick to stitch together. Imagemagick must be installed in your path.
# montage ismb-frequency.png ismb-users.png -tile 1x -geometry -0-0 montage.png
# system("montage ismb-frequency.png ismb-users.png -tile 1x -geometry -0-0 montage.png")
#------------------------------------ Using the ROAuth package ---------------------------------------#
# ## If you can get this to work it's a bit more flexible and doesn't have the 1500/day limit as above.
# ## Using ROAuth will theoretically allow you to retrieve more than 1500 tweets with a single query.
# ## The current version on cran, 0.9.1, has known problems. Supposedly rolling back to 0.9.0 would work,
# ## and it did return TRUE after registerTwitterOAuth(cred) after the handshake, but I kept getting
# ## forbidden errors when trying to retrieve more than 1500. Also happened with version 0.9.2.
#
# ## Current ROAuth from CRAN is 0.9.1. Has problems with handshake.
# install.packages("ROAuth")
# remove.packages("ROAuth")
#
# ## Install 0.9.1 from CRAN (doesn't work)
# install.packages("ROAuth")
#
# ## Using ROAuth 0.9.0 from source
# download.file("http://cran.r-project.org/src/contrib/Archive/ROAuth/ROAuth_0.9.0.tar.gz", destfile="ROAuth_0.9.0.tar.gz")
# install.packages("ROAuth_0.9.0.tar.gz", repos = NULL, type="source")
# library(ROAuth)
#
# ## Using ROAuth 0.9.2 from source
# download.file("http://geoffjentry.hexdump.org/ROAuth_0.9.2.tar.gz", destfile="ROAuth_0.9.2.tar.gz")
# install.packages("ROAuth_0.9.2.tar.gz", repos = NULL, type="source")
# library(ROAuth)
#
# ## The twitteR vignette has decent instructions
# vignette("twitteR")
#
# cred <- OAuthFactory$new(consumerKey="your_consumer_key_here",
# consumerSecret="your_secret_key_here",
# requestURL="https://api.twitter.com/oauth/request_token",
# accessURL="http://api.twitter.com/oauth/access_token",
# authURL="http://api.twitter.com/oauth/authorize")
# cred$handshake()
#
# ## You can save and load your credential object once you complete the handshake
# save(cred, file="~/Dropbox/code/misc/cred.RData")
# load("~/Dropbox/code/misc/cred.RData")
#
# registerTwitterOAuth(cred)
#
# tweets <- searchTwitter("#ISMB", since="2012-07-11", until="2012-07-18", n=9999)
# tweets <- twListToDF(tweets)
# ## Continue as above
view raw ismb_twitter.R hosted with ❤ by GitHub



Friday, July 6, 2012

Fix Overplotting with Colored Contour Lines

I saw this plot in the supplement of a recent paper comparing microarray results to RNA-seq results. Nothing earth-shattering in the paper - you've probably seen a similar comparison many times before - but I liked how they solved the overplotting problem using heat-colored contour lines to indicate density. I asked how to reproduce this figure using R on Stack Exchange, and my question was quickly answered by Christophe Lalanne.

Here's the R code to generate the data and all the figures here.
# Generate some data
library(MASS)
set.seed(101)
n <- 50000
X <- mvrnorm(n, mu=c(.5,2.5), Sigma=matrix(c(1,.6,.6,1), ncol=2))
# A color palette from blue to yellow to red
library(RColorBrewer)
k <- 11
my.cols <- rev(brewer.pal(k, "RdYlBu"))
## compute 2D kernel density, see MASS book, pp. 130-131
z <- kde2d(X[,1], X[,2], n=50)
# Make the base plot
plot(X, xlab="X label", ylab="Y label", pch=19, cex=.4)
# Draw the colored contour lines
contour(z, drawlabels=FALSE, nlevels=k, col=my.cols, add=TRUE, lwd=2)
# Add lines for the mean of X and Y
abline(h=mean(X[,2]), v=mean(X[,1]), col="gray", lwd=1.5)
# Add the correlation coefficient to the top left corner
legend("topleft", paste("R=", round(cor(X)[1,2],3)), bty="n")
## Other methods to fix overplotting
# Make points smaller - use a single pixel as the plotting charachter
plot(X, pch=".")
# Hexbinning
library(hexbin)
plot(hexbin(X[,1], X[,2]))
# Make points semi-transparent
library(ggplot2)
qplot(X[,1], X[,2], alpha=I(.1))
# The smoothScatter function (graphics package)
smoothScatter(X)


Here's the problem: there are 50,000 points in this plot causing extreme overplotting. (This is a simple multivariate normal distribution, but if the distribution were more complex, overplotting might obscure a relationship in the data that you didn't know about).
I liked the solution they used in the paper referenced above. Contour lines were placed throughout the data indicating the density of the data in that region. Further, the contour lines were "heat" colored from blue to red, indicating increasing data density. Optionally, you can add vertical and horizontal lines that intersect the means, and a legend that includes the absolute correlation coefficient between the two variables.

There are many other ways to solve an overplotting problem - reducing the size of the points, making points transparent, using hex-binning. 


Using a single pixel for each data point:




Using hexagonal binning to display density (hexbin package):


Finally, using semi-transparency (10% opacity; easiest using the ggplot2 package):


Edit July 7, 2012 - From Pete's comment below, the smoothScatter() function in the build in graphics package produces a smoothed color density representation of the scatterplot, obtained through a kernel density estimate. You can change the colors using the colramp option, and change how many outliers are plotted with the nrpoints option. Here, 100 outliers are plotted as single black pixels - outliers here being points in the areas of lowest regional density.



How do you deal with overplotting when you have many points?
Creative Commons License
Getting Genetics Done by Stephen Turner is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported License.