
Wednesday, February 1, 2017

Staying Current in Bioinformatics & Genomics: 2017 Edition

A while back I wrote this post about how I stay current in bioinformatics & genomics. That was nearly five years ago. A lot has changed since then. A few links are dead. Some of the blogs or Twitter accounts I mentioned have shifted focus or haven’t been updated in years (guilty as charged). The way we consume media has evolved — Google thought they could kill off RSS (long live RSS!), there are many new literature alert services, preprints have really taken off in this field, and many more scientists are engaging via social media than before.

People still frequently ask me how I stay current and keep a finger on the pulse of the field. I’m not claiming to be able to do this well — that’s a near-impossible task for anyone. Five years later and I still run our bioinformatics core, and I’m still mostly focused on applied methodology and study design rather than any particular phenotype, model system, disease, or specific method. It helps me to know that transcript-level estimates improve gene-level inferences from RNA-seq data, and that there’s software to help me do this, but the details underlying kmer shredding vs pseudoalignment to a transcriptome de Bruijn graph aren’t as important to me as knowing that there’s a software implementation that’s well documented, actively supported, and performs well in fair benchmarks. As such, most of what I pay attention to is applied/methods-focused.

What follows is a scattershot, noncomprehensive guide to the people, blogs, news outlets, journals, and aggregators that I lean on in an attempt to stay on top of things. I’ve inevitably omitted some key resources, so please don’t be offended if you don’t see your name/blog/Twitter/etc. listed here (drop a link in the comments!). Whatever I write here now will be out of date in no time, so I’ll try to write an update post every year instead of every five.

Twitter

In the 2012 post I ended with Twitter, but I have to lead with it this time. Twitter is probably my most valuable resource for learning about the bleeding-edge developments in genomics & bioinformatics. It’s great for learning what’s new and contributing to the dialogue in your field, but only when used effectively.

I aggressively prune the list of people I follow to keep what I see relevant and engaging. I can tolerate an occasional digression into politics, posting pictures of you drinking with colleagues at a conference, or self-congratulatory announcements. But once these off-topic Tweets become the norm, I unfollow. I also rely on the built-in list feature. I follow a few hundred people, but I only add a select few dozen to a “notjunk” list that I look at when I’m short on time. Folks in this list don’t Tweet too often and have a high signal-to-noise ratio (as far as what I’m interested in reading). If I don’t get a chance to catch up on my entire timeline, I can at least breeze through recent Tweets from folks on this list.

I’m also wary of following extremely prolific users. For example — if someone’s been on Twitter less than a year, already has 20,000 Tweets, but only 100 followers, it tells me they’ve got a lot to say but nobody cares. I let the hive mind work for me in this case, using this Tweet-to-follower ratio as sort of a proxy for signal-to-noise.
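The Tweet-to-follower heuristic above is easy to put into numbers. Here is a minimal sketch of that arithmetic. The first account uses the figures from the example above; the second account and its numbers are made up for contrast, and the function name is hypothetical.

```python
def tweets_per_follower(tweet_count, follower_count):
    """Return Tweets per follower; a higher value hints at more noise."""
    # Guard against division by zero for accounts with no followers.
    return tweet_count / max(follower_count, 1)


# (tweet_count, follower_count) pairs.
accounts = {
    "prolific_new_user": (20_000, 100),          # the example from the text
    "hypothetical_quiet_expert": (1_500, 6_000),  # made-up numbers
}

noise_ratio = {
    name: tweets_per_follower(tweets, followers)
    for name, (tweets, followers) in accounts.items()
}

# Print accounts from most to least promising (lowest ratio first).
for name, ratio in sorted(noise_ratio.items(), key=lambda kv: kv[1]):
    print(f"{name}: {ratio:.2f} Tweets per follower")
```

This is only a rough proxy, as the text says. It ignores account age and says nothing about whether the Tweets are on-topic.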

I mostly follow individuals and aggregators, but I also follow a few organization accounts. These can be a mixed bag. Only a few organization accounts do this well, delivering interesting and applicable content to a targeted audience, while many more are poor attempts at marketing and self-promotion while not offering any substantive value or interesting content.

Individuals: In no particular order, here’s an incomplete list of people who Tweet content that I find consistently on-topic and interesting.

Aaron Quinlan (aaronquinlan)
Adam Phillippy (aphillippy)
Andrew Severin (isugif)
Casey Greene (GreeneScientist)
Clive Brown (Clive_G_Brown)
Dan MacArthur (dgmacarthur)
David Robinson (drob)
Elisabeth Bik (MicrobiomDigest)
Frank Harrell (f2harrell)
Hadley Wickham (hadleywickham)
Heng Li (lh3lh3)
James Hadfield (coregenomics)
Jared Simpson (jaredtsimpson)
Jeff Leek (jtleek)
Jenny Bryan (JennyBryan)
Julia Silge (juliasilge)
Krista Ternus (KristaTernus)
Lex Nederbragt (lexnederbragt)
Lior Pachter (lpachter)
Mick Watson (biomickwatson)
Mike Love (mikelove)
Nick Loman (pathogenomenick)
Nicolas Robine (notSoJunkDNA)
Phil Ashton (flashton2003)
RNA-seq Blog (rnaseqblog)
Rob Patro (nomad421)
Roger Peng (rdpeng)
Sam Minot (sminot)
Sean Davis (seandavis12)
Titus Brown (ctitusbrown)
Torsten Seemann (torstenseemann)
Tuuli Lappalainen (tuuliel)
Vince Buffalo (vsbuffalo)
Willem van Schaik (WvSchaik)
Zamin Iqbal (ZaminIqbal)
Many more I’m failing to specifically mention…

Others: Besides individual accounts, there are also a number of aggregators and organizations that I keep on a high signal-to-noise list.

bioRxiv (biorxivpreprint)
bioRxiv Bioinfo (biorxiv_bioinfo)
bioRxiv Genomics (biorxiv_genomic)
Metagenomics Papers (metagenomic_lit)
InformaticsGW (UduakGW)
Hacker News 300 (newsyc300)
CompBiolPapers (compbiolpapers)
RNA-seq paper aggregator (RNA_seq)
Bioconductor (Bioconductor)
RStudio Tips (rstudiotips)

Blogs

I follow these and other blogs using RSS. I’ve been happy with the free version of Feedly ever since Google Reader was killed. The web interface and iOS app have everything I need, and they both integrate nicely with other services like Evernote, Instapaper, Buffer, Twitter, etc. If you can’t find a direct link to the blog’s RSS feed, you can usually type the name of the blog into Feedly’s search bar and it’ll find it for you. Similar to my “notjunk” list in Twitter, I have a Favorites category in Feedly where I include only the feeds I absolutely wouldn’t want to miss.

These are some of the few that I try to read whenever something new is posted, and Feedly helps me keep those organized, either by “starring” something I want to come back to, or saving it for later with Instapaper. They’re in no particular order, and I’m sure I’ve forgotten something.
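If you ever want to peek at a feed without a reader, an RSS 2.0 feed is just XML with one `item` element per post. The sketch below lists titles and links using only Python's standard library. The sample feed, its URLs, and the function name are hypothetical; a real feed would be downloaded first, and Atom feeds use different element names.

```python
import xml.etree.ElementTree as ET

# A tiny, made-up RSS 2.0 document standing in for a real blog feed.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Hypothetical Blog</title>
    <item><title>First post</title><link>https://example.org/first</link></item>
    <item><title>Second post</title><link>https://example.org/second</link></item>
  </channel>
</rss>"""


def list_items(rss_text):
    """Return (title, link) pairs for every <item> in an RSS 2.0 document."""
    root = ET.fromstring(rss_text)
    items = []
    for item in root.iter("item"):
        title = item.findtext("title", default="").strip()
        link = item.findtext("link", default="").strip()
        items.append((title, link))
    return items


for title, link in list_items(SAMPLE_RSS):
    print(f"{title}: {link}")
```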

Variance Explained: David Robinson’s blog (Data Scientist at Stack Overflow, works in R and Python).
Global Biodefense: News on pathogens, outbreaks, and preparedness, with periodic posts on genomics and bioinformatics-related developments and funding opportunities.
In between lines of code: Lex Nederbragt’s blog on biology, sequencing, bioinformatics, …
Simply Statistics: A statistics blog by Rafa Irizarry, Roger Peng, and Jeff Leek.
Bits of DNA: Reviews and commentary on computational biology by Lior Pachter (fair warning: dialogue here can get a bit heated!).
Blue Collar Bioinformatics: articles related to tool validation and the open source bioinformatics community.
Microbiome Digest - Bik’s Picks: A daily digest of scientific microbiome papers, by Elisabeth Bik, Science Editor at uBiome.
Living in an Ivory Basement: Titus Brown’s blog on metagenomics, open science, testing, reproducibility, and programming.
Enseqlopedia: James Hadfield’s blog on all things NGS.
Epistasis Blog: Jason Moore’s computational biology blog.
RStudio Blog: announcements about new RStudio functionality, updates about the tidyverse, and more.
nextgenseek.com: Next-Gen Sequencing Blog covering new developments in NGS data & analysis.
RNA-Seq Blog: Transcriptome Research & Industry News.
The Allium: We all need a little humor in our lives. Like The Onion, but for science.

Others

I’m unsure how to categorize the rest. These are things like aggregators, Q&A sites/forums, and others.

Nuzzel is something I’ve only been using for a few months but it works very well. It’s meant to solve the Twitter / social media overload problem. If you’re following a few hundred people, you could easily have thousands of Tweets per day to read through (or miss). Nuzzel emails you a daily newsletter of the most relevant content in your Twitter feed. I’m guessing it does this by analyzing how many people you follow share, retweet, or favorite the same links. I try to read everything in my RSS feeds but I could never do this with Twitter (nor should you worry about trying). Nuzzel helps you catch up on things that are trending among the people you follow. It’s not a substitute for following the right people (see the Twitter section above).
RWeekly: weekly updates from the entire R community. Offers an RSS feed but I subscribe to the weekly email. Each email sends out about 50 links with one-sentence descriptions to things being done in the R community that week.
R Bloggers aggregates RSS feeds from hundreds of blogs about R. Much more comprehensive than RWeekly, but lots to sort through.
GenomeWeb still provides high-quality original content as well as summaries of what’s going on in the field. Create an account, log in, view your profile page, and subscribe to some of their regular emails. I subscribe to their daily news, the scan, informatics, sequencing, and infectious diseases bulletins. Pro tip: Much of their content is only available for premium subscribers. If you sign up with a .edu address, you can access all this content for free.
F1000’s Smart Search is one of the few literature recommendation services that I find useful, relevant, and current. My RNA-seq and metagenomics alerts consistently deliver relevant and fresh content.
BioStars: This is a Stack Exchange-style Q&A site focused on bioinformatics, computational genomics, and biological data analysis. You can go to the homepage and sort by topic, views, answers, etc., and the platform offers several granular ways to subscribe via RSS.
Bioconductor Support: This is a Q&A site much like BioStars that replaced the Bioconductor mailing list. You can do things like limit to a certain time period and sort by views, for example, if you only want to log in occasionally to see what’s being talked about.
SEQanswers: I subscribe to all new threads in the SEQanswers bioinformatics forum, and regularly browse post titles. When something sparks my interest, I’ll click into that post and subscribe to future updates on that post via email.
Google Scholar lets you search and create email alerts.
PubMed Alerts: You can save, automate, and have search results emailed to you through your MyNCBI account. Surprisingly, these seem to be more relevant than the Google Scholar searches for the terms that I use.
PubMed Trending - I have no idea how PubMed ranks these. It seemed to be more useful in the past, but now it seems that the top “trending” articles alternate between CRISPR/Cas9, and old kinesiology / sports medicine articles.
IFTTT: If This Then That is a service that connects many different web services together in an endless number of ways. At home I might connect Facebook and Dropbox, so that whenever someone tags me in a photo, that photo is automatically downloaded to my Dropbox. At work I can connect an RSS feed to an Evernote note or Google Doc. It’s useful in so many ways, both for personal and for work-related tasks. I mostly use it here as a last safeguard so that things I really shouldn’t miss don’t slip through the cracks. I have recipes that do things like email me if certain low-volume Twitter accounts post a new Tweet, and others that automatically save things like starred articles in Feedly to Instapaper. I also use this to keep a close eye on a few accounts on GitHub. I have connections set up for a few users on GitHub so that whenever one of these users creates a new public repository, I get an email. I’ve also used IFTTT to archive Tweets coming out of various hashtags: you can create a recipe where if a new Tweet contains certain keywords or hashtags, then that Tweet is saved to Evernote, a shared Google Doc spreadsheet, etc. Zapier is a similar service that I’ve heard provides more granular control, but I haven’t tried it.
Podcasts: I listen to every episode of Roger Peng and Hilary Parker’s Not So Standard Deviations data science podcast, and most episodes of Roger Peng and Elizabeth Matsui’s The Effort Report (this one’s more about life in academia in general). I use the Overcast iOS app to listen to these and other podcasts on ~1.75X speed. (When I met Hilary at the RStudio Conference I heard her speak for the first time at regular 1X speed. Odd experience.) Finally, I just learned about the R podcast. I haven’t listened to much yet, but I’ve added it to my long Overcast queue.

Preprints!

Preprints in life sciences were nearly unheard of when I wrote the 2012 post. Now everybody’s doing it. There are still a few people using the arXiv Quantitative biology channel, and I’ll occasionally find something in PeerJ Preprints that grabs my attention.

bioRxiv is the biggest player here, hands down. The Alerts/RSS page lets you sign up for email alerts on particular topics, or subscribe to RSS feeds coming from particular categories that interest you. I subscribe to the Genomics and Bioinformatics feeds. I also follow several of bioRxiv’s top-level and category Twitter feeds (@biorxivpreprint, @biorxiv_bioinfo, and @biorxiv_genomic).

F1000 Research deserves some special attention here. It’s somewhere in between a preprint server and a peer-reviewed publication. You can upload manuscripts (or other research outputs like posters or slides), and they’re immediately and permanently published, and given a DOI. Then one or more rounds of open peer review as well as public comment take place, and authors can update the published paper for further review. Check out the transcript estimates / gene inference paper I mentioned earlier. You’ll see it’s “version 2,” and was approved by two referees. If you look at the right-hand panel, you can actually go back and see the prior version, as well as see who reviewed it, what the reviewer wrote, and how the authors responded to those reviews. It’s an innovative platform where peer review is open and transparent, and is independent of publication, since papers are published before they are reviewed, and remain regardless of the outcome of the review. F1000 Research has a number of channels that are externally curated by different organizations, societies, conferences, etc. I subscribe to and get alerts about the R package and Bioconductor channels. Whenever a new preprint is dropped into one of these channels, I’ll get an email and an RSS item.

I only recently discovered PrePubMed, which looks very useful. PrePubMed indexes preprints from arXiv q-bio, PeerJ Preprints, bioRxiv, F1000Research, preprints.org, The Winnower, Nature Precedings, and Wellcome Open Research. In the tools box on the homepage, you can enter a search string and get back an RSS feed with results from that search. It looks like PrePubMed is maintained by a single person, but he’s made the entire thing open source, so you could presumably set this up and mirror it on your own, should you check back in 2021 and the link be dead.

Journals

I started with Journals in my 2012 post, but they’re last (and probably least) here. I still subscribe to a few journals’ RSS feeds, but in most cases, by the time I see a new Table of Contents hit my RSS reader, I probably saw the publications making the rounds on Twitter, blogs, or other channels mentioned above. It’s also no longer unusual to see a “publication” land where I read the preprint on bioRxiv months ago, and perhaps even a blog post before that! What “publication” means is changing rapidly, and I’m sure the lines between a blog post, preprint, and journal article will be even blurrier in the year 2022 post.

How do you have the time to do this?

How do you not? It’s not as bad as it seems. I probably spend an hour each weekday scanning all the resources mentioned here, and I find the time well spent. I can breeze through my Twitter and RSS feeds on my bus ride into work, saving things I actually want to look at later with a bookmark, star, favorite, Instapaper, etc.

I should have prefaced this whole article with the note that I hardly ever actually fully read any of the papers or blog posts I see here. If I see, for example, a new WGS variant caller published, I’ll glance at the figures benchmarking it against GATK and FreeBayes, and skim through the documentation on the GitHub README or Bioconductor vignette. If either of these is missing or falls short, that’s usually enough for me to ignore the publication completely (don’t underestimate the importance of good documentation!).

It’s taken me a decade to compile and continually hone this list of resources to the things that I find useful and relevant. This is what works for me, now, in 2017. It’s not a one-size-fits-all, and the 2018-me will probably have a somewhat different list, but I hope you’ll find it useful. If your interests are similar to what I’ve discussed here, how do you stay current? What have I left out? Let me know in the comments!

Monday, October 28, 2013

Analysis of #ASHG2013 Tweets

I archived and analyzed all Tweets with the hashtag #ASHG2013 using my previously mentioned code.

Number of Tweets by date shows Wednesday was the most Tweeted day:

The top used hashtags other than #ASHG2013:

The most prolific users:

And what Twitter analysis would be complete without the widely loved, and more widely hated word cloud:

Edit 8:24am: I have gotten notes that some Tweets were not captured in this archive. This year's ASHG was very actively Tweeted. Unfortunately there are API limits restricting how many results I can return using the t Twitter command line client. A search during a particularly active time of day might have truncated some search results. Please feel free to send me a pull request if you think there's something I can do to improve the automated search code!

Wednesday, May 15, 2013

Automated Archival and Visual Analysis of Tweets Mentioning #bog13, Bioinformatics, #rstats, and Others

Automatically Archiving Twitter Results

Ever since Twitter gamed its own API and killed off great services like IFTTT triggers, I've been looking for a way to automatically archive tweets containing certain search terms of interest to me. Twitter's built-in search is limited, and I wanted to archive interesting tweets for future reference and to start playing around with some basic text / trend analysis.

Enter t - the Twitter command-line interface. t is a command-line power tool for running all sorts of Twitter queries. See t's documentation for examples.

I wrote this script that uses the t utility to search Twitter separately for a set of specified keywords, and append those results to a file. The comments at the end of the script also show you how to commit changes to a git repository, push to GitHub, and automate the entire process to run twice a day with a cron job. Here's the code as of May 14, 2013:

That script, and results for searching for "bioinformatics", "metagenomics", "#rstats", "rna-seq", and "#bog13" (the Biology of Genomes 2013 meeting) are all in the GitHub repository below. (Please note that these results update dynamically, and searching Twitter at any point could possibly result in returning some unsavory Tweets.)

https://github.com/stephenturner/twitterchive
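The script itself lives in the repository above. To illustrate the general idea (search per keyword, append only new results to a per-keyword file), here is a minimal Python sketch. It is not the script from the repository. It assumes the `t` client is installed and authorized and that `t search all QUERY` prints one result per line; the file naming scheme and all function names are hypothetical.

```python
import subprocess
from pathlib import Path


def append_new_lines(archive, lines):
    """Append lines not already present in the archive; return how many were added."""
    seen = set()
    if archive.exists():
        seen = set(archive.read_text(encoding="utf-8").splitlines())
    added = 0
    with archive.open("a", encoding="utf-8") as handle:
        for line in lines:
            line = line.rstrip("\n")
            if line and line not in seen:
                handle.write(line + "\n")
                seen.add(line)
                added += 1
    return added


def search_twitter(query):
    """Run a search with the t CLI (assumed installed) and return its output lines."""
    result = subprocess.run(
        ["t", "search", "all", query],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.splitlines()


def archive_queries(queries, out_dir, search=search_twitter):
    """Search each query separately and append new results to <out_dir>/<query>.txt."""
    out_dir.mkdir(parents=True, exist_ok=True)
    counts = {}
    for query in queries:
        # Hypothetical naming scheme: drop the leading '#', replace spaces.
        name = query.lstrip("#").replace(" ", "_") + ".txt"
        counts[query] = append_new_lines(out_dir / name, search(query))
    return counts


# Example call (requires the t CLI):
# archive_queries(["bioinformatics", "#rstats"], Path("archive"))
```

Passing the search function as a parameter keeps the append logic testable without touching Twitter. Running something like this twice a day from cron, followed by a git commit and push, would mirror the workflow described above.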

Analyzing Tweets using R

You'll also find an analysis subdirectory, containing some R code to produce barplots showing the number of tweets per day over the last month, frequency of tweets by hour of the day, the most used hashtags within a search, the most prolific tweeters, and a ubiquitous word cloud. Much of this code is inspired by Neil Saunders's analysis of Tweets from ISMB 2012. Here's the code as of May 14, 2013:

Also in that analysis directory you'll see periodically updated plots for the results of the queries above.
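The R code is in the repository. For readers who want to see the shape of these summaries without opening it, here is a rough Python sketch of the same counts: Tweets per day, Tweets per hour, top hashtags, and most prolific users. The three sample Tweets are invented, and the sketch assumes the archive has already been parsed into (timestamp, user, text) records, which is not the fixed-width format that t writes.

```python
import re
from collections import Counter
from datetime import datetime

# Invented sample records: (timestamp, user, text).
tweets = [
    ("2013-05-08 09:15:00", "user_a", "Great talk at #bog13 on #RNAseq"),
    ("2013-05-08 21:40:00", "user_b", "Poster session #bog13"),
    ("2013-05-09 09:05:00", "user_a", "More #bog13 notes #rnaseq #bog14"),
]


def summarize(records, exclude_tag):
    """Count Tweets per day, per hour, per hashtag, and per user."""
    per_day, per_hour, tags, users = Counter(), Counter(), Counter(), Counter()
    excluded = exclude_tag.lower().lstrip("#")
    for stamp, user, text in records:
        when = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
        per_day[when.date().isoformat()] += 1
        per_hour[when.hour] += 1
        users[user] += 1
        # Hashtags are compared case-insensitively; the searched tag is skipped.
        for tag in re.findall(r"#(\w+)", text.lower()):
            if tag != excluded:
                tags[tag] += 1
    return per_day, per_hour, tags, users


per_day, per_hour, tags, users = summarize(tweets, exclude_tag="#bog13")
print("Tweets per day:", dict(per_day))
print("Tweets per hour:", dict(per_hour))
print("Top hashtags:", tags.most_common(3))
print("Most prolific users:", users.most_common(3))
```

Each `Counter` maps directly onto one of the barplots; the word cloud would need a separate plotting library.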

Analyzing Tweets mentioning "bioinformatics"

Using the bioinformatics query, here are the number of tweets per day over the last month:

Here is the frequency of "bioinformatics" tweets by hour:

Here are the most used hashtags (other than #bioinformatics):

Here are the most prolific bioinformatics Tweeps:

Here's a wordcloud for all the bioinformatics Tweets since March:

Analyzing Tweets mentioning "#bog13"

The 2013 CSHL Biology of Genomes Meeting took place May 7-11, 2013. I searched and archived Tweets mentioning #bog13 from May 1 through May 14 using this script. You'll notice in the code above that I'm no longer archiving this hashtag. I probably need a better way to temporarily add keywords to the search, but I haven't gotten there yet.

Here are the number of Tweets per day during that period. Tweets clearly peaked a couple days into the meeting, with follow-up commentary trailing off quickly after the meeting ended.

Here is the frequency of Tweets by hour, clearly bimodal:

Top hashtags (other than #bog13). Interestingly #bog14 was the most highly used hashtag, so I'm guessing lots of folks are looking forward to next year's meeting. Also, #ashg12 got lots of mentions, presumably because someone presented updated work from last year's ASHG meeting.

Here were the most prolific Tweeps - many of the usual suspects here, as well as a few new ones (new to me at least):

And finally, the requisite wordcloud:

More analysis

If you look in the analysis directory of the repo you'll find plots like these for other keywords (#rstats, metagenomics, rna-seq, and others to come). I would also like to do some sentiment analysis as Neil did in the ISMB post referenced above, but the sentiment package has since been removed from CRAN. I hear there are other packages for polarity analysis, but I haven't yet figured out how to use them. I've given you the code to do the mundane stuff (parsing the fixed-width files from t, for starters). I'd love to see someone take a stab at some further text mining / polarity / sentiment analysis!

twitterchive - archive and analyze results from a Twitter search

Tuesday, May 29, 2012

How to Stay Current in Bioinformatics/Genomics

A few folks have asked me how I get my news and stay on top of what's going on in my field, so I thought I'd share my strategy. With so many sources of information begging for your attention, the difficulty is not necessarily finding what's interesting, but filtering out what isn't. What you don't read is just as important as what you do, so when it comes to things like RSS, Twitter, and especially e-mail, it's essential to filter out sources where the content consistently fails to be relevant or capture your interest. I run a bioinformatics core, so I'm more broadly interested in applied methodology and study design rather than any particular phenotype, model system, disease, or method. With that in mind, here's how I stay current with things that are relevant to me. Please leave comments with what you're reading and what you find useful that I omitted here.

RSS

I get the majority of my news from RSS feeds from blogs and journals in my field. I spend about 15 minutes per day going through headlines from the following sources:

Journals. Most journals have separate RSS feeds for their current table of contents as well as their advance online ahead-of-print articles.

Blogs. Some of these blogs are very relevant to what I do on the job. Others are more personal interest.

The OpenHelix Blog
Ensembl blog
Galaxy News
Blue Collar Bioinformatics
Homologus
Golden Helix - our 2 SNPs
Genomics Law Report
R-bloggers (aggregates feeds from >350 blogs about R)
Genomes Unzipped
Jason Moore's Epistasis Blog
23andMe - the Spittoon

Forums.

Mailing lists

I prefer to keep work and personal email separate, but I have all my mailing list email sent to my Gmail because Gmail's search is better than any alternative. I have a filter set up to automatically filter and tag mailing list digests under a "Work" label so I can get to them (or filter them from my inbox) easily.

Bioconductor (daily digest)
Galaxy mailing lists. I subscribe to the -announce, -user, and -dev mailing lists, but I have a Gmail filter set up so that messages from the -user and -dev lists automatically skip the inbox and are marked as read. I don't care to look at these every day, but again, it's handy to be able to use Gmail's search functionality to look through old mailing list responses.

Email Alerts & Subscriptions

Again, email can get out of hand sometimes, so I prefer to only have things that I really don't want to miss sent to my email. For the rest I use RSS.

SeqAnswers subscriptions. When I ask a question or find a question that's relevant to something I'm working on, I subscribe to that thread for email alerts whenever a new response is posted.
Google Scholar alerts. I have alerts set up to send me emails based on certain topics (e.g. [ rna-seq | transcriptome sequencing | RNA-sequencing ] or [ intitle:"chip-seq" ]), or when certain people publish (e.g. ["ritchie md" & vanderbilt]). I also use this to alert me when certain consortia publish (e.g. ["Population Architecture using Genomics and Epidemiology"]).
PubMed Saved Searches using MyNCBI, because Google Scholar doesn't catch everything. I have alerts set up for RNA-seq, ChIP-Seq, bioinformatics methods, etc.
GenomeWeb subscriptions. Most of these are once per week, except Daily Scan. I subscribe to Daily Scan, Genome Technology, BioInform, Clinical Sequencing News, In Sequence, and Pharmacogenomics Reporter. BioInform has a "Bioinformatics Papers of Note" column, and In Sequence has a "Sequencing papers of note" column in every issue. These are good for catching things I might have missed with the Scholar and PubMed alerts.

Twitter

99.9% of Twitter users have way too much time on their hands, but when used effectively, Twitter can be incredibly powerful for both consuming and contributing to the dialogue in your field. Twitter can be an excellent real-time source of new publications, fresh developments, and current opinion, but it can also quickly become a time sink. I can tolerate an occasional Friday afternoon humorous digression, but as soon as off-topic tweets become regular it's time to unfollow. The same is true with groups/companies - some deliver interesting and broadly applicable content (e.g. 23andMe), while others are purely a failed attempt at marketing while not offering any substantive value to their followers. A good place to start is by (shameless plug) following me or the people I follow (note: this isn't an endorsement of anyone on this list, and there are a few off-topic people I follow for my non-work interests). I can't possibly list everyone, but a few folks who tweet consistently on-topic and interesting content are: Daniel MacArthur, Jason Moore, Dan Vorhaus, 23andMe, OpenHelix, Larry Parnell, Francis Ouellette, Leonid Kruglyak, Sean Davis, Joe Pickrell, The Galaxy Project, J. Chris Pires, Nick Loman, and Andrew Severin. Also, a hashtag in Twitter (a word prefixed by #) is used to mark keywords or topics. I occasionally browse through the #bioinformatics and #Rstats hashtags.

Friday, April 6, 2012

RNA-Seq Methods & March Twitter Roundup

There were lots of interesting developments this month that didn't work their way into a full blog post. Here is an incomplete list of what I've been tweeting about over the last few weeks. But first I want to draw your attention to the latest manuscript for a new Bioconductor package for doing RNA-seq in R.

DEXSeq vs Cuffdiff. See the pre-publication manuscript from Simon Anders, Alejandro Reyes, and Wolfgang Huber: "Detecting differential usage of exons from RNA-Seq data." DEXSeq is an R package by the same guys who developed the DESeq R package and the HTSeq Python scripts. (Incidentally, both DESeq and DEXSeq are rare examples of Bioconductor vignettes that are well developed and a pleasure to read.) I often use cufflinks/cuffdiff in the bioinformatics core because many other tools and methods only allow you to interrogate differential expression at the gene level. Using cufflinks for transcriptome assembly enables you to interrogate transcript/isoform expression, differential splicing, differential coding output, differential promoter usage, etc. DEXSeq uses methodology similar to DESeq's, but can give you exon-level differential expression without going through all the assembly business that cufflinks does. In one of the supplementary tables in their pre-pub manuscript, they compare several versions of cuffdiff to DEXSeq on two datasets. Both of these datasets had biological replicates for treatment and control conditions. They compared treatment to controls, and found DEXSeq gave more significant hits than cuffdiff. Then they compared controls to other controls (which ideally should give zero hits) and found cuffdiff had way more hits. See p13, p23, tables S1 and S2.

Proper comparison treatment vs control, # significant hits:

DEXSeq: 159

Cuffdiff 1.1: 145

Cuffdiff 1.2: 69

Cuffdiff 1.3: 50

Mock comparison controls vs controls, # significant hits:

DEXSeq: 8

Cuffdiff 1.1: 314

Cuffdiff 1.2: 650

Cuffdiff 1.3: 639
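One way to read the two lists together is to divide each tool's mock-comparison hits by its proper-comparison hits. The short sketch below does that arithmetic with the counts listed above. The ratio is only an illustration for comparing the rows; it is not a statistic reported in the manuscript, and the variable names are hypothetical.

```python
# (proper comparison hits, mock comparison hits), copied from the lists above.
hits = {
    "DEXSeq": (159, 8),
    "Cuffdiff 1.1": (145, 314),
    "Cuffdiff 1.2": (69, 650),
    "Cuffdiff 1.3": (50, 639),
}

# Mock hits per proper hit: values above 1 mean the controls-vs-controls
# comparison produced more hits than the treatment-vs-control comparison.
mock_ratio = {tool: mock / proper for tool, (proper, mock) in hits.items()}

for tool, ratio in mock_ratio.items():
    print(f"{tool}: {ratio:.2f} mock hits per proper hit")
```

DEXSeq is the only tool here with a ratio below 1, and the Cuffdiff ratio grows with each version as proper hits fall and mock hits stay high.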

In the UVA Bioinformatics core we strive for reproducibility, scalability, and transparency using the most robust tools and methodology available. It gives me pause to see such alarmingly different results with each new version and each new protocol of a particular tool. What are your thoughts and experiences with using Cufflinks/Cuffdiff, DESeq/DEXSeq, or the many, many other tools for RNA-Seq (MISO, ExpressionPlot, edgeR, RSEM, easyRNASeq, etc.)? Please share in the comments.

Everything else:

Webinar from @goldenhelixinc: Learning From Our GWAS Mistakes: From experimental design to scientific method https://t.co/KkxAn18p

Sequencing technology does not eliminate biological variability http://t.co/NI3acZgn

[bump] Questions on cutoff setting of FPKM value & know genes filtering in Cuffmerge result http://t.co/iKMZ7Dsd #bioinformatics

Very cool: DNAse-Seq+RNA-seq used to show DNaseI sensitivity eQTLs are a major determinant of gene expression variation http://t.co/nPo3xHVa

Beware using UCSC GTFs in HTSeq/coverageBed for counting RNA-seq reads. "transcript_id" is repeated as "gene_id"! https://t.co/ADg1Pi6U

Google Scholar Metrics: top 20 journals in bioinformatics http://t.co/QYTf5pyT

A systematic eQTL study of cis-trans #epistasis in 210 HapMap individuals http://t.co/0YHlFQak

Identification of allele-specific alternative mRNA processing via RNA-seq http://t.co/fig9cLlH #bioinformatics @myen

prepub on arXiv + analysis tutorial/walkthrough + AWS EC2 AMI + git repo + ipython notebook = reproducible research done right http://t.co/GPNmpdJD

Altmetrics in the Wild: Using Social Media to Explore Scholarly Impact http://t.co/8xZuIHw9

NSF-NIH Interagency Initiative: Core Techniques and Technologies for Advancing Big Data Science and Engineering http://t.co/W3LUdCsG

New approach from @MarylynRitchie lab to collapsing/combining: using biological pathways rather than positional info http://t.co/ywWj0MNn

The Transcription Factor Encyclopedia http://t.co/DjJUwR10 Paper: http://t.co/b0J8PXO6

NF-kB: where did it come from and why? http://t.co/NLV1mBd0

Cloud BioLinux: pre-configured and on-demand #bioinformatics computing for the genomics community. @myen http://t.co/3kCE0ktH

SCOTUS remands AMP v Myriad (BRCA) patent case to CAFC to consider in light of prometheus decision http://t.co/7CkTa4l0

57 year experiment, Drosophila kept in dark for 1400 generations, many evolutionary changes (record longest postdoc!) http://t.co/wukq8fAf

IsoLasso: a LASSO regression approach to RNA-Seq based transcriptome assembly http://t.co/FN1sYM8f #bioinformatics

Complex disease genetics is complex. Imagine that. Hirschhorn, Visscher, & the usual consortium suspects: http://t.co/Bwopxlx6

MAnorm: a robust model for quantitative comparison of ChIP-Seq data sets http://t.co/pffuHIlO #bioinformatics

Paper about SEQanswers forum published in #Bioinformatics http://t.co/SUlQ6O8c

num-utils - like awk, grep, sort, cut, etc for numbers http://t.co/bgkB5FMV

Nat Protocols: Differential gene & transcript expression analysis of RNA-seq w/ TopHat & Cufflinks http://t.co/U1ZpSE7V #bioinformatics
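
The UCSC GTF caveat in the list above is easy to check before counting reads. Here is a minimal sketch in plain Python, assuming a standard nine-column, tab-separated GTF whose last column holds `key "value";` pairs. The function names and the example records are made up for illustration and are not taken from HTSeq or BEDTools.

```python
import re

# Matches attribute pairs such as: gene_id "ABC";
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def parse_attributes(attr_field):
    """Return a dict of key -> value from a GTF attribute column."""
    return dict(ATTR_RE.findall(attr_field))


def fraction_gene_id_equals_transcript_id(lines):
    """Fraction of records whose gene_id is identical to transcript_id.

    A value near 1.0 suggests gene_id is just a copy of transcript_id,
    so gene-level counting would really be transcript-level counting.
    """
    total = 0
    same = 0
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a complete GTF record
        attrs = parse_attributes(fields[8])
        if "gene_id" in attrs and "transcript_id" in attrs:
            total += 1
            if attrs["gene_id"] == attrs["transcript_id"]:
                same += 1
    return same / total if total else 0.0


# Hypothetical records, for illustration only.
suspicious = [
    'chr1\tsrc\texon\t100\t200\t.\t+\t.\tgene_id "TX_0001"; transcript_id "TX_0001";',
    'chr1\tsrc\texon\t300\t400\t.\t+\t.\tgene_id "TX_0002"; transcript_id "TX_0002";',
]
mixed = [
    'chr1\tsrc\texon\t100\t200\t.\t+\t.\tgene_id "GENE_A"; transcript_id "TX_0001";',
    'chr1\tsrc\texon\t300\t400\t.\t+\t.\tgene_id "TX_0002"; transcript_id "TX_0002";',
]

print(fraction_gene_id_equals_transcript_id(suspicious))
print(fraction_gene_id_equals_transcript_id(mixed))
```

To run it on a real file, pass an open file handle instead of a list. If nearly every record has matching IDs, it is probably worth getting a proper gene-to-transcript mapping before counting.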

Tuesday, October 18, 2011

My thoughts on ICHG 2011

I’m a bit exhausted from a week of excellent science at ICHG. First, let me say that Montreal is a truly remarkable city with fantastic food and a fascinating blend of architectural styles, all making the meeting a fun place to be…. Now on to the genomics – I’ll recap a few of the most exciting sessions I attended. You can find a live-stream of tweets from the meeting by searching the #ICHG2011 and #ICHG hashtags.

On Wednesday, Marylyn Ritchie (@MarylynRitchie) and Nancy Cox organized “Beyond Genome-wide association studies”. Nancy Cox presented some ideas on how to integrate multiple “intermediate” associations for SNPs, such as expression QTLs and newly discovered protein QTLs (more on pQTLs later). This approach, which she called a Functional Unit Analysis, would group signals together based on the genes they influence. Nicholas Schork presented some nice examples of the pros and cons of sequence-level annotation algorithms. Trey Ideker gave a very nice talk illustrating some of the properties of epistasis in yeast protein interaction networks. One of the more striking points he made was that epistasis tends to occur between full protein complexes rather than within elements of the complexes themselves. Marylyn Ritchie presented the ideas behind her ATHENA software for machine learning analysis of genetic data, and Manuel Mattheisen from Tim Becker’s group presented the methods in their INTERSNP software for doing large-scale interaction analysis. What was most impressive about this session was the clear attempt to incorporate underlying biological complexity into data analysis.

On Thursday, I attended the second Statistical Genetics section called “Expanding Genome-wide Association Studies”, organized by Saurabh Ghosh and Daniel Shriner. Having recently attended IGES, I feel pretty “up” on newer analysis techniques, but this session had a few talks that sparked my interest. The first three talks were related to haplotype phasing and the issues surrounding computational accuracy and speed. The basic goal of all these methods is to efficiently estimate genotypes for a common set of loci for all samples of a study using a set of reference haplotypes, usually from the HapMap or 1000 Genomes data. Despite these advances, it seems like phasing haplotypes for thousands of samples is still a massive undertaking that requires a high-performance computing cluster. There were several talks about ongoing epidemiological studies, including the Kaiser Permanente UCSF cohort. Neil Risch presented an elegant study design implementing four custom GWAS chips for the four targeted populations. Looks like the data hasn't started to flow from this yet, but when it does we’re sure to learn about lots of interesting ethnic-specific disease effects. My good friend and colleague Dana Crawford presented an in silico GWAS study of hypothyroidism. In her best NPR voice, Dana showed how electronic medical records with GWAS data in the eMERGE network can be re-used to construct entirely new studies nested within the data collected for other specific disease purposes. Her excellent postdoc, Logan Dumitrescu, presented several gene-environment interactions between lipid levels and vitamins A and E from Dana’s EAGLE study. Finally, Paul O’Reilly presented a cool new way to look at multiple phenotypes by essentially flipping a typical regression equation around, estimating coefficients that relate each phenotype in a study to a single SNP genotype as an outcome. This rather clever approach, called MultiPhen, is similar to log-linear models I’ve seen used for transmission-based analysis, and allows you to model the “interaction” among phenotypes in much the same way you would look at SNP interactions.
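
To make the "flipped" regression concrete, here is one way to write it down. This is my own sketch rather than the notation from the talk: it assumes a proportional-odds (ordinal logistic) form for the genotype model, and the symbols and sign convention are my choices. Here $g_i \in \{0, 1, 2\}$ is the allele count for individual $i$ and $y_{i1}, \dots, y_{iK}$ are the $K$ phenotypes.

```latex
% Usual GWAS: one regression per phenotype k, with genotype as the predictor
y_{ik} = \alpha_k + \beta_k \, g_i + \varepsilon_{ik}, \qquad k = 1, \dots, K

% Flipped model: genotype is the outcome and all phenotypes are joint predictors
\operatorname{logit} \Pr\left(g_i \le m \mid y_{i1}, \dots, y_{iK}\right)
  = \alpha_m - \sum_{k=1}^{K} \beta_k \, y_{ik}, \qquad m \in \{0, 1\}

% Joint test of association between the SNP and the set of phenotypes
H_0 : \beta_1 = \beta_2 = \dots = \beta_K = 0
```

Written this way, a single test of all the $\beta_k$ at once asks whether any combination of phenotypes is associated with the SNP, and adding product terms such as $y_{i1} y_{i2}$ to the sum is what would let you model "interaction" among phenotypes.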

By far the most interesting talks of the meeting (for me) were in the Genomics section on Gene Expression, organized by Tomi Pastinen and Mark Corbett. Chris Mason started the session off with a fantastic demonstration of the power of RNA-seq. Examining transcriptomes of 14 non-human primate species, they validated many of the computational predictions in the AceView gene build, and illustrated that most “exome” sequencing is probably examining less than half of all transcribed sequences. Rupali Patwardhan talked about a system for examining the impact of promoter and enhancer mutations in whole mice, essentially using mutagenesis screens to localize these regions. Ron Hause presented work on the protein QTLs that Nancy Cox alluded to earlier in the conference. Using a high-throughput form of western blots, they systematically examined levels for over 400 proteins in the Yoruba HapMap cell lines. They also illustrated that only about 50% of eQTLs identified in these lines actually alter protein levels. Stephen Montgomery spoke about the impact of rare genetic variants within a transcript on transcript levels. Essentially he showed an epistatic effect on expression, where transcripts with deleterious alleles are less likely to be expressed – an intuitive and fascinating finding, especially for those considering rare-variant analysis. Athma Pai presented a new type of QTL that influences mRNA decay rates. By measuring multiple time points using RNA-seq, she found individual-level variants that alter decay, which she calls dQTLs. Veronique Adoue looked at cis-eQTLs relative to transcription factor binding sites using ChIP, and Alfonso Buil showed how genetic variants influence gene expression networks (or correlation among gene expression) across tissue types.

I must say, despite all the awesome work presented in this session, Michael Snyder stole the show with his talk on the “Snyderome” – his own personal –omics profile collected over 21 months. His whole genome was sequenced by Complete Genomics, and processed using Rong Chen and Atul Butte’s risk-o-gram to quantify his disease risk. His profile predicted increased risk of T2D, so he began collecting glucose measures and, lo and behold, he saw a sustained spike in blood glucose levels a few days after a common cold. His interpretation was that an environmental stress knocked him into a pseudo-diabetic state, and his transcriptome and proteome results corroborated this idea. Granted, this is an N of 1, and there is still lots of work to be done before this type of analysis revolutionizes medicine, but the take-home message is salient – multiple -omics are better than one, and everyone’s manifestation of a complex disease is different. This was truly thought-provoking work, and it nicely closed an entire session devoted to understanding the intermediate impact of genetic variants to better understand disease complexity.

This is just my take on a really great meeting -- I'm sure I missed lots of excellent talks. If you saw something good, please leave a comment and share!

Getting Genetics Done