How many reads do I need? What's my sequencing depth? These are questions I get all the time. Calculating how much sequence data you need to hit a target depth of coverage, or the inverse, the coverage depth you get from a set amount of sequencing, is easy with some basic algebra. Given one or the other, plus the genome size and the read length/configuration, you can calculate either. This app was inspired by a similar calculator written by James Hadfield, and was an opportunity for me to create my first Shiny app.
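The algebra itself fits in a few lines of R. Here's a sketch (the function names and the read-pair convention are mine, not the app's exact code):

```r
# Coverage C = read length L x number of reads N / genome size G.
# n_reads counts read pairs when paired = TRUE.
coverage <- function(n_reads, read_length, genome_size, paired = TRUE) {
  n_reads * read_length * (if (paired) 2 else 1) / genome_size
}

reads_needed <- function(target_cov, read_length, genome_size, paired = TRUE) {
  ceiling(target_cov * genome_size / (read_length * (if (paired) 2 else 1)))
}

# 30x coverage of a 3.1 Gb human genome with 2x100 bp paired-end reads:
reads_needed(30, 100, 3.1e9)  # 465 million read pairs
coverage(465e6, 100, 3.1e9)   # back out: 30x
```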
Check out the app here:
http://apps.bioconnector.virginia.edu/covcalc/
And the source code on GitHub:
https://github.com/stephenturner/covcalc
Give it your read length, tell it whether you're using single- or paired-end sequencing, and select a genome or enter your own genome size. Then select whether you want to calculate (a) the number of reads you need to hit a target depth of coverage, or (b) the coverage depth you'll hit given a set number of sequencing reads. Once you make the selection, use the slider to adjust either the desired coverage or the number of reads sequenced, and the output text below updates automatically.
Shiny App: Coverage / Read Count Calculator
Showing posts with label Web Apps.
Wednesday, June 1, 2016
Monday, December 14, 2015
GRUPO: Shiny App For Benchmarking Pubmed Publication Output
This is a guest post from VP Nagraj, a data scientist embedded within UVA’s Health Sciences Library, who runs our Data Analysis Support Hub (DASH) service.
The What
GRUPO (Gauging Research University Publication Output) is a Shiny app that provides side-by-side benchmarking of American research university publication activity.
The How
The code behind the app is written in R, and leverages the NCBI Eutils API via the rentrez package interface.
The methodology is fairly simple:
- Build the search query in Pubmed syntax based on user input parameters.
- Extract total number of articles from results.
- Output a visualization of the total counts for both selected institutions.
- Extract unique article identifiers from results.
- Output the number of article identifiers that match (i.e. “collaborations”) between the two selected institutions.
Build Query
The syntax for searching Pubmed relies on MEDLINE tags and boolean operators. You can see how to use the keywords and build these kinds of queries with the Pubmed Advanced Search Builder.
GRUPO builds its queries based on two fields in particular: “Affiliation” and “Date.” Because this search term has to be built multiple times (at least twice, to compare results for two institutions), I wrote a helper function called build_query():

# use YYYY-MM-DD or YYYY/MM/DD (e.g. 1999/02/14) format for the startDate and endDate arguments
build_query = function(institution, startDate, endDate) {
  if (grepl("-", institution)) {
    # split e.g. "University of North Carolina-Chapel Hill" into two affiliation terms
    split_name = strsplit(institution, split = "-")
    search_term = paste(split_name[[1]][1], '[Affiliation]',
                        ' AND ',
                        split_name[[1]][2], '[Affiliation]',
                        ' AND ',
                        startDate, '[PDAT] : ',
                        endDate, '[PDAT]',
                        sep = '')
  } else {
    search_term = paste(institution, '[Affiliation]',
                        ' AND ',
                        startDate, '[PDAT] : ',
                        endDate, '[PDAT]',
                        sep = '')
  }
  # hyphenated dates become Pubmed's slash format
  search_term = gsub("-", "/", search_term)
  return(search_term)
}
The if/else logic accommodates cases like “University of North Carolina-Chapel Hill”, which otherwise wouldn’t search properly in the affiliation field. This method does depend on the institution name having its specific locale separated by a "-" symbol; if you passed in “University of Colorado/Boulder” you’d be stuck.
So by using this function for the University of Virginia from January 1, 2014 to January 1, 2015 you’d get the following term:
University of Virginia[Affiliation] AND 2014/01/01[PDAT] : 2015/01/01[PDAT]
And for University of Texas-Austin over the same dates you get the following term:
University of Texas[Affiliation] AND Austin[Affiliation] AND 2014/01/01[PDAT] : 2015/01/01[PDAT]
The advantage of using this function in a Shiny app is that you can pass the institution names and dates dynamically. Users enter the input parameters for which date range and institutions to search via the widgets in the ui.R script.
For the app to work, there has to be one date picker widget and two text inputs (one for each of the two institutions) in the ui.R script. The corresponding server.R script would have a reactive element wrapped around the following:
search_term = build_query(institution = input$institution1, startDate = input$dates[1], endDate = input$dates[2])
search_term2 = build_query(institution = input$institution2, startDate = input$dates[1], endDate = input$dates[2])
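Put together, the ui.R/server.R wiring looks roughly like this. This is a sketch: the widget IDs and the simplified single-institution build_query() below are illustrative, not GRUPO's exact code:

```r
library(shiny)

# Simplified single-institution version of the build_query() helper from
# earlier in the post (no hyphenated-institution handling).
build_query <- function(institution, startDate, endDate) {
  gsub("-", "/", paste0(institution, "[Affiliation] AND ",
                        startDate, "[PDAT] : ", endDate, "[PDAT]"))
}

ui <- fluidPage(
  textInput("institution1", "Institution 1", "University of Virginia"),
  textInput("institution2", "Institution 2", "University of Michigan"),
  dateRangeInput("dates", "Date range",
                 start = "2014-01-01", end = "2015-01-01"),
  verbatimTextOutput("terms")
)

server <- function(input, output) {
  # reactive() caches both search terms; they recompute only when inputs change
  terms <- reactive({
    c(build_query(input$institution1, input$dates[1], input$dates[2]),
      build_query(input$institution2, input$dates[1], input$dates[2]))
  })
  output$terms <- renderPrint(terms())
}

# Launch with: shinyApp(ui, server)
```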
Run Query
With the query built, you can run the search in Pubmed. The entrez_search() function from the rentrez package lets us get the information we want. This function returns four elements:
- ids (unique Pubmed identifiers for each article in the result list)
- count (total number of results)
- retmax (maximum number of results that could have been returned)
- file (the actual XML record containing the values above)
The following code returns total articles for each of two different searches:
affiliation_search = entrez_search("pubmed", search_term, retmax = 99999)
affiliation_search2 = entrez_search("pubmed", search_term2, retmax = 99999)
total_articles = as.numeric(affiliation_search$count)
total_articles2 = as.numeric(affiliation_search2$count)
Plot Results
The code above lives in the server.R script and is the functional workhorse for the app. But to adequately represent the benchmarking, GRUPO needed some kind of plot.
We can combine the total articles for each institution with the institution names, which we used to build the search terms. The result is a tiny (2 x 2) data frame of “Institution” and “Total.Articles” variables. Nothing fancy. But it does the trick.
With a data frame in hand, we can load it into ggplot2 and do some very simple barplotting:
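GRUPO's actual plotting code may differ, but a minimal version with made-up counts looks something like this:

```r
library(ggplot2)

# Illustrative counts only; the real values come from the entrez_search()
# calls above.
results <- data.frame(
  Institution = c("University of Virginia", "University of Texas-Austin"),
  Total.Articles = c(2500, 4000)
)

# A basic barplot of total articles per institution
p <- ggplot(results, aes(x = Institution, y = Total.Articles)) +
  geom_bar(stat = "identity") +
  labs(x = NULL, y = "Total articles")
p
```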
Output Collaborations
Although the primary function of GRUPO is side-by-side benchmarking, it does have at least one other feature so far.
The inclusion of the “ids” object in the query result makes it possible to do something else. You can compare how many of the article identifiers match between two queries. That should represent the number of “collaborations” (i.e. how many of the publications share authorship) between individuals at the two institutions.
To get the total number of collaborations, we can do a simple calculation of length on the vector of intersections between the two search results:
collaboration_count = length(intersect(affiliation_search$ids, affiliation_search2$ids))
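With made-up PMIDs, the logic is easy to check:

```r
# Toy example: identifiers shared between the two result sets are the
# "collaborations"; length() of the intersection counts them.
ids1 <- c("25048651", "25048652", "25048653")
ids2 <- c("25048653", "25048654")
collaboration_count <- length(intersect(ids1, ids2))
collaboration_count  # 1
```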
By placing the search call inside a reactive element within Shiny, GRUPO can store the results (“count” and “ids”) rather than repeating the query for each purpose.
NB: This approach to assessing collaboration counts breaks down for articles published before October 2013, when the National Library of Medicine (NLM) began including affiliation tags for all authors.
The Next Steps
What’s next? There are a number of potential new features for GRUPO. It’s worth pointing out that a discussion of these possibilities will likely highlight some of the limitations of the app as it exists now.
For example, it would be advantageous to include other “research output” data sources. GRUPO currently only accounts for publications indexed in Pubmed. That’s a fairly one-dimensional representation of scholarly activities. Information about publications indexed elsewhere, funding awarded or altmetric indicators isn’t accounted for.
And neither is any information about the institutions. While all of them are considered to have very high research activity, one could argue that some are “apples” and some are “oranges” based on discrepancies in budgets, number of faculty members, student body size, etc. A more thorough benchmarking tool might model research universities based on additional administrative data, and restrict comparisons to “similar” institutions.
So GRUPO is still a work in progress. But it’s a solid example of a Shiny app that effectively leverages an API as its primary data source. Feel free to post a comment if you have any feedback or questions.
Grupo Shiny App: http://apps.bioconnector.virginia.edu/grupo/
Grupo Source Code: https://github.com/vpnagraj/grupo
Friday, April 10, 2015
Translational Bioinformatics Year In Review
Per tradition, Russ Altman gave his "Translational Bioinformatics: The Year in Review" presentation at the close of the AMIA Joint Summit on Translational Bioinformatics in San Francisco on March 26th. This year, papers came from six key areas (and a final Odds and Ends category). His full slide deck is available here.
I always enjoy this talk because it routinely points me to new collections of data and new software tools that are useful for a variety of analyses; as such, I thought I would highlight these resources from his talk this year.
GRASP: analysis of genotype-phenotype results from 1390 genome-wide association studies and corresponding open access database
Some of you may have accessed the Johnson and O'Donnell catalog of GWAS results published in 2009. This data set was a more extensive collection of GWAS findings than the popular NHGRI GWAS catalog, as it did not impose a genome-wide significance threshold for reported associations. The GRASP database is a similar effort, reporting numerous attributes of each study.
A zip archive of the full data set (a flat file) is available here.
Effective diagnosis of genetic disease by computational phenotype analysis of the disease associated genome
This paper tackles the enormously complex task of diagnosing rare genetic diseases using a combination of genetic variants (from a VCF file), a list of phenotype characteristics (fed from the Human Phenotype Ontology), and a few other aspects of the disease.
The online tool called PhenIX is available here.
A network based method for analysis of lncRNA disease associations and prediction of lncRNAs implicated in diseases
Here, Yang et al. examine relationships between known long non-coding RNAs and disease using graph propagation. Their underlying database, however, was generated using PubMed mining along with some manual curation.
Their lncRNA-Disease database is available here.
SNPsea: an algorithm to identify cell types, tissues and pathways affected by risk loci
This tool is a type of SNP set enrichment, designed to specifically look at functional enrichment in the context of specific tissues and cell types. The tool is a C++ executable, available for download here.
The data sources underlying the SNPsea algorithm are available here.
Human symptoms-disease network
Here Zhou et al. systematically extract a symptom-disease network by exploiting MeSH annotations. They compiled a list of 322 symptoms and 4,442 diseases from the MeSH vocabulary, and document their occurrence within PubMed. Using this disease-symptom network, the authors explore the biological underpinnings of certain symptoms by looking at shared genomic elements between diseases with similar symptoms.
The full list of ~130,000 edges in their disease-symptom network is available here.
A circadian gene expression atlas in mammals: implications for biology and medicine
This fascinating paper explores the temporal impact on gene expression traits from 12 mouse organs. By systematically collecting transcriptome data from these tissues at two hour intervals, the authors construct a temporal atlas of gene expression, and show that 43% of proteins have a circadian expression profile.
The accompanying CircaDB database is available online here.
dRiskKB: a large-scale disease-disease risk relationship knowledge base constructed from biomedical text
The authors of dRiskKB use text mining across MEDLINE citations using a controlled disease vocabulary, in this case the Human Disease Ontology, to generate pairs of diseases that co-occur with specific patterns in abstract text. These pairs are ranked with a scoring algorithm and provide a new resource for disease co-morbidity relationships.
The flat file data driving dRiskKB can be found online here.
A tissue-based map of the human proteome
In this major effort, a group of investigators have published the most detailed atlas of human protein expression to date. The transcriptome has been extensively studied across human tissues, but it remains unclear to what extent transcriptional activity reflects translation into protein. But most importantly, the data are searchable via a beautiful website.
The underlying data from the Human Protein Atlas is available here.
Tags:
Bioinformatics,
Conferences,
Data Science,
Databases,
Web Apps
Tuesday, April 8, 2014
Unsuck your writing
I recently found this little gem of a web app that analyzes the clarity of your writing. Hemingway highlights long, complex, hard-to-read sentences. It also highlights complex words where a simple one would do, and adverbs, suggesting you use a stronger verb instead. It highlights passive voice (bad!), and tells you the minimum reading grade level necessary to understand your writing.
When I pasted in some text from an abstract I submitted to ASHG years ago it showed me just how terrible and difficult to understand my scientific writing really is. My abstract text, which should have been hard-hitting and easy to understand at a glance, required a minimum grade 20 reading level. The majority of my 14 sentences were very hard to read and littered with too many adverbs, complicated words, and several uses of passive voice. (I still got a talk out of the submission, so maybe we as scientists enjoy reading tortuous verbiage...).
It looks like a desktop version is in the works, but the web app seemed to work fine, even for a 100,000-word manuscript I tried.
http://www.hemingwayapp.com/
Friday, June 7, 2013
ENCODE ChIP-Seq Significance Tool: Which TFs Regulate my Genes?
I collaborate with several investigators on gene expression projects using both microarray and RNA-seq. After I show a collaborator which genes are dysregulated in a particular condition or tissue, the most common question I get is "what are the transcription factors regulating these genes?"
This isn't the easiest question to answer. You could look at transcription factor binding site position weight matrices like those from TRANSFAC and come up with a list of all factors that potentially hit that site, then perform some kind of enrichment analysis on that. But this involves some programming, and is based solely on sequence motifs, not experimental data.
The ENCODE consortium spent over $100M and generated hundreds of ChIP-seq experiments for different transcription factors and histone modifications across many cell types (if you don't know much about ENCODE, go read the main ENCODE paper, and Sean Eddy's very fair commentary). Regardless of what you might consider "biologically functional", the ENCODE project generated a ton of data, and much of this data is publicly available. But that still doesn't help answer our question, because genes are often bound by multiple TFs, and TFs can bind many regions. We need to perform an enrichment (read: hypergeometric) test to assess an over-representation of experimentally bound transcription factors around our gene targets of interest ("around" also implies that some spatial boundary must be specified). To date, I haven't found a good tool to do this easily.
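That hypergeometric test is easy to sketch in R; the counts below are invented for illustration:

```r
# k = genes of interest bound by the factor
# n = total genes of interest
# K = genes bound by the factor genome-wide
# N = total genes considered (the background)
tf_enrichment_p <- function(k, n, K, N) {
  # P(X >= k) where X ~ Hypergeometric(K bound, N - K unbound, n drawn)
  phyper(k - 1, K, N - K, n, lower.tail = FALSE)
}

# 40 of 200 interesting genes bound, vs 1000 of 20000 genes overall
# (expectation would be ~10, so this is strongly enriched):
tf_enrichment_p(40, 200, 1000, 20000)
```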
Raymond Auerbach and Bin Chen in Atul Butte's lab recently developed a resource to address this very common need, called the ENCODE ChIP-Seq Significance Tool.
The paper: Auerbach et al. Relating Genes to Function: Identifying Enriched Transcription Factors using the ENCODE ChIP-Seq Significance Tool. Bioinformatics (2013): 10.1093/bioinformatics/btt316.
The software: ENCODE ChIP-Seq Significance Tool (http://encodeqt.stanford.edu/).
This tool takes a list of "interesting" (significant, dysregulated, etc.) genes as input, and identifies ENCODE transcription factors from this list. Head over to http://encodeqt.stanford.edu/, select the ID type you're using (Ensembl, Symbol, etc), and paste in your list of genes. You can also specify your background set (this has big implications for the significance testing using the hypergeometric distribution). Scroll down some more to tell the tool how far up and downstream you want to look from the transcription start/end site or whole gene, select an ENCODE cell line (or ALL), and hit submit.
You're then presented with a list of transcription factors that are most likely regulating your input genes (based on overrepresentation of ENCODE ChIP-seq binding sites). Your results can then be saved to CSV or PDF. You can also click on a number in the results table and get a list of genes that are regulated by a particular factor (the numbers do not appear as hyperlinks in my browser, but clicking the number still worked).
At the very bottom of the page, you can load example data that they used in the supplement of their paper, and run through the analysis presented therein. The lead author, Raymond Auerbach, even made a very informative screencast on how to use the tool:
Now, if I could only find a way to do something like this with mouse gene expression data.
Tags:
Bioinformatics,
RNA-Seq,
Software,
Web Apps
Wednesday, February 20, 2013
NetGestalt for Data Visualization in the Context of Pathways
Many of you may be familiar with WebGestalt, a wonderful web utility developed by Bing Zhang at Vanderbilt for doing basic gene-set enrichment analyses. Last year, we invited Bing to speak at our annual retreat for the Vanderbilt Graduate Program in Human Genetics, and he did not disappoint! Bing walked us through his new tool called NetGestalt.
NetGestalt provides users with the ability to overlay large-scale experimental data onto biological networks. Data are loaded using continuous and binary tracks that can contain either single or multiple lines of data (called composite tracks). Continuous tracks could be gene expression intensities from microarray data or any other quantitative measure that can be mapped to the genome. Binary tracks are usually insertion/deletion regions, or called regions like ChIP peaks. NetGestalt extends many of the features of WebGestalt, including enrichment analysis for modules within a biological network, and provides easy ways to visualize the overlay of multiple tracks with Venn diagrams.
Netgestalt provides a very nice interface for interacting with data. Extensive documentation on how to use it can be found here. Bing and his colleagues also went the extra mile to create video tutorials on how to use their web tool, and walk you through an analysis of some tumor data.
http://www.netgestalt.org/
Tags:
Pathways,
Tutorials,
Visualization,
Web Apps
Monday, January 28, 2013
Scotty, We Need More Power! Power, Sample Size, and Coverage Estimation for RNA-Seq
Two of the most common questions at the beginning of an RNA-seq experiment are "how many reads do I need?" and "how many replicates do I need?". This paper describes a web application for designing RNA-seq experiments that calculates an appropriate sample size and read depth to satisfy user-defined criteria such as cost, maximum number of reads or replicates attainable, etc. The power and sample size estimations are based on a t-test, which, the authors claim, performs no worse than the negative binomial models implemented by popular RNA-seq methods such as DESeq when there are three or more replicates. Empirical distributions are taken from either (1) pilot data that the user can upload, or (2) built-in publicly available data. The authors find that there is substantial heterogeneity between experiments (technical variation is larger than biological variation in many cases), and that power and sample size estimation will be more accurate when the user provides their own pilot data.
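For intuition about the t-test approach, base R's power.t.test() does a back-of-the-envelope version of this calculation. The effect size and SD below are hypothetical stand-ins for the quantities Scotty estimates from pilot data:

```r
# Replicates per group needed to detect a 2-fold change (1 unit on the
# log2 scale), assuming a within-group SD of 0.5 on that scale,
# at 80% power and alpha = 0.05 (two-sample, two-sided t-test).
res <- power.t.test(delta = 1, sd = 0.5, sig.level = 0.05, power = 0.8)
ceiling(res$n)  # replicates per group, rounded up
```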
My only complaint, for all the reasons expressed in my previous blog post about why you shouldn't host things like this exclusively on your lab website, is that the code to run this analysis doesn't appear to be available to save, study, modify, maintain, or archive. When lead author Michele Busby leaves Gabor Marth's lab, hopefully the app doesn't fall into the graveyard of computational biology web apps. Update 2/7/13: Michele Busby created a public Github repository for the Scotty code: https://github.com/mbusby/Scotty
tl;dr? There's a new web app that does power, sample size, and coverage calculations for RNA-seq, but it only works well if the pilot or public data you give it closely matches the actual data you'll collect.
Web app: http://euler.bc.edu/marthlab/scotty/scotty.php
Source code: https://github.com/mbusby/Scotty
Tags:
Bioinformatics,
RNA-Seq,
Sequencing,
Statistics,
Web Apps
Tuesday, January 8, 2013
Stop Hosting Data and Code on your Lab Website
It's happened to all of us. You read about a new tool, database, web service, software package, or some interesting and useful data, but when you browse to http://institution.edu/~home/professorX/lab/data, there's no trace of what you were looking for.
THE PROBLEM
This isn't an uncommon problem. See the following two articles:
Schultheiss, Sebastian J., et al. "Persistence and availability of web services in computational biology." PLoS one 6.9 (2011): e24914.
Wren, Jonathan D. "404 not found: the stability and persistence of URLs published in MEDLINE." Bioinformatics 20.5 (2004): 668-672.
The first gives us some alarming statistics. In a survey of nearly 1000 web services published in the Nucleic Acids Web Server Issue between 2003 and 2009:
- Only 72% were still available at the published address.
- The authors could not test the functionality for 33% because there was no example data, and 13% no longer worked as expected.
- The authors could only confirm positive functionality for 45%.
- Only 274 of the 872 corresponding authors answered an email.
- Of these, 78% said a service was developed by a student or temporary researcher, and many had no plan for maintenance after the researcher moved on to a permanent position.
The Wren et al. paper found that of 1630 URLs identified in Pubmed abstracts, only 63% were consistently available. That rate was far worse for anonymous login FTP sites (33%).
OpenHelix recently started this thread on Biostar as an obituary section for bioinformatics tools and resources that have vanished.
It's a fact that most of us academics move around a fair amount. Often we may not deem a tool we developed or data we collected and released to be worth transporting and maintaining. After some grace period, the resource disappears without a trace.
SOFTWARE
I won't spend much time here because most readers here are probably aware of source code repositories for hosting software projects. Unless you're not releasing the source code to your software (aside: starting an open-source project is a way to stake a claim in a field, not a real risk for getting yourself scooped), I can think of no benefit for hosting your code on your lab website when there are plenty of better alternatives available, such as Sourceforge, GitHub, Google Code, and others. In addition to free project hosting, tools like these provide version control, wikis, bug trackers, mailing lists and other services to enable transparent and open development with the end result of a better product and higher visibility. For more tips on open scientific software development, see this short editorial in PLoS Comp Bio:
Prlić A, Procter JB (2012) Ten Simple Rules for the Open Development of Scientific Software. PLoS Comput Biol 8(12): e1002802.
Casey Bergman recently analyzed where bioinformaticians are hosting their code, where he finds that the growth rate of Github is outpacing both Google Code and Sourceforge. Indeed, Github hosts more repositories than there are articles in Wikipedia, and has an excellent tutorial and interactive learning modules to help you learn how to use it. However, Bergman also points out how easy it is to delete a repository from Github and Google Code, where repositories are published by individuals who hold the keys to preservation (as opposed to Sourceforge, where it is extremely difficult to remove a project once it's been released).
DATA, FIGURES, SLIDES, WEB SERVICES, OR ANYTHING ELSE
For everything else there's Figshare. Figshare lets you host and publicly share unlimited data (or store data privately up to 1GB). The name suggests a site for sharing figures, but Figshare allows you to permanently store and share any research object. That can be figures, slides, negative results, videos, datasets, or anything else. If you're running a database server or web service, you can package up the source code on one of the repositories mentioned above, and upload to Figshare a virtual machine image of the server running it, so that the service will be available to users long after you've lost the time, interest, or money to maintain it.
Research outputs stored at Figshare are archived in the CLOCKSS geographically and geopolitically distributed network of redundant archive nodes, located at 12 major research libraries around the world. This means that content will remain available indefinitely for everyone after a "trigger event," and ensures this work will be maximally accessible and useful over time. Figshare is hosted using Amazon Web Services to ensure the highest level of security and stability for research data.
Upon uploading your data to Figshare, your data becomes discoverable, searchable, shareable, and instantly citable with its own DOI, allowing you to instantly take credit for the products of your research.
To show you how easy this is, I recently uploaded a list of "consensus" genes generated by Will Bush where Ensembl refers to an Entrez-gene with the same coordinates, and that Entrez-gene entry refers back to the same Ensembl gene (discussed in more detail in this previous post).
Create an account, and hit the big upload link. You'll be given a screen to drag and drop anything you'd like here (there's also a desktop uploader for larger files).
Once I dropped in the data I downloaded from Vanderbilt's website (linked from the original blog post), I entered some optional metadata: a description and a link back to the original post:
I then instantly receive a citeable DOI where the data is stored permanently, regardless of Will's future at Vanderbilt:
Ensembl/Entrez hg19/GRCh37 Consensus Genes. Stephen Turner. figshare. Retrieved 21:31, Dec 19, 2012 (GMT). http://dx.doi.org/10.6084/m9.figshare.103113
There are also links to the side that allow you to export that citation directly to your reference manager of choice.
Finally, as an experiment, I also uploaded this entire blog post to Figshare, which is now citeable and permanently archived at Figshare:
Stop Hosting Data and Code on your Lab Website. Stephen Turner. figshare. Retrieved 22:51, Dec 19, 2012 (GMT). http://dx.doi.org/10.6084/m9.figshare.105125.
Wednesday, August 1, 2012
Cscan: Finding Gene Expression Regulators with ENCODE ChIP-Seq Data
Recently published in Nucleic Acids Research:
F. Zambelli, G. M. Prazzoli, G. Pesole, G. Pavesi, Cscan: finding common regulators of a set of genes by using a collection of genome-wide ChIP-seq datasets., Nucleic acids research 40, W510–5 (2012).
[Image: Cscan web interface screenshot]
This paper presents a methodology and software implementation that allows users to discover a set of transcription factors or epigenetic modifications that regulate a set of genes of interest. A wealth of data about transcription factor binding exists in the public domain, and this is a good example of a group utilizing those resources to develop tools that are of use to the broader computational biology community.
High-throughput gene expression experiments like microarrays and RNA-seq often result in a list of differentially regulated or co-expressed genes. A common follow-up question asks which transcription factors may regulate those genes of interest. The ENCODE project has completed ChIP-seq experiments for many transcription factors and epigenetic modifications across a number of different cell lines in both human and model organisms. These researchers crossed this publicly available data on enriched regions from ChIP-seq experiments with the genomic coordinates of gene annotations to create a table of gene annotations (rows) by ChIP-seq experiments (columns), with a presence/absence call for a peak in each cell. Given a set of genes of interest (e.g. differentially regulated genes from an RNA-seq experiment), the method evaluates the over-/under-representation of target sites for the DNA-binding protein in each ChIP experiment using Fisher's exact test. Methods based on motif enrichment (using position weight matrices derived from databases like TRANSFAC or JASPAR) would miss DNA-binding factors like the Retinoblastoma protein (RB), which lacks a DNA-binding domain and is recruited to promoters by other transcription factors. In addition to overcoming this limitation, the method presented here also has the advantage of considering tissue specificity and chromatin accessibility.
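The enrichment test itself is easy to sketch. Here's a minimal pure-Python version of a one-sided Fisher's exact (hypergeometric) test on the 2x2 table of in-set/out-of-set genes versus target/non-target genes; the counts in the example are hypothetical, not taken from the paper:

```python
from math import comb

def fisher_enrichment_p(k, n, K, N):
    """One-sided Fisher's exact (hypergeometric) p-value: the probability of
    seeing >= k ChIP targets among n genes of interest, when K of the N
    annotated genes are targets of the factor in that experiment."""
    return sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, min(n, K) + 1)) / comb(N, n)

# Hypothetical numbers: 40 of 100 input genes are targets of a factor that
# binds near 2,000 of 20,000 annotated genes overall -- strong enrichment,
# since only ~10 would be expected by chance.
p = fisher_enrichment_p(k=40, n=100, K=2000, N=20000)
```

In practice you'd run one such test per ChIP experiment and correct for multiple testing across the collection.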
The web interface is free and doesn't require registration: http://www.beaconlab.it/cscan
Wednesday, June 27, 2012
Browsing dbGAP Results
Thanks to the excellent work of Lucia Hindorff and colleagues at NHGRI, the GWAS catalog provides a great reference for the cumulative results of GWAS for various phenotypes. Anyone familiar with GWAS also likely knows about dbGaP – the NCBI repository for genotype-phenotype relationships – and the wealth of data it contains. While dbGaP is often thought of as a way to get access to existing genotype data, analysis results are often deposited into dbGaP as well. Individual-level data (like genotypes) are generally considered "controlled access", requiring special permission to retrieve or use. Summary-level data, such as association p-values, are a bit more accessible. There are two tools available from the dbGaP website: the Association Results Browser and the Phenotype-Genotype Integrator (PheGenI). These tools provide a search interface for examining previous GWAS associations.
The Association Results Browser provides a simple table listing of associations, searchable by SNP, gene, or phenotype. It contains the information from the NHGRI GWAS catalog, as well as additional associations from dbGaP-deposited studies. I've shown an example below for multiple sclerosis. You can restrict the search to the dbGaP-specific results by changing the "Source" selection. If you are looking for the impact of a SNP, this is a nice supplement to the catalog. Clicking on a p-value brings up the GaP browser, which provides a more graphical (but perhaps less useful) view of the data.
The PheGenI tool provides similar search functionality, but organizes results into broad phenotype categories rather than specific phenotype associations. Essentially, phenotype descriptions are linked to MeSH terms to provide categories such as "Chemicals and Drugs" or "Hemic and Lymphatic Diseases". PheGenI seems most useful if searching from the phenotype perspective, while the association browser seems better for SNP or gene searches. All these tools are under active development, and I look forward to seeing their future versions.
Tags:
Bioinformatics,
Databases,
dbGaP,
GWAS,
Web Apps
Friday, May 14, 2010
LocusZoom v1.1 - Create Regional Plots of GWAS Results
Previously mentioned LocusZoom has undergone some major updates over the last few months. Many of the bugs mentioned in my previous post are now fixed, and now there's a good bit of documentation available. There are also a few new features, including the ability to add an extra column to your results file to change the plotting symbol to reflect your own custom annotation (i.e. whether the SNP was imputed or genotyped, or the SNP's function).
This software is seriously useful for plotting regional association results with the level of detail and annotation that you can't achieve using regular manhattan plots. Go give it a try, now. Also keep an eye out for a downloadable version that should be available in the next week or so.
LocusZoom: Create Regional Plots of GWAS Results
And as a suggestion to the developers: how about a radio button on the web app that would accept files in PLINK's .assoc/.qassoc format, so PLINK users wouldn't have to go through the awkward text-wrangling to get their results files into an acceptable format (instructions specific for LocusZoom)? Making "P" and "SNP" the default column names for these values would help too.
Tags:
GWAS,
Visualization,
Web Apps
Monday, February 15, 2010
Keep your lab notebook in a private blog
In my previous post about Q10, a commenter suggested software called "The Journal" by davidRM for productively keeping track of experiments, datasets, projects, etc. I've never tried that software, but about a year ago I ditched my pen-and-paper lab notebook for an electronic lab notebook in the form of a blog using Blogger, the same platform I use to write Getting Genetics Done.
The idea of using a blogging platform for your lab notebook is pretty simple and the advantages are numerous. All your entries are automatically dated and appear chronologically. You can view your notebook or make new entries from anywhere in the world. You can copy and paste useful code snippets, upload images and documents, and take advantage of the tags and search features present on most blogging platforms. I keep my lab notebook private - search engines can't index it, and I have to give someone permission before they can see it. Once you've allowed someone to view your blog/notebook, you can also allow them to comment on posts. This is a great way for my mentor to keep track of results and make suggestions. And I can't count how many times I've gone back to an older notebook entry to view the code that I used to do something quickly in R or PLINK but didn't bother to save anywhere else.
Of course Blogger isn't the only platform that can do this, although it's free and one of the easiest to get set up, especially if you already have a Google account. Wordpress is very similar, and has tons of themes. You can find lots of comparisons between the two online. If you have your own web host, you can install the open-source version of Wordpress on your own host, for added security and access control (see this explanation of the differences between Wordpress.com and Wordpress.org).
Another note-taking platform I've been using recently is Evernote. Lifehacker blog has a great overview of Evernote's features. It runs as a desktop application, and it syncs across all your computers, and online in a web interface also. The free account lets you sync 40MB a month, which is roughly the equivalent of 20,000 typed notes. This quota resets every month, and you start fresh at 40MB. You can also attach PDFs to a note, and link notes to URLs. Every note is full-text searchable.
And then of course there's the non-free option: Microsoft OneNote. It will set you back a few bucks, and while I've never used it myself, it reportedly integrates very nicely with many features on your Windows machine.
Tags:
Productivity,
Software,
Web Apps,
Writing
Wednesday, February 10, 2010
LocusZoom: Plot regional association results from GWAS
Update Friday, May 14, 2010: See this newer post on LocusZoom.
If you caught Cristen Willer's seminar here a few weeks ago you saw several beautiful figures in the style of a manhattan plot, but zoomed in around a region of interest, with several other useful information overlays.
Click to take a look at this plot below, showing the APOE region for an Alzheimer's Disease GWAS:
It's a simple plot of the -log10(p-values) for SNPs in a given region, but it also shows:
1. LD information (based on HapMap) shown by color-coded points (not much LD here).
2. Recombination rates (the blue line running through the plot). Peaks are hotspots.
3. Spatial orientation of the SNPs you plotted (running across the top).
4. Genes! The overlay along the bottom shows UCSC genes in the region.
You can very easily take a PLINK output file (or any other format) and make an image like this for your data for any SNP, gene, or region of interest using a tool Cristen and others at Michigan developed called LocusZoom. LocusZoom is written in R with a Python wrapper that works from an easy to use web interface.
All the program needs is a list of SNP names and their associated P-values. If you're using PLINK, your *.assoc or *.qassoc files have this information, but first you'll have to run a quick command to format them. Run this command I discussed in a previous post to convert your PLINK output into a comma delimited CSV file (PLINK's default is irregular whitespace delimited):
cat plink.assoc | sed -r 's/^\s+//g' | sed -r 's/\s+/,/g' > plink.assoc.csv
Next, you'll want to compress this file so that it doesn't take forever to upload.
gzip plink.assoc.csv
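If you'd rather do the conversion and compression in Python than sed and gzip, a minimal equivalent might look like this (assuming PLINK's default plink.assoc output filename):

```python
import gzip

def plink_to_csv_lines(text):
    """Collapse PLINK's irregular whitespace-delimited rows into CSV rows,
    dropping blank lines. split() with no argument handles both the leading
    spaces and the variable-width column padding."""
    return [",".join(line.split()) for line in text.splitlines() if line.split()]

def convert(infile="plink.assoc", outfile="plink.assoc.csv.gz"):
    """Read a PLINK .assoc file and write a gzipped CSV ready for upload."""
    with open(infile) as fh, gzip.open(outfile, "wt") as out:
        out.write("\n".join(plink_to_csv_lines(fh.read())) + "\n")
```

Note this naive split would break any field containing embedded spaces, but PLINK output has none.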
Now, upload your new file (plink.assoc.csv.gz) on the LocusZoom website. Tell it that your p-value column is named "P" and your marker column is named "SNP" (or whatever they're called if you're not using PLINK). Change the delimiter type to "comma", then put in a region of interest. I chose APOE, but you could also use a SNP name (include the "rs" before the number). Now hit "Plot your Data," and it should take about a minute.
There are some other options below, but I've had bad luck using any of them. For instance, I can never get it to output a PNG properly - only PDF works the last time I tried it. I also could not successfully make a plot if I turn off the recombination rate overlay. I know this is a very early version, but hopefully they'll clean up some of the code and document some of its features very soon. I could see this being a very useful tool, especially once it's available for download for local use. (Update: some of these bugs have been fixed. See this newer post on LocusZoom).
LocusZoom: Plot regional association results from GWAS
Tags:
GWAS,
PLINK,
R,
Visualization,
Web Apps
Monday, February 1, 2010
GRAIL: Gene Relationships Across Implicated Loci
If you caught Soumya Raychaudhuri's seminar last week you heard a lot about the tool he developed at the Broad called GRAIL - Gene Relationships Across Implicated Loci. You've got GWAS results and now you want to prioritize SNPs to follow up in replication or functional studies. Of course you're going to take your stellar hits at p < 10⁻⁸, but what about that fuzzy region between 10⁻⁴ and 10⁻⁸? Here's where a tool like GRAIL may come in handy.
In essence, you feed GRAIL a list of SNPs and it maps these SNPs to gene regions using LD. It then uses a simple text-mining algorithm to ascertain the degree of connectivity among the associated genes by looking at the similarity of vectors of words pulled from PubMed abstracts which mention your gene of interest. In their most recent paper they took a list of 370 GWAS hits, and narrowed this down to a list of 22 candidate SNPs to follow up. And it turns out these SNPs replicated in an independent set at a much higher frequency than random SNPs from the subset of 370. In his talk, Soumya offered convincing evidence that using the results from GRAIL you have a much better shot at replicating associations than if you just looked at the p-value rankings alone. After the talk he did mention that this approach has had mixed success depending on the phenotype. Here's the original GRAIL paper, and the OpenHelix blog has a nice 5-minute video screencast demonstration of GRAIL where they take a few SNPs from the GWAS catalog and run them through GRAIL.
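The core idea (scoring how related two genes are by comparing the words in abstracts that mention them) can be illustrated with a toy cosine similarity between word-count vectors. This is only a sketch of the general approach, not GRAIL's actual scoring, and the word lists below are made up:

```python
from collections import Counter
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two word-count vectors (Counters)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Toy word counts from abstracts mentioning three hypothetical genes:
gene1 = Counter("lipid transport cholesterol lipid".split())
gene2 = Counter("cholesterol metabolism lipid".split())
gene3 = Counter("synapse neuron signaling".split())

# Genes described with similar vocabulary score higher:
assert cosine(gene1, gene2) > cosine(gene1, gene3)
```

Genes at candidate loci whose literature profiles cluster together get prioritized over genes with no textual connection to the rest.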
GRAIL is a free web application (beta) available at the Broad's website below.
GRAIL: Gene Relationships Across Implicated Loci
Tags:
Bioinformatics,
GWAS,
Pathways,
Software,
Web Apps
Wednesday, December 16, 2009
Recent improvements to Pubget
If you've never heard of it before, check out my previous coverage on Pubget. It's like PubMed, but you get the PDFs right away. Pubget has recently implemented a number of improvements.
1. Citation matching. Pubget's citation matcher seems to work better than PubMed's most of the time. Try going to Pubget and pasting any of these random citations into the search bar:
J Biol Chem 277: 30738-30745
Nucleic Acids Res 2004;32:4812-20.
Evol. Biol. 7, 214 (2007).
2. The PaperPlane bookmarklet. Go here and drag the link to your bookmark toolbar. Now, if you're searching from PubMed, click the bookmarklet for one-click access to the PDF.
3. If you have a long list of PMIDs, separate them with commas and you can paste them directly into the search bar.
Pubget (Vanderbilt institutional link)
Pubget (If you're anywhere else)
Wednesday, September 23, 2009
JBrowse: a JavaScript Based Genome Browser
Genome browsers are nothing new, but JBrowse is a new JavaScript-based genome browser that uses information from the UCSC genome browser and has the look and feel of Google Maps. It's extremely easy to zoom in and out and scroll around because all the "work" is being done by your computer rather than some server farm thousands of miles away. OpenHelix is calling it a game-changer, and they have a nice video demonstration showing off some of JBrowse's features. Click the Drosophila or Homo sapiens genome and give JBrowse a spin for yourself!
The JBrowse genome browser
Tags:
Visualization,
Web Apps
Wednesday, August 5, 2009
Pubget = Pubmed on Steroids
The one thing I've found is that they don't index things as quickly as PubMed, so you might have a hard time finding Advance Online Publications using Pubget.
Thursday, May 21, 2009
Has anyone ever used Galaxy?
Has anyone ever used Galaxy? I saw their presentation at last year's ASHG. Seems like a great way to collaborate on and keep a record of analyses in an easy web-GUI interface without having to download any software. If you've used it for genetic analysis and you'd like to write a bit about it here (whether you're a Vanderbilt person or not), email me or post in the comments.
Tags:
Announcements,
Bioinformatics,
Statistics,
Web Apps
Tuesday, May 19, 2009
Wolfram Alpha as a bioinformatics tool
Just released last week by the makers of Mathematica, Wolfram Alpha is kind of like a search engine. It calls itself a "computational knowledge engine," with the lofty goal of being a "long-term project to make all systematic knowledge immediately computable by anyone."
From their homepage you can link to a page showing examples of how to use it, but I was interested in seeing how much biology Wolfram Alpha knows, and I've got to say I'm impressed with the results.
(Note: their servers are pretty busy I guess, so if the links don't work the first time, or the search times out, try reloading.)
Check out the results I got when I searched for APOE. It correctly interpreted the fact that I wanted information about the human gene, and accordingly gave me information about the gene and its location, along with a chromosome ideogram, a reference sequence, splice structures, and more.
I was also impressed to see what happened when I entered a random string of ACGT's. It correctly interpreted my query as a nucleotide sequence, told me the amino acid sequence it would make, correctly guessed how often this sequence would be found in the genome if bases occur randomly, and gave me gene names, positions, and ideograms of the places where this sequence is actually found in the human genome.
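That "how often would this sequence occur by chance" figure is simple to reproduce: under a uniform-base model, a specific k-mer is expected about N/4^k times per strand in N bases (ignoring overlaps and edge effects). A sketch, assuming a ~3.2 Gb human genome:

```python
def expected_occurrences(k, genome_size=3.2e9, both_strands=True):
    """Expected count of one specific k-mer in a random genome where each
    base is equally likely, optionally counting both strands."""
    n = genome_size / 4 ** k  # chance 1/4^k per position, ~genome_size positions
    return 2 * n if both_strands else n

# A specific 16-mer is expected ~1.5 times across both strands of 3.2 Gb:
print(round(expected_occurrences(16), 2))  # -> 1.49
```

So a random query around 16-17 bases long sits right at the threshold where real genomic hits start to look non-random.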
Finally, I tried searching for a SNP that I have an interest in.
For being only days old, and for not being specifically developed as a bioinformatics tool, it's pretty impressive what it can do already. It should be interesting to see what else they come up with.
Tags:
Bioinformatics,
Search,
Web Apps
Friday, April 24, 2009
Genes2Networks
I saw a demonstration of this tool at the workshop on network analysis I announced last week. Genes2Networks draws on a large background network built from several experimentally verified mammalian protein-interaction databases. It takes the list of genes you provide as seeds and identifies all interacting genes that fall on paths through the background network between them. It then calculates a z-score for each intermediate node based on the number of links it has to the seed nodes and the number of total links it has in the background network, and color-codes the significant links.
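As a rough illustration of that kind of connectivity statistic (not necessarily the exact formula used by Genes2Networks), one can compare an intermediate node's observed links to seed genes against a binomial expectation based on its total degree:

```python
from math import sqrt

def connectivity_z(seed_links, total_links, n_seeds, n_background):
    """Binomial z-score: does this node touch more seed genes than expected
    given its total degree? p is the chance any single link hits a seed gene
    if links were distributed at random across the background network."""
    p = n_seeds / n_background
    expected = total_links * p
    sd = sqrt(total_links * p * (1 - p))
    return (seed_links - expected) / sd

# Hypothetical: a node with 50 links, 10 of them to a 20-gene seed list,
# in a 5,000-gene background network. Far more seed links than expected.
z = connectivity_z(seed_links=10, total_links=50, n_seeds=20, n_background=5000)
```

High-z intermediates are the interesting ones: hubs alone don't score well, because a large total degree raises the expectation too.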
It's free, and it took about 30 seconds to paste in and analyze a list of a few dozen genes I submitted. It looks like a great way to identify potential interaction partners for some genes you've already found. You can also export the networks you find to a text file that can be loaded with other software developed by this lab for more extensive statistical network analysis and visualization.
Genes2Networks
Publication in BMC Bioinformatics
Avi Ma'ayan's Lab Website
UPDATE 2009-04-27: I have slides from three one-hour lectures that Avi gave on network analysis following his seminar. Email me if you want them.
Tags:
Bioinformatics,
Web Apps