Tuesday, January 8, 2013

Stop Hosting Data and Code on your Lab Website

It's happened to all of us. You read about a new tool, database, webservice, software, or some interesting and useful data, but when you browse to http://instititution.edu/~home/professorX/lab/data, there's no trace of what you were looking for.

THE PROBLEM

This isn't an uncommon problem. See the following two articles:
Schultheiss, Sebastian J., et al. "Persistence and availability of web services in computational biology." PLoS one 6.9 (2011): e24914. 
Wren, Jonathan D. "404 not found: the stability and persistence of URLs published in MEDLINE." Bioinformatics 20.5 (2004): 668-672.
The first gives us some alarming statistics. In a survey of nearly 1000 web services published in the Nucleic Acids Web Server Issue between 2003 and 2009:
  • Only 72% were still available at the published address.
  • The authors could not test the functionality for 33% because there was no example data, and 13% no longer worked as expected.
  • The authors could only confirm positive functionality for 45%.
  • Only 274 of the 872 corresponding authors answered an email.
  • Of these 78% said a service was developed by a student or temporary researcher, and many had no plan for maintenance after the researcher had moved on to a permanent position.
The Wren et al. paper found that of 1630 URLs identified in Pubmed abstracts, only 63% were consistently available. That rate was far worse for anonymous login FTP sites (33%).

OpenHelix recently started this thread on Biostar as an obituary section for bioinformatics tools and resources that have vanished.

It's a fact that most of us academics move around a fair amount. Often we may not deem a tool we developed or data we collected and released to be worth transporting and maintaining. After some grace period, the resource disappears without a trace. 

SOFTWARE

I won't spend much time here because most readers here are probably aware of source code repositories for hosting software projects. Unless you're not releasing the source code to your software (aside: starting an open-source project is a way to stake a claim in a field, not a real risk for getting yourself scooped), I can think of no benefit for hosting your code on your lab website when there are plenty of better alternatives available, such as Sourceforge, GitHub, Google Code, and others. In addition to free project hosting, tools like these provide version control, wikis, bug trackers, mailing lists and other services to enable transparent and open development with the end result of a better product and higher visibility. For more tips on open scientific software development, see this short editorial in PLoS Comp Bio:

Prlić A, Procter JB (2012) Ten Simple Rules for the Open Development of Scientific Software. PLoS Comput Biol 8(12): e1002802. 

Casey Bergman recently analyzed where bioinformaticians are hosting their code, where he finds that the growth rate of Github is outpacing both Google Code and Sourceforge. Indeed, Github hosts more repositories than there are articles in Wikipedia, and has an excellent tutorial and interactive learning modules to help you learn how to use it. However, Bergman also points out how easy it is to delete a repository from Github and Google Code, where repositories are published by individuals who hold the keys to preservation (as opposed to Sourceforge, where it is extremely difficult to remove a project once it's been released).

DATA, FIGURES, SLIDES, WEB SERVICES, OR ANYTHING ELSE

For everything else there's Figshare. Figshare lets you host and publicly share unlimited data (or store data privately up to 1GB). The name suggests a site for sharing figures, but Figshare allows you to permanently store and share any research object. That can be figures, slides, negative results, videos, datasets, or anything else. If you're running a database server or web service, you can package up the source code on one of the repositories mentioned above, and upload to Figshare a virtual machine image of the server running it, so that the service will be available to users long after you've lost the time, interest, or money to maintain it.

Research outputs stored at Figshare are archived in the CLOCKSS geographically and geopolitically distributed network of redundant archive nodes, located at 12 major research libraries around the world. This means that content will remain available indefinitely for everyone after a "trigger event," and ensures this work will be maximally accessible and useful over time. Figshare is hosted using Amazon Web Services to ensure the highest level of security and stability for research data. 

Upon uploading your data to Figshare, your data becomes discoverable, searchable, shareable, and instantly citable with its own DOI, allowing you to instantly take credit for the products of your research. 

To show you how easy this is, I recently uploaded a list of "consensus" genes generated by Will Bush where Ensembl refers to an Entrez-gene with the same coordinates, and that Entrez-gene entry refers back to the same Ensembl gene (discussed in more detail in this previous post).

Create an account, and hit the big upload link. You'll be given a screen to drag and drop anything you'd like here (there's also a desktop uploader for larger files).



Once I dropped in the data I downloaded from Vanderbilt's website linked from the original blog post, I enter some optional metadata, a description, a link back to the original post:



I then instantly receive a citeable DOI where the data is stored permanently, regardless of Will's future at Vanderbilt:

Ensembl/Entrez hg19/GRCh37 Consensus Genes. Stephen Turner. figshare. Retrieved 21:31, Dec 19, 2012 (GMT). http://dx.doi.org/10.6084/m9.figshare.103113

There are also links to the side that allow you to export that citation directly to your reference manager of choice.

Finally, as an experiment, I also uploaded this entire blog post to Figshare, which is now citeable and permanently archived at Figshare:

Stop Hosting Data and Code on your Lab Website. Stephen Turner. figshare. Retrieved 22:51, Dec 19, 2012 (GMT). http://dx.doi.org/10.6084/m9.figshare.105125.

8 comments:

  1. This is a great post and it touches on some inherent weaknesses in bioinformatics databases and web applications: they are quickly dated, they depend on centralized support, and they are simply not suited to modern high-throughput analyses.

    The solution is to make the data and tools themselves reproducible, so that motivated people have a better starting point than a graveyard of dead websites.

    ReplyDelete
  2. "which is now citeable" ...

    Do you mean to say that things without a DOI aren't citeable?

    I think we should encourage the citation of blog posts regardless of DOI's tbh.

    ReplyDelete
    Replies
    1. You could certainly cite a lab website, whatever.blogspot.com, someone's Google Plus page, or even a Tweet. But in terms of longevity and persistence, SourceForge has been around forever, and I don't see Google Code or Github going anywhere anytime soon. Figshare has backing from big libraries, so whatever happens to Figshare, presumably the DOI will always point to the data that were uploaded.

      It would be difficult to argue that any particular academician's lab website or blog is more persistent than any of these repositories.

      Delete
  3. I agree in principle, but I'm not particularly confident that a dot-com resource like figshare is really all that much more permanent than a typical academic site. Things like figshare seem not to be long for this world. What we really need is a government funded resource like GenBank but broader and not restricted to sequence data alone.

    ReplyDelete
    Replies
    1. But as one of the links that Stephen pointed to showed, not even NCBI resources are immune to being shut down due to lack of funding.

      Delete
    2. That's why the CLOCKSS backup is important. :) CLOCKSS exists to be the preservation solution of last resort when publishers (of various sorts; literature as well as data) run into business difficulties and shut down.

      Delete
  4. I am guilty of just this. When I was an undergrad I co-wrote an algorithm with a fellow student for mapping in populations segregating for large rearrangements. My co author wrote most of the code and the program is still up on our professors website, but over time the code is poorly annotated and bugs came up. I really ought to rewrite it all from scratch but have moved on to other positions.

    ReplyDelete
  5. I just review your every single blog post..and it was awesome..will be back to read this blog soon.

    ReplyDelete

Note: Only a member of this blog may post a comment.

Creative Commons License
Getting Genetics Done by Stephen Turner is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported License.