If you're doing any kind of scientific computing and not using version control, you're doing it wrong. The git version control system and GitHub, a web-based service for hosting and collaborating on git-controlled projects, have both become wildly popular over the last few years. Late last year GitHub announced that the 10-millionth repository had been created, and Wired recently ran an article on how git and GitHub were being used to version control everything from wedding invitations to Gregorian chants to legal documents. Version control and GitHub-enabled collaboration aren't just for software development anymore.
We recently held our second Software Carpentry bootcamp at UVA, where I taught the UNIX shell and version control with git. Software Carpentry keeps all of its bootcamp lesson material on GitHub, where anyone is free to use it and encouraged to contribute new material back. The typical way to contribute to an open-source project hosted on GitHub is the fork and pull model. That is, if I wanted to contribute to the "bc" repository developed by user "swcarpentry" (swcarpentry/bc), I would first fork the project, which creates a copy for myself that I can work on. I would then make changes and additions to my fork and submit a pull request to the maintainers of the original "bc" repository, requesting that they review and merge in my changes.
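If you've never done this from the command line, here's roughly what the workflow looks like. This is just a sketch -- the username, branch name, and file path below are placeholders, not the actual commands I ran for my lesson:

# Clone your fork of the repository (replace YOURUSERNAME with your GitHub username)
git clone https://github.com/YOURUSERNAME/bc.git
cd bc

# Add the original repository as a remote so you can pull in upstream changes later
git remote add upstream https://github.com/swcarpentry/bc.git

# Do your work on a topic branch rather than on the default branch
git checkout -b my-new-lesson
# ...edit files...
git add lessons/my-new-lesson.md
git commit -m "Add a new lesson"

# Push the branch to your fork, then open a pull request on github.com
git push origin my-new-lesson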
GitHub makes this process extremely simple and effective, and preserves the entire history of changes that were submitted and the conversation that resulted from the pull request. I recently contributed a lesson on visualization with ggplot2 to the Software Carpentry bootcamp material repository. Take a look at this pull request and all the conversation that went with it here:
https://github.com/swcarpentry/bc/pull/395
On March 27 I forked swcarpentry/bc and started making a bunch of changes and additions, creating a new ggplot2 lesson. After submitting the pull request, I instantly received tons of helpful feedback from others reviewing my lesson material. This development-review cycle went back and forth a few times, and finally, when the Software Carpentry team was satisfied with all the changes to the lesson material, those changes were merged into the official bootcamp repository (the rendered lesson can be viewed here).
Git and GitHub are excellent tools for managing the conflicts that inevitably arise when merging work done asynchronously by teams of contributors, whether small or very large. As of this writing, the swcarpentry/bc repository has been forked 178 times, with pull requests merged from 71 different contributors, for a total of 1,464 committed changes and counting. Next time you try reconciling "tracked changes" and comments from 71 contributors in a M$ Word or PowerPoint file, please let me know how that goes.
In the meantime, if you're collaboratively developing code, lesson material, chord progressions, song lyrics, or anything else that involves text, consider using something like git and GitHub to make your life a bit easier. There are tons of resources for learning git. I'd start with Software Carpentry's material (or better yet, find an upcoming bootcamp near you). GitHub also offers online courses and in-person training classes, both free and for-fee (cheap). You can also learn git right now by trying git commands in the browser at https://try.github.io.
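And if you want to see how little it takes to get started, here's about the smallest possible example -- a brand new repository with a single tracked file (the file name and commit message are made up, of course):

# Create a new repository, add a file, and record the first snapshot
git init myproject
cd myproject
echo "Notes on my analysis" > README.md
git add README.md
git commit -m "First commit"
git log            # shows the history you just created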
Wednesday, March 12, 2014
Software Carpentry at UVA, Redux
Software Carpentry is an international collaboration, backed by Mozilla and the Sloan Foundation, comprising a team of volunteers who teach computational competence and basic programming skills to scientists. In addition to a suite of online lessons, Software Carpentry also runs two-day on-site bootcamps to teach researchers skills such as using the Unix shell, programming in Python or R, using Git and GitHub for version control, managing data with SQL, and general programming best practices.
It was just over a year ago when I organized UVA's first bootcamp. Last year we reached our 50-person registration limit and had nearly 100 people on the wait list in less than two days. With support from the Center for Public Health Genomics, the Health Sciences Library, and the Library's Research Data Services, we were able to host another two-day bootcamp earlier this week (we maxed out our registration limit this year as well). A few months ago I started Software Carpentry's training program, which teaches scientists how to teach other scientists how to program. It was my pleasure to be an instructor at this year's bootcamp along with Erik Bray and Mike Hansen.
Erik kicked off day one with a short introduction to what Software Carpentry is all about, as well as setting the stage for the rest of the bootcamp -- as more fields of research become increasingly data rich, computational skills become ever more critical.
I started the morning's lessons on using the Unix shell to get more stuff done in less time. Although there were still a few setup hiccups, things went a lot smoother this year because we provided a virtual machine with all of the necessary tools pre-installed.
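To give a flavor of what the shell lesson covered, here are a few toy examples of chaining small tools together (the file names here are placeholders, not the actual lesson data):

# How many lines are in each CSV file in this directory?
wc -l *.csv

# Which files mention a term of interest, ignoring case?
grep -il "influenza" *.txt

# The ten most common values in the third column of a tab-delimited file
cut -f3 results.tsv | sort | uniq -c | sort -rn | head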
We spent the rest of the morning and early afternoon going over version control with Git and collaboration using GitHub. I started out with the very basics -- the hows and whys of using version control, staging, committing, branching, merging, and conflict resolution. After lunch Erik and I did a live demonstration of two different modes of collaboration using GitHub. In the first, I pushed to a repo on GitHub and gave Erik full permissions to access and push to this repo. Here, we pushed and pulled to and from the same repo, and demonstrated what to do in case of a merge conflict. In the second demonstration we used the fork and pull model of collaboration: I created a new repo, Erik forked this, made some edits (using GitHub's web-based editor for simplicity), and submitted a pull request. After the demo, we had participants go through the same exercise -- creating their own repos with feedback about the course so far, and submitting pull requests to each other.
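For anyone who wasn't there, the shared-repo demonstration boiled down to something like the following (repository and file names are illustrative; this isn't a transcript of the live demo):

# Both collaborators have cloned the same repository and have push access
git pull origin master              # start from the latest version
# ...edit README.md...
git add README.md
git commit -m "Add schedule for day two"
git push origin master              # rejected if your partner pushed first

# If the push is rejected: pull, fix any conflict git marks with <<<<<<< / >>>>>>>,
# then commit the resolution and push again
git pull origin master
# ...edit README.md to keep the lines you want...
git add README.md
git commit -m "Resolve merge conflict in README"
git push origin master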
With the remaining hours in the afternoon, Erik introduced Python using the IPython notebook. Since most people were using the virtual machine we provided (or had already installed Anaconda), we avoided most of the Python/IPython/NumPy version and setup issues that might otherwise have plagued the entire bootcamp (most participants were using Windows laptops). By the end of the introductory Python session, participants were using Python and NumPy to simulate logistic population growth with intermittent catastrophic population crashes, and using matplotlib to visualize the results.
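The notebook itself isn't reproduced here, but the exercise looked roughly like this sketch (parameter values and the random seed are my own choices for illustration):

import numpy as np
import matplotlib.pyplot as plt

r, K = 0.3, 1000          # growth rate and carrying capacity
p_crash = 0.05            # probability of a catastrophe in any given generation
n_gen = 200

pop = np.empty(n_gen)
pop[0] = 10
rng = np.random.default_rng(42)

for t in range(1, n_gen):
    # discrete logistic growth
    pop[t] = pop[t-1] + r * pop[t-1] * (1 - pop[t-1] / K)
    # an occasional catastrophic crash wipes out 90% of the population
    if rng.random() < p_crash:
        pop[t] *= 0.1

plt.plot(pop)
plt.xlabel("Generation")
plt.ylabel("Population size")
plt.show()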
Next, Mike introduced the pandas data analysis library for Python, also using an IPython notebook for teaching. In this session, participants used pandas to import and analyze a year's worth of weather data from Weather Underground. Participants imported a CSV file, cleaned up the data, parsed dates written as text into Python datetime objects, used the apply function to perform bulk operations on the data, learned how to handle missing values, and synthesized many of the individual components taught in this and the previous session to partition out and summarize subsets of the data matching particular criteria of interest (e.g., "how many days did it rain in November when the minimum temperature ranged from 20 to 32 degrees?").
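Something in the spirit of that last question takes only a few lines of pandas (the file and column names below are made up; the actual notebook used a Weather Underground CSV export):

import pandas as pd

# Read the data, parsing the date column into datetime objects
weather = pd.read_csv("weather.csv", parse_dates=["date"])

# Fill in missing precipitation values and derive a month column
weather["precip_in"] = weather["precip_in"].fillna(0)
weather["month"] = weather["date"].dt.month

# Use apply for a bulk operation on one column (convert temperatures to Celsius)
weather["min_temp_c"] = weather["min_temp_f"].apply(lambda f: (f - 32) * 5 / 9)

# How many days did it rain in November when the minimum temperature
# ranged from 20 to 32 degrees?
rainy_nov = weather[(weather["month"] == 11) &
                    weather["min_temp_f"].between(20, 32) &
                    (weather["precip_in"] > 0)]
print(len(rainy_nov))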
Erik wrapped up the bootcamp with a session on testing code. Erik introduced the concept of testing by demonstrating the behavior of a function without revealing the source code behind it. Participants were asked to figure out what the function did by writing various tests with different input. Finally, participants worked in pairs to implement the function such that all the previously written tests would not raise any assertion errors.
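The exact mystery function from the session isn't the point; the pattern is. A stripped-down version of the exercise might look like this (the function and tests here are my own illustration):

# Tests written first, against the observed behavior of the mystery function
def test_mean():
    assert mean([1, 2, 3]) == 2
    assert mean([4]) == 4
    assert mean([-1, 1]) == 0

# Then implement the function so the tests pass
def mean(values):
    """Return the arithmetic mean of a non-empty sequence of numbers."""
    if len(values) == 0:
        raise ValueError("cannot take the mean of an empty sequence")
    return sum(values) / len(values)

test_mean()   # raises AssertionError if the implementation is wrong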
Overall, our second Software Carpentry bootcamp was a qualitative success. The fact that we maxed out registration and filled a wait list within hours two years in a row demonstrates the overwhelming need for this kind of curriculum for scientists. Science across nearly every discipline is becoming ever more quantitative; researchers are realizing that to be successful, not only do you need to be a good scientist, a great writer, an eloquent speaker, a skilled graphic designer, a clever marketer, an efficient project manager, etc., but you also need to know some programming and statistics. This week represented the largest Software Carpentry event ever, with simultaneous bootcamps at the University of Virginia, Purdue, New York University, UC Berkeley, and the University of Washington. I can only imagine this trend will continue for the foreseeable future.
Monday, October 21, 2013
Useful Unix/Linux One-Liners for Bioinformatics
Much of the work that bioinformaticians do is munging and wrangling massive amounts of text. While there are some "standardized" file formats (FASTQ, SAM, VCF, etc.) and some tools for manipulating them (fastx toolkit, samtools, vcftools, etc.), there are still times when knowing a little bit of Unix/Linux is extremely helpful, particularly utilities like awk, sed, cut, grep, and GNU parallel.
This is by no means an exhaustive catalog, but I've put together a short list of examples using various Unix/Linux utilities for text manipulation, from the very basic (e.g., sum a column) to the very advanced (munge a FASTQ file and print the total number of reads, the total number of unique reads, the percentage of unique reads, the most abundant sequence, and its frequency). Most of these examples (with the exception of the SeqTK examples) use built-in utilities installed on nearly every Linux system. These examples are a combination of tactics I use every day and examples culled from other sources listed at the top of the page.
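To give a taste of the kinds of one-liners in the list, here are a few of the simpler ones (file names are placeholders; the README has many more):

# Sum the values in the first column of a file
awk '{sum += $1} END {print sum}' data.txt

# Number of reads in a FASTQ file (four lines per read)
echo $(( $(wc -l < sample.fastq) / 4 ))

# Count the variants in a VCF that pass all filters
grep -v "^#" variants.vcf | awk '$7 == "PASS"' | wc -l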
The list is available as a README in this GitHub repo. This list is a start - I would love suggestions for other things to include. To make a suggestion, leave a comment here, or better - open an issue, or even better still - send me a pull request.
Useful one-liners for bioinformatics: https://github.com/stephenturner/oneliners
Alternatively, download a PDF here.