Monday, September 19, 2016

Primers in computational biology

I recently stumbled across this collection of computational biology primers in Nature Biotechnology. Many of these are old, but they're still great resources to get a fundamental understanding of the topic. Here they are in no particular order.


How does multiple testing correction work?

What is principal component analysis?

SNP imputation in association studies

How does gene expression clustering work?

What is a hidden Markov model?

What is a support vector machine?

What is the expectation maximization algorithm?

Where did the BLOSUM62 alignment score matrix come from?

What are DNA sequence motifs?

How does DNA sequence motif discovery work?

What are decision trees?

What is dynamic programming?

What is Bayesian statistics?

What are artificial neural networks?

How does eukaryotic gene prediction work?

How to map billions of short reads onto genomes

How to apply de Bruijn graphs to genome assembly

Thursday, June 30, 2016

Syntax Highlight Code in Keynote or Powerpoint

I came across this awesome gist explaining how to syntax highlight code in Keynote. The same trick works for Powerpoint. Mac only.
  1. Install homebrew if you don’t have it already and brew install highlight.
  2. highlight -O rtf myfile.ext | pbcopy to highlight code to a formatted text converter in RTF output format, and copy the result to the system clipboard.
  3. Paste into Keynote or Powerpoint.
If I’ve got some code in a file called eset_pca.R:
I can simply highlight -O rtf eset_pca.R | pbcopy and then paste it right into Keynote or Powerpoint.

Wednesday, June 1, 2016

Covcalc: Shiny App for Calculating Coverage Depth or Read Counts for Sequencing Experiments

How many reads do I need? What's my sequencing depth? These are common questions I get all the time. Calculating how much sequence data you need to hit a target depth of coverage, or the inverse, what's the coverage depth given a set amount of sequencing, are both easy to answer with some basic algebra. Given one or the other, plus the genome size and read length/configuration, you can calculate either. This was inspired by a similar calculator written by James Hadfield, and was an opportunity for me to create my first Shiny app.

Check out the app here:

And the source code on GitHub:

Give it your read length, whether you're using single- or paired-end sequencing, select a genome or enter your own. Then, select whether you want to calculate (a) the number of reads you need to hit a target depth of coverage, or (b) the coverage depth you'll hit given a set number of sequencing reads. Once you make the selection, use the slider to adjust either the desired coverage or number of reads sequenced, and the output text below is automatically updated.

Shiny App: Coverage / Read Count Calculator

Friday, February 5, 2016

Shiny Developer Conference 2016 Recap

This is a guest post from VP Nagraj, a data scientist embedded within UVA’s Health Sciences Library, who runs our Data Analysis Support Hub (DASH) service.

Last weekend I was fortunate enough to be able to participate in the first ever Shiny Developer Conference hosted by RStudio at Stanford University. I’ve built a handful of apps, and have taught an introductory workshop on Shiny. In spite of that, almost all of the presentations de-mystified at least one aspect of the how, why or so what of the framework. Here’s a recap of what resonated with me, as well as some code and links out to my attempts to put what I digested into practice.


  • reactivity is a beast
  • javascript isn’t cheating
  • there are already a ton of shiny features … and more on the way


For me, understanding reactivity has been one of the biggest challenges to using Shiny … or at least to using Shiny well. But after > 3 hours of an extensive (and really clear) presentation by Joe Cheng, I think I’m finally starting to see what I’ve been missing. Here’s something in particular that stuck out to me:
output$plot = renderPlot() is not an imperative to the browser to do a what … it’s a recipe for how the browser should do something.
Shiny ‘render’ functions (e.g. renderPlot(), renderText(), etc) inherently depend on reactivity. What the point above emphasizes is that assignments to a reactive expression are not the same as assignments made in “regular” R programming. Reactive outputs depend on inputs, and subsequently change as those inputs are manipulated.
If you want to watch how those changes happen in your own app, try adding options(shiny.reactlog=TRUE) to the top of your server script. When you run the app in a browser and press COMMAND + F3 (or CTRL + F3 on Windows) you’ll see a force directed network that outlines the connections between inputs and outputs.
Another way to implement reactivity is with the reactive() function.
For my apps, one of the pitfalls has been re-running the same code multiple times. That’s a perfect use-case for reactivity outside of the render functions.
Here’s a trivial example:

    ui = fluidPage(
         numericInput("threshold", "mpg threshold", value = 20),

    server = function(input, output) {

        output$size = renderPlot({

            dat = subset(mtcars, mpg > input$threshold)


        output$names = renderText({

            dat = subset(mtcars, mpg > input$threshold)


shinyApp(ui = ui, server = server)
The code above works … but it’s redundant. There’s no need to calculate the “dat” object separately in each render function.
The code below does the same thing but stores “dat” in a reactive that is only calculated once.

ui = fluidPage(
    numericInput("threshold", "mpg threshold", value = 20),

server = function(input, output) {

    dat = reactive({

        subset(mtcars, mpg > input$threshold)


    output$size = renderPlot({



    output$names = renderText({



shinyApp(ui = ui, server = server)


For whatever reason I’ve been stuck on the idea that using JavaScript inside a Shiny app would be “cheating”. But Shiny is actually well equipped for extensions with JavaScript libraries. Several of the speakers leaned in on this idea. Yihui Xie presented on the DT package, which is an interface to use features like client-side filtering from the DataTables library. And Dean Attali demonstrated shinyjs, a package that makes it really easy to incorporate JavaScript operations.
Below is code for a masterpiece that that does some hide() and show():

  ui = fluidPage( 
        titlePanel(actionButton("start", "start the game")),
        hidden(actionButton("restart", "restart the game")),

  server = function(input, output) {

        output$game_over =
                "game over, man ... game over"

       observeEvent(input$start, {

            show("game_over", anim = TRUE, animType = "fade")

       observeEvent(input$restart, {


everything else


Adding a brush argument to plotOutput() let’s you click and drag to select a points on a plot. You can use this for “zooming in” on something like a time series plot. Here’s the code for an app I wrote based on data from the babynames package - in this case the brush let’s you zoom to see name frequency over specific range of years.


ui = fluidPage(titlePanel(title = "names (1880-2012)"),
                textInput("name", "enter a name"),
                actionButton("go", "search"),
                plotOutput("plot1", brush = "plot_brush"),


server = function(input, output) {

    dat = eventReactive(input$go, {

        subset(babynames, tolower(name) == tolower(input$name))


    output$plot1 = renderPlot({

        ggplot(dat(), aes(year, prop, col=sex)) + 
            geom_line() + 
            xlim(1880,2012) +
            theme_minimal() +
            # format labels with percent function from scales package
            scale_y_continuous(labels = percent) +
            labs(list(title ="% of individuals born with name by year and gender",
                      x = "\n click-and-drag over the plot to 'zoom'",
                      y = ""))


    output$plot2 = renderPlot({

        # need latest version of shiny to use req() function
        brushed = brushedPoints(dat(), input$plot_brush)

        ggplot(brushed, aes(year, prop, col=sex)) + 
            geom_line() +
            theme_minimal() +
            # format labels with percent function from scales package
            scale_y_continuous(labels = percent) +
            labs(list(title ="% of individuals born with name by year and gender",
                      x = "",
                      y = ""))


    output$info = renderText({

        "data source: social security administration names from babynames package
}) } shinyApp(ui, server)


A relatively easy way to leverage Shiny reactivity for visual inspection and interaction with data within RStudio. The main difference here is that you’re using an abbreviated (or ‘mini’) ui. The advantage of this workflow is that you can include it in your script to make your analysis interactive. I modified the example in the documentation and wrote a basic brushing gadget that removes outliers:

outlier_rm = function(data, xvar, yvar) {

    ui = miniPage(
        gadgetTitleBar("Drag to select points"),
            # The brush="brush" argument means we can listen for
            # brush events on the plot using input$brush.
            plotOutput("plot", height = "100%", brush = "brush")

    server = function(input, output, session) {

        # Render the plot
        output$plot = renderPlot({
            # Plot the data with x/y vars indicated by the caller.
            ggplot(data, aes_string(xvar, yvar)) + geom_point()

        # Handle the Done button being pressed.
        observeEvent(input$done, {

            # create id for data
            data$id = 1:nrow(data)

            # Return the brushed points. See ?shiny::brushedPoints.
            p = brushedPoints(data, input$brush)

            # create vector of ids that match brushed points and data
            g = which(p$id %in% data$id)

            # return a subset of the original data without brushed points

    runGadget(ui, server)

# run to open plot viewer
# click and drag to brush
# press done return a subset of the original data without brushed points
outlier_rm(gapminder, "lifeExp", "gdpPercap")

# you can also use the same method above but pass the output into a dplyr pipe syntax
# without the selection what is the mean life expectancy by country?
outlier_rm(gapminder, "lifeExp", "gdpPercap") %>%
    group_by(country) %>%


This solves the issue of requiring an input - I’m definitely going to use this so I don’t have to do the return(NULL) work around:
# no need to do do this any more
# inFile = input$file1
#         if (is.null(inFile))
#             return(NULL)

# use req() instead


Super helpful method for digging into the call stack of your R code to see how you might optimize it.
One or two seconds of processing can make a big difference, particularly for a Shiny app …

rstudio connect

Jeff Allen from RStudio gave a talk on deployment options for Shiny applications and mentioned this product, which is a “coming soon” platform for hosting apps alongside RMarkdown documents and plots. It’s not available as a full release yet, but there is a beta version for testing.

Friday, January 8, 2016

Repel overlapping text labels in ggplot2

A while back I showed you how to make volcano plots in base R for visualizing gene expression results. This is just one of many genome-scale plots where you might want to show all individual results but highlight or call out important results by labeling them, for example, with a gene name.
But if you want to annotate lots of points, the annotations usually get so crowded that they overlap one another and become illegible. There are ways around this - reducing the font size, or adjusting the position or angle of the text, but these usually don’t completely solve the problem, and can even make the visualization worse. Here’s the plot again, reading the results directly from GitHub, and drawing the plot with ggplot2 and geom_text out of the box.

What a mess. It’s difficult to see what any of those downregulated genes are on the left. Enter the ggrepel package, a new extension of ggplot2 that repels text labels away from one another. Just sub in geom_text_repel() in place of geom_text() and the extension is smart enough to try to figure out how to label the points such that the labels don’t interfere with each other. Here it is in action.

And the result (much better!):
See the ggrepel package vignette for more.
Creative Commons License
Getting Genetics Done by Stephen Turner is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported License.