
Saturday, January 14, 2017

RStudio Conference 2017 Recap

The first ever RStudio conference was held January 11-14, 2017 in Orlando, FL. For anyone else like me who spends hours each working day staring into an RStudio session, the conference was truly excellent. The speaker lineup was diverse and covered lots of areas related to development in R, including the tidyverse, the RStudio IDE, Shiny, htmlwidgets, and authoring with RMarkdown.
This is not a complete list by any means; with split sessions I could only go to half the talks at most. Here are some notes and links to slides and resources for some of the awesome things people are doing with R and RStudio that I learned about at the conference.

Hadley Wickham kicked off the meeting with a keynote on doing data science in R. The talk focused on the tidyverse, and on the notion of splitting functions into commands that do something versus queries that calculate something, and how it’s generally a good idea to keep these different functionalities contained in their own separate functions. (Contrast this with something like lm, which both computes values and does things, like printing those values to the screen, making its results difficult to capture; see broom.)
I asked Hadley after his talk about strategies to reduce issues getting Bioconductor data structures to play nicely with tidyverse tools. Within minutes David Robinson released a new feature in the fuzzyjoin package that leverages IRanges within this tidyverse-friendly package for efficiently doing things like joining on genomic intervals.
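Here’s a quick sketch (my own toy example, not David’s code) of what that looks like with fuzzyjoin’s genome_inner_join(), which joins two data frames on overlapping intervals within the same chromosome:

library(fuzzyjoin)

# two hypothetical sets of genomic intervals
x = data.frame(chromosome = c("chr1", "chr1"), start = c(100, 500), end = c(200, 600), gene = c("a", "b"))
y = data.frame(chromosome = "chr1", start = 150, end = 520, peak = "p1")

# keep pairs of rows whose intervals overlap on the same chromosome (IRanges does the work under the hood)
genome_inner_join(x, y, by = c("chromosome", "start", "end"))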
Charlotte Wickham’s 2-hour purrr tutorial was awesome. Here’s a link to a shared dropbox folder with code, challenges, slides, data, etc. The purrr package is a core package in the tidyverse, and I’ll be replacing many of the base *apply and plyr **ply functions that I still use here and there. The map_* functions are integral to working with nested list-columns in dplyr, and I think I’m finally starting to grok how to work with these.
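As a minimal sketch of my own (not code from the tutorial), here’s the flavor of that replacement: map() always returns a list, while the typed map_* variants guarantee the output type that sapply() only sometimes gives you.

library(purrr)

map(mtcars, mean)                 # like lapply(): a list of column means
map_dbl(mtcars, mean)             # named double vector; errors if any result isn't a length-1 number
map_chr(mtcars, ~ class(.x)[1])   # character vector giving the class of each column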
Jenny Bryan gave a great talk on list columns. You can see her slides here. Jenny also put together this excellent tutorial with lots of worked examples and code snippets. And if you need some example list data structures for more practice or for teaching that aren’t foo/bar/iris/mtcars-level boring, see her repurrrsive package. Related to this, for more on list columns and purrr map functions, start reading at the “Many Models” section of Hadley’s R for Data Science book.
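Here’s a small sketch (mine, loosely following the Many Models chapter rather than Jenny’s slides) of the list-column workflow: nest a data frame by group, map a model over the nested data, then pull a per-group summary back out.

library(dplyr)
library(tidyr)
library(purrr)

mtcars %>%
  group_by(cyl) %>%
  nest() %>%                                          # one row per cyl; 'data' is a list-column of data frames
  mutate(fit = map(data, ~ lm(mpg ~ wt, data = .x)),  # list-column of fitted models
         rsq = map_dbl(fit, ~ summary(.x)$r.squared)) # scalar summary extracted from each model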
Julia Silge, data scientist at Stack Overflow, gave a great introduction to tidy text mining with R. You can read Julia and David’s Tidy Text Mining with R book here online (the book was authored in Rmarkdown using bookdown!).
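As a taste of the approach (a minimal sketch assuming the janeaustenr package that the book uses for its examples), tidy text mining comes down to getting text into a one-token-per-row data frame and then using ordinary dplyr verbs:

library(dplyr)
library(tidytext)
library(janeaustenr)

austen_books() %>%
  unnest_tokens(word, text) %>%    # one row per word
  anti_join(stop_words) %>%        # drop common stop words (joins on 'word')
  count(book, word, sort = TRUE)   # most frequent words in each book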
Andrew Flowers, data journalist and former writer at FiveThirtyEight, gave the second day’s keynote address on finding and telling stories using R. He gave a series of examples illustrating six motivating features that make data stories worth telling, along with the potential danger inherent in each one:
  1. Novelty (potential danger: triviality)
  2. Outlier (spurious result; see also, p-hacking)
  3. Archetype (oversimplification)
  4. Trend (variance)
  5. Debunking (confirmation bias)
  6. Forecast (overfitting)
Yihui Xie led a two-hour tutorial on advanced RMarkdown. You can see his slides here. A few highlights:
  • The rticles package has LaTeX Journal Article Templates for R Markdown for a number of journals.
  • The tufte package now supports both PDF and HTML output. See an example here.
  • Yihui’s xaringan package ports the remark.js library for slideshows into R. Careful: Yihui warns that you may not sleep after learning how cool remark.js is.
  • Yihui also showed an early version of the in-development blogdown package, which builds blog-aware static websites using the blazing-fast and well-documented Hugo static site generator.
  • Finally, the bookdown package is just awesome. It takes multiple RMarkdown documents as input and renders them into multiple output formats (screen-readable ebook, PDF, epub, etc.). It looks great for writing books and technical documentation, with pushbutton publishing to multiple output formats and some nice built-in styles out of the box. Some examples:
Finally, a few gems from other talks that I jotted down:
  • Chester Ismay gave a great talk on teaching introductory statistics using R, with the open-source course textbook written in RMarkdown using bookdown.
  • Bob Rudis talked about using pipes (%>%), and pipes within pipes, and best piping practices. See his slides here.
  • Hilary Parker talked about the idea of analysis development (and analysis developers), drawing similarities to software development and developers. Hilary discussed this once before on the excellent podcast that she and Roger Peng host, and you can probably find it in their Conversations On Data Science ebook that summarizes and transcribes these conversations.
  • Simon Jackson introduced the corrr package for exploring and manipulating correlations and correlation matrices in a tidy way (see the sketch just after this list).
  • Gordon Shotwell introduced the easymake package that generates Makefiles from a data frame using R.
  • Karthik Ram quickly introduced several of the (many) rOpenSci packages related to data publication, data access, scientific literature access, scalable & reproducible computing, databases, visualization, taxonomy, geospatial analysis, and many utility tools for data analysis and manipulation.
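Here’s a quick sketch (my own, not Simon’s example) of the corrr idea, where correlations live in a data frame you can manipulate with ordinary dplyr verbs:

library(corrr)
library(dplyr)

mtcars %>%
  correlate() %>%             # correlation data frame instead of a matrix
  focus(mpg) %>%              # keep only each variable's correlation with mpg
  arrange(desc(abs(mpg)))     # strongest correlations with mpg first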

With split sessions I missed more than half the talks. Lots of people here are active on Twitter, and you can catch many more notes and tidbits on the #rstudioconf hashtag. The meeting was superbly organized, I learned a ton, and I enjoyed meeting in person many of the folks I follow on Twitter and elsewhere online. A few days of 80-degree weather in mid-January didn’t hurt either. I’ll definitely be coming again next year. Kudos to the rstudio::conf organizers and speakers!
All the talks were recorded and will supposedly find their way to rstudio.com at some point soon. I’ll update this post with a link when that happens.


Update Feb 16, 2017: All the talks have now been posted online here under the rstudio::conf2017 heading.
Day 1:
Day 2:

Friday, February 5, 2016

Shiny Developer Conference 2016 Recap

This is a guest post from VP Nagraj, a data scientist embedded within UVA’s Health Sciences Library, who runs our Data Analysis Support Hub (DASH) service.

Last weekend I was fortunate enough to participate in the first-ever Shiny Developer Conference, hosted by RStudio at Stanford University. I’ve built a handful of apps and have taught an introductory workshop on Shiny. In spite of that, almost all of the presentations demystified at least one aspect of the how, why, or so-what of the framework. Here’s a recap of what resonated with me, as well as some code and links out to my attempts to put what I digested into practice.

tl;dr

  • reactivity is a beast
  • javascript isn’t cheating
  • there are already a ton of shiny features … and more on the way

reactivity

For me, understanding reactivity has been one of the biggest challenges to using Shiny … or at least to using Shiny well. But after more than three hours of an extensive (and really clear) presentation by Joe Cheng, I think I’m finally starting to see what I’ve been missing. Here’s something in particular that stuck out to me:
output$plot = renderPlot() is not an imperative to the browser to do something … it’s a recipe for how the browser should do something.
Shiny ‘render’ functions (e.g. renderPlot(), renderText(), etc) inherently depend on reactivity. What the point above emphasizes is that assignments to a reactive expression are not the same as assignments made in “regular” R programming. Reactive outputs depend on inputs, and subsequently change as those inputs are manipulated.
If you want to watch how those changes happen in your own app, try adding options(shiny.reactlog=TRUE) to the top of your server script. When you run the app in a browser and press COMMAND + F3 (or CTRL + F3 on Windows), you’ll see a force-directed network that outlines the connections between inputs and outputs.
Another way to implement reactivity is with the reactive() function.
For my apps, one of the pitfalls has been re-running the same code multiple times. That’s a perfect use-case for reactivity outside of the render functions.
Here’s a trivial example:
library(shiny)

ui = fluidPage(
    numericInput("threshold", "mpg threshold", value = 20),
    plotOutput("size"),
    textOutput("names")
)

server = function(input, output) {

    output$size = renderPlot({

        dat = subset(mtcars, mpg > input$threshold)
        hist(dat$wt)

    })

    output$names = renderText({

        dat = subset(mtcars, mpg > input$threshold)
        rownames(dat)

    })
}

shinyApp(ui = ui, server = server)
The code above works … but it’s redundant. There’s no need to calculate the “dat” object separately in each render function.
The code below does the same thing but stores “dat” in a reactive that is only calculated once.
library(shiny)

ui = fluidPage(
    numericInput("threshold", "mpg threshold", value = 20),
    plotOutput("size"),
    textOutput("names")
)

server = function(input, output) {

    dat = reactive({

        subset(mtcars, mpg > input$threshold)

    })

    output$size = renderPlot({

        hist(dat()$wt)

    })

    output$names = renderText({

        rownames(dat())

    })
}

shinyApp(ui = ui, server = server)

javascript

For whatever reason I’ve been stuck on the idea that using JavaScript inside a Shiny app would be "cheating". But Shiny is actually well equipped for extension with JavaScript libraries, and several of the speakers leaned into this idea. Yihui Xie presented on the DT package, an interface to the DataTables JavaScript library with features like client-side filtering. And Dean Attali demonstrated shinyjs, a package that makes it really easy to incorporate JavaScript operations.
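For a sense of how little code that takes, here’s a minimal sketch of mine (not code from Yihui’s demo) that renders a client-side filterable table with DT:

library(shiny)
library(DT)

shinyApp(
  ui = fluidPage(DT::dataTableOutput("tbl")),
  server = function(input, output) {
    # filter = "top" adds per-column filter boxes; filtering runs client-side in the browser
    output$tbl = DT::renderDataTable(mtcars, filter = "top", options = list(pageLength = 10))
  }
)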
Below is code for a shinyjs masterpiece that does some hide() and show():
# https://apps.bioconnector.virginia.edu/game
library(shiny)
library(shinyjs)
shinyApp(

  ui = fluidPage( 
        titlePanel(actionButton("start", "start the game")),
        useShinyjs(),
        hidden(actionButton("restart", "restart the game")),
        tags$h3(hidden(textOutput("game_over")))
  ),

  server = function(input, output) {

        output$game_over =
            renderText({
                "game over, man ... game over"
            })  

       observeEvent(input$start, {

            show("game_over", anim = TRUE, animType = "fade")
            hide("start")
            show("restart")
        })

       observeEvent(input$restart, {
            hide("game_over")
            hide("restart")
            show("start")
        })

  }
)

everything else

brushing

Adding a brush argument to plotOutput() lets you click and drag to select points on a plot. You can use this for "zooming in" on something like a time series plot. Here’s the code for an app I wrote based on data from the babynames package; in this case the brush lets you zoom in to see name frequency over a specific range of years.
# http://apps.bioconnector.virginia.edu/names/
library(shiny)
library(ggplot2)
library(ggthemes)
library(babynames)
library(scales)

options(scipen=999)

ui = fluidPage(titlePanel(title = "names (1880-2012)"),
                textInput("name", "enter a name"),
                actionButton("go", "search"),
                plotOutput("plot1", brush = "plot_brush"),
                plotOutput("plot2"),
                htmlOutput("info")

)

server = function(input, output) {

    dat = eventReactive(input$go, {

        subset(babynames, tolower(name) == tolower(input$name))

    })

    output$plot1 = renderPlot({

        ggplot(dat(), aes(year, prop, col=sex)) + 
            geom_line() + 
            xlim(1880,2012) +
            theme_minimal() +
            # format labels with percent function from scales package
            scale_y_continuous(labels = percent) +
            labs(list(title ="% of individuals born with name by year and gender",
                      x = "\n click-and-drag over the plot to 'zoom'",
                      y = ""))

    })

    output$plot2 = renderPlot({

        # need latest version of shiny to use req() function
        req(input$plot_brush)
        brushed = brushedPoints(dat(), input$plot_brush)

        ggplot(brushed, aes(year, prop, col=sex)) + 
            geom_line() +
            theme_minimal() +
            # format labels with percent function from scales package
            scale_y_continuous(labels = percent) +
            labs(list(title ="% of individuals born with name by year and gender",
                      x = "",
                      y = ""))

    })

    output$info = renderText({

        "data source: social security administration names from babynames package"

    })
}

shinyApp(ui, server)

gadgets

Shiny Gadgets are a relatively easy way to leverage Shiny reactivity for visual inspection of and interaction with data from within RStudio. The main difference here is that you’re using an abbreviated (or ‘mini’) UI. The advantage of this workflow is that you can include it in your script to make your analysis interactive. I modified the example in the documentation and wrote a basic brushing gadget that removes outliers:
library(shiny)
library(miniUI)
library(ggplot2)

outlier_rm = function(data, xvar, yvar) {

    ui = miniPage(
        gadgetTitleBar("Drag to select points"),
        miniContentPanel(
            # The brush="brush" argument means we can listen for
            # brush events on the plot using input$brush.
            plotOutput("plot", height = "100%", brush = "brush")
            )
        )

    server = function(input, output, session) {

        # Render the plot
        output$plot = renderPlot({
            # Plot the data with x/y vars indicated by the caller.
            ggplot(data, aes_string(xvar, yvar)) + geom_point()
        })

        # Handle the Done button being pressed.
        observeEvent(input$done, {

            # create id for data
            data$id = 1:nrow(data)

            # Return the brushed points. See ?shiny::brushedPoints.
            p = brushedPoints(data, input$brush)

            # find which rows of the original data were brushed
            g = which(data$id %in% p$id)

            # return a subset of the original data without brushed points
            stopApp(data[-g,])
        })
    }

    runGadget(ui, server)
}

# run to open plot viewer
# click and drag to brush
# press done to return a subset of the original data without the brushed points
library(gapminder)
outlier_rm(gapminder, "lifeExp", "gdpPercap")

# you can also pipe the returned data frame straight into a dplyr chain
# with the brushed points removed, what is the mean life expectancy by country?
library(dplyr)
outlier_rm(gapminder, "lifeExp", "gdpPercap") %>%
    group_by(country) %>%
    summarise(mean(lifeExp))

req()

This solves the issue of requiring an input; I’m definitely going to use this so I don’t have to do the return(NULL) workaround:
# no need to do this any more
# 
# inFile = input$file1
# 
#         if (is.null(inFile))
#             return(NULL)

# use req() instead
req(input$file1)

profvis

profvis is a super helpful package for digging into the call stack of your R code to see where you might be able to optimize it.
One or two seconds of processing can make a big difference, particularly for a Shiny app …
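Here’s a minimal sketch (mine, not from the talk) of what using it looks like: wrap the code you want to profile in profvis() and an interactive flame graph opens in the RStudio viewer.

library(profvis)

profvis({
  # an arbitrary chunk of work to profile
  d = do.call(rbind, replicate(100, mtcars, simplify = FALSE))
  fits = lapply(split(d, d$cyl), function(x) lm(mpg ~ wt + hp, data = x))
  sapply(fits, function(f) summary(f)$r.squared)
})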

rstudio connect

Jeff Allen from RStudio gave a talk on deployment options for Shiny applications and mentioned this product, which is a “coming soon” platform for hosting apps alongside RMarkdown documents and plots. It’s not available as a full release yet, but there is a beta version for testing.

Monday, November 2, 2015

Software from CSHL Genome Informatics 2015

I just returned from the Genome Informatics meeting at Cold Spring Harbor. This was, hands down, the best scientific conference I've been to in years. The quality of the talks and posters was excellent, and it was great meeting in person many of the scientists and developers whose tools and software I use on a daily basis. To get a sense of what the meeting was about, 140 characters at a time, you can access all the Tweets sent Oct 28-31 2015 tagged #gi2015 at this link.

Below is a very short list of software that was presented at GI2015. This is only a tiny slice of the tools and methods that were presented at the meeting, and the list is highly biased toward tools that I personally find interesting or useful to my own work (please don't be offended if I omitted your stuff, and feel free to mention it in the comments).

Monocle: Software for analyzing single-cell RNA-seq data
Paper: http://www.nature.com/nbt/journal/v32/n4/full/nbt.2859.html
Software: http://cole-trapnell-lab.github.io/monocle-release/

Kallisto: very fast RNA-seq transcript abundance estimation using pseudoalignment.
Preprint: http://arxiv.org/abs/1505.02710
Software: http://pachterlab.github.io/kallisto/about.html

Sleuth: R package for analyzing & reporting differential expression from transcript abundances estimated with Kallisto.
Preprint: coming soon?
Software: http://pachterlab.github.io/sleuth/about.html
See also: The bear's lair (http://lair.berkeley.edu/): reanalysis of published RNA-seq studies using kallisto+sleuth.

QoRTs: Quality of RNA-Seq Toolset. Toolkit for QC, gene/junction counting, and other miscellaneous downstream processing from RNA-seq alignments.

JunctionSeq: R package for testing differential junction usage with RNA-seq data.

HISAT2: RNA-seq alignment against populations of genomes (aligns DNA also).
Software: http://ccb.jhu.edu/software/hisat2/index.shtml

Rail: software for aligning many-sample RNA-seq data, producing alignments, genome coverage bigWigs, and splice junction BED files.
Software: http://rail.bio
Preprint: http://biorxiv.org/content/early/2015/08/11/019067

LobSTR: genotype short tandem repeats from NGS data.
Software: http://melissagymrek.com/lobstr-code/
Paper: http://www.ncbi.nlm.nih.gov/pubmed/22522390

Basset: convolutional neural networks for learning functional/regulatory features of DNA sequence.
Software: https://github.com/davek44/Basset
Preprint: http://biorxiv.org/content/early/2015/10/05/028399

Genotype Query Tools (GQT): fast/efficient individual-level queries of large-scale variation data.
Software: https://github.com/ryanlayer/gqt
Preprint: http://biorxiv.org/content/early/2015/06/05/018259

Centrifuge: a metagenomics classifier.
Software: https://github.com/infphilo/centrifuge
Poster: http://www.ccb.jhu.edu/people/infphilo/data/Centrifuge-poster.pdf

Mash: MinHash-based method for rapidly estimating pairwise distances between genomes or metagenomes.
Software: https://github.com/marbl/Mash
Docs: http://mash.readthedocs.org/en/latest/
Preprint: http://biorxiv.org/content/early/2015/10/26/029827

VCFanno: ultrafast large-sample VCF annotation
Software: https://github.com/brentp/vcfanno

Ginkgo: Interactive analysis and assessment of single-cell copy-number variations
Paper: http://www.nature.com/nmeth/journal/v12/n11/full/nmeth.3578.html
Software: https://github.com/robertaboukhalil/ginkgo

StringTie: RNA-seq transcript assembly+quantification, with or without a reference. See paper for comparison to existing tools.
Software: http://ccb.jhu.edu/software/stringtie/
Source: https://github.com/gpertea/stringtie
Poster: http://ccb.jhu.edu/software/stringtie/cshl2015.pdf
Paper: http://www.nature.com/nbt/journal/v33/n3/full/nbt.3122.html




Friday, April 10, 2015

Translational Bioinformatics Year In Review

Per tradition, Russ Altman gave his "Translational Bioinformatics: The Year in Review" presentation at the close of the AMIA Joint Summit on Translational Bioinformatics in San Francisco on March 26th.  This year, papers came from six key areas (and a final Odds and Ends category).  His full slide deck is available here.

I always enjoy this talk because it routinely points me to new collections of data and new software tools that are useful for a variety of analyses; as such, I thought I would highlight these resources from his talk this year.

GRASP: analysis of genotype-phenotype results from 1390 genome-wide association studies and corresponding open access database
Some of you may have accessed the Johnson and O'Donnell catalog of GWAS results published in 2009.  This data set was a more extensive collection of GWAS findings than the popular NHGRI GWAS catalog, as it did not impose a genome-wide significance threshold for reported associations.  The GRASP database is a similar effort, reporting numerous attributes of each study.
A zip archive of the full data set (a flat file) is available here.


Effective diagnosis of genetic disease by computational phenotype analysis of the disease associated genome
This paper tackles the enormously complex task of diagnosing rare genetic diseases using a combination of genetic variants (from a VCF file), a list of phenotype characteristics (fed from the Human Phenotype Ontology), and a few other aspects of the disease.
The online tool called PhenIX is available here.


A network based method for analysis of lncRNA disease associations and prediction of lncRNAs implicated in diseases
Here, Yang et al. examine relationships between known long non-coding RNAs and disease using graph propagation.  Their underlying database, however, was generated using PubMed mining along with some manual curation.
Their lncRNA-Disease database is available here.


SNPsea: an algorithm to identify cell types, tissues and pathways affected by risk loci
This tool performs a type of SNP set enrichment analysis, designed to look at functional enrichment in the context of particular tissues and cell types.  The tool is a C++ executable, available for download here.
The data sources underlying the SNPsea algorithm are available here.


Human symptoms-disease network
Here Zhou et al. systematically extract a symptom-to-disease network by exploiting MeSH annotations.  They compiled a list of 322 symptoms and 4,442 diseases from the MeSH vocabulary, and document their occurrence within PubMed.  Using this disease-symptom network, the authors explore the biological underpinnings of certain symptoms by looking at shared genomic elements between diseases with similar symptoms.
The full list of ~130,000 edges in their disease-symptom network is available here.


A circadian gene expression atlas in mammals: implications for biology and medicine
This fascinating paper explores the temporal impact on gene expression traits from 12 mouse organs.  By systematically collecting transcriptome data from these tissues at two-hour intervals, the authors construct a temporal atlas of gene expression, and show that 43% of protein-coding genes have a circadian expression profile.
The accompanying CircaDB database is available online here.


dRiskKB: a large-scale disease-disease risk relationship knowledge base constructed from biomedical text
The authors of dRiskKB use text mining across MEDLINE citations using a controlled disease vocabulary, in this case the Human Disease Ontology, to generate pairs of diseases that co-occur with specific patterns in abstract text. These pairs are ranked with a scoring algorithm and provide a new resource for disease co-morbidity relationships.
The flat file data driving dRiskKB can be found online here.


A tissue-based map of the human proteome
In this major effort, a group of investigators have published the most detailed atlas of human protein expression to date.  The transcriptome has been extensively studied across human tissues, but it remains unclear to what extent transcriptional activity reflects translation into protein.  But most importantly, the data are searchable via a beautiful website.
The underlying data from the Human Protein Atlas is available here.



Tuesday, April 22, 2014

Russ Altman's Translational Bioinformatics Year in Review

A few weeks ago the 2014 AMIA Translational Bioinformatics Meeting (TBI) was held in beautiful San Francisco.  This meeting is full of great science that spans the divide between molecular and clinical research, but a true highlight is the closing keynote, traditionally given by Russ Altman.  Each year, he highlights the most exciting articles/publications in translational bioinformatics, and presents them all in a tidy, entertaining talk.  You can find his blog here, along with a link to his slide deck.

To make accessing the papers a bit easier, here is a list of links.

Controversies:



Clinical Genomics:



Drugs:



Genetic Basis of Disease:



Emerging Data Sources:



Mice:



The Scientific Process:



Odds & Ends:





