I work with gene lists on a nearly daily basis. Lists of genes near ChIP-seq peaks, lists of genes closest to a GWAS hit, lists of differentially expressed genes or transcripts from an RNA-seq experiment, lists of genes involved in certain pathways, etc. And lots of times I’ll need to convert these gene IDs from one identifier to another. There’s no shortage of tools to do this. I use Ensembl Biomart. But I do this so often that I got tired of hammering Ensembl’s servers whenever I wanted to convert from Ensembl to Entrez gene IDs for pathway mapping, get the chromosomal location for some BEDTools-y kinds of genomic arithmetic, or get the gene symbol and full description for reporting. So I used Biomart to retrieve the data that I use most often, cleaned up the column names, and saved this data as an R data package called annotables.
This package has basic annotation information from Ensembl release 82 for:
- Human (
grch38
) - Mouse (
grcm38
) - Rat (
rnor6
) - Chicken (
galgal4
) - Worm (
wbcel235
) - Fly (
bdgp6
)
Where each table contains:
ensgene
: Ensembl gene IDentrez
: Entrez gene IDsymbol
: Gene symbolchr
: Chromosomestart
: Startend
: Endstrand
: Strandbiotype
: Protein coding, pseudogene, mitochondrial tRNA, etc.description
: Full gene name/description.
Additionally, there are tables for human and mouse (
grch38_gt
and grcm38_gt
, respectively) that link ensembl gene IDs to ensembl transcript IDs.Usage
The package isn’t on CRAN, so you’ll need devtools to install it.
# If you haven't already installed devtools...
install.packages("devtools")
# Use devtools to install the package
devtools::install_github("stephenturner/annotables")
It isn’t necessary to load dplyr, but the tables are
tbl_df
and will print nicely if you have dplyr loaded.library(dplyr)
library(annotables)
Look at the human genes table (note the description column gets cut off because the table becomes too wide to print nicely):
grch38
## Source: local data frame [66,531 x 9]
##
## ensgene entrez symbol chr start end strand biotype
## (chr) (int) (chr) (chr) (int) (int) (int) (chr)
## 1 ENSG00000210049 NA MT-TF MT 577 647 1 Mt_tRNA
## 2 ENSG00000211459 NA MT-RNR1 MT 648 1601 1 Mt_rRNA
## 3 ENSG00000210077 NA MT-TV MT 1602 1670 1 Mt_tRNA
## 4 ENSG00000210082 NA MT-RNR2 MT 1671 3229 1 Mt_rRNA
## 5 ENSG00000209082 NA MT-TL1 MT 3230 3304 1 Mt_tRNA
## 6 ENSG00000198888 4535 MT-ND1 MT 3307 4262 1 protein_coding
## 7 ENSG00000210100 NA MT-TI MT 4263 4331 1 Mt_tRNA
## 8 ENSG00000210107 NA MT-TQ MT 4329 4400 -1 Mt_tRNA
## 9 ENSG00000210112 NA MT-TM MT 4402 4469 1 Mt_tRNA
## 10 ENSG00000198763 4536 MT-ND2 MT 4470 5511 1 protein_coding
## .. ... ... ... ... ... ... ... ...
## Variables not shown: description (chr)
Look at the human genes-to-transcripts table:
grch38_gt
## Source: local data frame [216,133 x 2]
##
## ensgene enstxp
## (chr) (chr)
## 1 ENSG00000210049 ENST00000387314
## 2 ENSG00000211459 ENST00000389680
## 3 ENSG00000210077 ENST00000387342
## 4 ENSG00000210082 ENST00000387347
## 5 ENSG00000209082 ENST00000386347
## 6 ENSG00000198888 ENST00000361390
## 7 ENSG00000210100 ENST00000387365
## 8 ENSG00000210107 ENST00000387372
## 9 ENSG00000210112 ENST00000387377
## 10 ENSG00000198763 ENST00000361453
## .. ... ...
Tables are
tbl_df
, pipe-able with dplyr:grch38 %>%
filter(biotype=="protein_coding" & chr=="1") %>%
select(ensgene, symbol, chr, start, end, description) %>%
head %>%
pander::pandoc.table(split.table=100, justify="llllll", style="rmarkdown")
ensgene | symbol | chr | start | end |
---|---|---|---|---|
ENSG00000158014 | SLC30A2 | 1 | 26037252 | 26046133 |
ENSG00000173673 | HES3 | 1 | 6244192 | 6245578 |
ENSG00000243749 | ZMYM6NB | 1 | 34981535 | 34985353 |
ENSG00000189410 | SH2D5 | 1 | 20719732 | 20732837 |
ENSG00000116863 | ADPRHL2 | 1 | 36088875 | 36093932 |
ENSG00000188643 | S100A16 | 1 | 153606886 | 153613145 |
Table: Table continues below
description |
---|
solute carrier family 30 (zinc transporter), member 2 [Source:HGNC Symbol;Acc:HGNC:11013] |
hes family bHLH transcription factor 3 [Source:HGNC Symbol;Acc:HGNC:26226] |
ZMYM6 neighbor [Source:HGNC Symbol;Acc:HGNC:40021] |
SH2 domain containing 5 [Source:HGNC Symbol;Acc:HGNC:28819] |
ADP-ribosylhydrolase like 2 [Source:HGNC Symbol;Acc:HGNC:21304] |
S100 calcium binding protein A16 [Source:HGNC Symbol;Acc:HGNC:20441] |
Example with RNA-seq data
Here’s an example with RNA-seq data. Specifically, DESeq2 results from the airway package, made tidy with biobroom:
# Load libraries (install with Bioconductor if you don't have them)
library(DESeq2)
library(airway)
# Load the data and do the RNA-seq data analysis
data(airway)
airway = DESeqDataSet(airway, design = ~cell + dex)
airway = DESeq(airway)
res = results(airway)
# tidy results with biobroom
library(biobroom)
res_tidy = tidy.DESeqResults(res)
head(res_tidy)
## Source: local data frame [6 x 7]
##
## gene baseMean estimate stderror statistic
## (chr) (dbl) (dbl) (dbl) (dbl)
## 1 ENSG00000000003 708.6021697 0.37424998 0.09873107 3.7906000
## 2 ENSG00000000005 0.0000000 NA NA NA
## 3 ENSG00000000419 520.2979006 -0.20215550 0.10929899 -1.8495642
## 4 ENSG00000000457 237.1630368 -0.03624826 0.13684258 -0.2648902
## 5 ENSG00000000460 57.9326331 0.08523370 0.24654400 0.3457140
## 6 ENSG00000000938 0.3180984 0.11555962 0.14630523 0.7898530
## Variables not shown: p.value (dbl), p.adjusted (dbl)
Now, make a table with the results (unfortunately, it’ll be split in this display, but you can write this to file to see all the columns in a single row):
res_tidy %>%
arrange(p.adjusted) %>%
head(20) %>%
inner_join(grch38, by=c("gene"="ensgene")) %>%
select(gene, estimate, p.adjusted, symbol, description) %>%
pander::pandoc.table(split.table=100, justify="lrrll", style="rmarkdown")
gene | estimate | p.adjusted | symbol |
---|---|---|---|
ENSG00000152583 | -4.316 | 4.753e-134 | SPARCL1 |
ENSG00000165995 | -3.189 | 1.44e-133 | CACNB2 |
ENSG00000101347 | -3.618 | 6.619e-125 | SAMHD1 |
ENSG00000120129 | -2.871 | 6.619e-125 | DUSP1 |
ENSG00000189221 | -3.231 | 9.468e-119 | MAOA |
ENSG00000211445 | -3.553 | 3.94e-107 | GPX3 |
ENSG00000157214 | -1.949 | 8.74e-102 | STEAP2 |
ENSG00000162614 | -2.003 | 3.052e-98 | NEXN |
ENSG00000125148 | -2.167 | 1.783e-92 | MT2A |
ENSG00000154734 | -2.286 | 4.522e-86 | ADAMTS1 |
ENSG00000139132 | -2.181 | 2.501e-83 | FGD4 |
ENSG00000162493 | -1.858 | 4.215e-83 | PDPN |
ENSG00000162692 | 3.453 | 3.563e-82 | VCAM1 |
ENSG00000179094 | -3.044 | 1.199e-81 | PER1 |
ENSG00000134243 | -2.149 | 2.73e-81 | SORT1 |
ENSG00000163884 | -4.079 | 1.073e-80 | KLF15 |
ENSG00000178695 | 2.446 | 6.275e-75 | KCTD12 |
ENSG00000146250 | 2.64 | 1.143e-69 | PRSS35 |
ENSG00000198624 | -2.784 | 1.707e-69 | CCDC69 |
ENSG00000148848 | 1.783 | 1.762e-69 | ADAM12 |
Table: Table continues below
description |
---|
SPARC-like 1 (hevin) [Source:HGNC Symbol;Acc:HGNC:11220] |
calcium channel, voltage-dependent, beta 2 subunit [Source:HGNC Symbol;Acc:HGNC:1402] |
SAM domain and HD domain 1 [Source:HGNC Symbol;Acc:HGNC:15925] |
dual specificity phosphatase 1 [Source:HGNC Symbol;Acc:HGNC:3064] |
monoamine oxidase A [Source:HGNC Symbol;Acc:HGNC:6833] |
glutathione peroxidase 3 [Source:HGNC Symbol;Acc:HGNC:4555] |
STEAP family member 2, metalloreductase [Source:HGNC Symbol;Acc:HGNC:17885] |
nexilin (F actin binding protein) [Source:HGNC Symbol;Acc:HGNC:29557] |
metallothionein 2A [Source:HGNC Symbol;Acc:HGNC:7406] |
ADAM metallopeptidase with thrombospondin type 1 motif, 1 [Source:HGNC Symbol;Acc:HGNC:217] |
FYVE, RhoGEF and PH domain containing 4 [Source:HGNC Symbol;Acc:HGNC:19125] |
podoplanin [Source:HGNC Symbol;Acc:HGNC:29602] |
vascular cell adhesion molecule 1 [Source:HGNC Symbol;Acc:HGNC:12663] |
period circadian clock 1 [Source:HGNC Symbol;Acc:HGNC:8845] |
sortilin 1 [Source:HGNC Symbol;Acc:HGNC:11186] |
Kruppel-like factor 15 [Source:HGNC Symbol;Acc:HGNC:14536] |
potassium channel tetramerization domain containing 12 [Source:HGNC Symbol;Acc:HGNC:14678] |
protease, serine, 35 [Source:HGNC Symbol;Acc:HGNC:21387] |
coiled-coil domain containing 69 [Source:HGNC Symbol;Acc:HGNC:24487] |
ADAM metallopeptidase domain 12 [Source:HGNC Symbol;Acc:HGNC:190] |
Explore!
This data can also be used for toying around with dplyr verbs and generally getting a sense of what’s in here. First, tet some help.
ls("package:annotables")
?grch38
Let’s join the transcript table to the gene table.
gt = grch38_gt %>%
inner_join(grch38, by="ensgene")
Now, let’s filter to get only protein-coding genes, group by the ensembl gene ID, summarize to count how many transcripts are in each gene, inner join that result back to the original gene list, so we can select out only the gene, number of transcripts, symbol, and description, mutate the description column so that it isn’t so wide that it’ll break the display, arrange the returned data descending by the number of transcripts per gene, head to get the top 10 results, and optionally, pipe that to further utilities to output a nice HTML table.
gt %>%
filter(biotype=="protein_coding") %>%
group_by(ensgene) %>%
summarize(ntxps=n_distinct(enstxp)) %>%
inner_join(grch38, by="ensgene") %>%
select(ensgene, ntxps, symbol, description) %>%
mutate(description=substr(description, 1, 20)) %>%
arrange(desc(ntxps)) %>%
head(10) %>%
pander::pandoc.table(split.table=100, justify="lrll", style="rmarkdown")
ensgene | ntxps | symbol | description |
---|---|---|---|
ENSG00000165795 | 77 | NDRG2 | NDRG family member 2 |
ENSG00000205336 | 77 | ADGRG1 | adhesion G protein-c |
ENSG00000196628 | 75 | TCF4 | transcription factor |
ENSG00000161249 | 68 | DMKN | dermokine [Source:HG |
ENSG00000154556 | 64 | SORBS2 | sorbin and SH3 domai |
ENSG00000166444 | 62 | ST5 | suppression of tumor |
ENSG00000204580 | 58 | DDR1 | discoidin domain rec |
ENSG00000087460 | 57 | GNAS | GNAS complex locus [ |
ENSG00000169398 | 57 | PTK2 | protein tyrosine kin |
ENSG00000104529 | 56 | EEF1D | eukaryotic translati |
Let’s look up DMKN (dermkine) in Ensembl. Search Ensembl for ENSG00000161249, or use this direct link. You can browse the table or graphic to see the splicing complexity in this gene.
Or, let’s do something different. Let’s group the data by what type of gene it is (e.g., protein coding, pseudogene, etc), get the number of genes in each category, and plot the top 20.
library(ggplot2)
grch38 %>%
group_by(biotype) %>%
summarize(n=n_distinct(ensgene)) %>%
arrange(desc(n)) %>%
head(20) %>%
ggplot(aes(reorder(biotype, n), n)) +
geom_bar(stat="identity") +
xlab("Type") +
theme_bw() +
coord_flip()
This seems very relevant for my daily work. Just wondering - as I'm not as proficient with R - how can I best get b37 instead of b38 data for humans? And another type of question: does it matter when human annotations are b37 and mice than remain b38?
ReplyDeleteThanks!
Please open an issue on GitHub and I'll answer there, thanks.
DeleteHi @swvanderlaan, I've updated the package to include grch37 data as well. see https://github.com/stephenturner/annotables
DeleteI like the use of dplyr etc since it's similar to how I like to work, but just wondering if you'd looked at AnnotationHub and also ensembldb and relaated annotation packages (eg EnsDb.Hsapiens.v75). The benefit of using these, which I think you're trying to get to, is that you're not dependent on a webservice in your analysis script, you just load the library and off you go, and you also explicitly state which version of ensembl you want to use.
ReplyDelete