ENSEMBL is a frequently used resource for various genomics
and transcriptomics tasks. The ENSEMBL
website and MART tools provide easy access to their rich database, but ENSEMBL
also provides flat-file downloads of their entire database and a public MySQL
portal. You can access this using the
MySQL Workbench using the following:
Host: useastdb.ensembl.org
User: anonymous
Once inside, you can get a sense for what the ENSEMBL schema
(or data model) is like. First, it’s
important to understand the ENSEMBL ID system.
Most of the primary entities in the ENSEMBL database (genes, exons,
transcripts, proteins) have a formal, stable identifier (beginning with ENSG,
ENSE, ENST, and ENSP respectively) that does not change from build to
build. These entries can be found in the gene_stable_id tables. All of these
entities also have an internal identifier (an integer). Once you have an internal
ID for the entity of interest, details of the entity can be found in the genes,
exons, transcripts, and translations (proteins) table. For example, the following query will retrieve a list of all
transcripts and their exons for a given gene.
SELECT *
FROM gene_stable_id a
inner join
gene b on a.gene_id = b.gene_id
inner join
transcript c on b.gene_id = c.gene_id
inner join
exon_transcript d on c.transcript_id = d.transcript_id
inner join
exon e on d.exon_id = e.exon_id
inner join
transcript_stable_id f on c.transcript_id = f.transcript_id
inner join
exon_stable_id g on e.exon_id = g.exon_id
The exon_transcript table contains a mapping of each exon to
any transcripts containing it, and also contains a rank to indicate which exon it is relative to a given
transcript. To retrieve exons for a list
of genes by their ENSEMBL IDs, these could be loaded into a table and joined to
the gene_stable_id table in the query above.
To pull the build 37 chromosome and coordinates for an exon, use the
following:
Select
a.exon_id, b.name, a.seq_region_start, a.seq_region_end from exon a
inner join
seq_region b on a.seq_region_id = b.seq_region_id
inner join
coord_system c on b.coord_system_id = c.coord_system_id
where
c.version = "GRCh37";
In this
query, the seq_region table contains a field called name that identifies the contig to which the coordinates refer, in
this case the chromosome number.
There are
also extensive cross-references in the ENSEMBL database. To retrieve alternate identifiers for a set
of transcripts, execute the following:
select *
from transcript_stable_id a
inner join
transcript b on a.transcript_id = b.transcript_id
inner join
object_xref c on b.transcript_id = c.ensembl_id
inner join
xref d on c.xref_id = d.xref_id
inner join
external_db e on d.external_db_id = e.external_db_id
where
ensembl_object_type = "Transcript"
limit 20;
ENSEMBL
organizes cross-references (xrefs) for all entity types into a single table
object_xref. This table contains an ensemble_object_type
field that is a “Transcript”, “Gene”, or “Translation”, and an ensemble_id that
matches either a gene_id, transcript_id, or a translation_id. Replace “transcript” in the above query with “gene”
or “translation” to retrieve gene or protein cross-references. A list of all external cross-reference
sources can be found by querying:
Select db_name
from external_db;