ENSEMBL is a frequently used resource for various genomics
and transcriptomics tasks. The ENSEMBL
website and MART tools provide easy access to their rich database, but ENSEMBL
also provides flat-file downloads of their entire database and a public MySQL
portal. You can access this using the MySQL Workbench using the following:
Host: useastdb.ensembl.org
User: anonymous
Once inside, you can get a sense for what the ENSEMBL schema
(or data model) is like. First, it’s
important to understand the ENSEMBL ID system.
Most of the primary entities in the ENSEMBL database (genes, exons,
transcripts, proteins) have a formal, stable identifier (beginning with ENSG,
ENSE, ENST, and ENSP respectively) that does not change from build to
build. These entries can be found in the gene_stable_id tables. All of these
entities also have an internal identifier (an integer). Once you have an internal
ID for the entity of interest, details of the entity can be found in the genes,
exons, transcripts, and translations (proteins) table. For example, the following query will retrieve a list of all
transcripts and their exons for a given gene.
SELECT * FROM gene_stable_id a
inner join gene b on a.gene_id = b.gene_id
inner join transcript c on b.gene_id = c.gene_id
inner join exon_transcript d on c.transcript_id = d.transcript_id
inner join exon e on d.exon_id = e.exon_id
inner join transcript_stable_id f on c.transcript_id = f.transcript_id
inner join exon_stable_id g on e.exon_id = g.exon_id
The exon_transcript table contains a mapping of each exon to
any transcripts containing it, and also contains a rank to indicate which exon it is relative to a given
transcript. To retrieve exons for a list
of genes by their ENSEMBL IDs, these could be loaded into a table and joined to
the gene_stable_id table in the query above.
To pull the build 37 chromosome and coordinates for an exon, use the
following:
Select a.exon_id, b.name, a.seq_region_start, a.seq_region_end from exon a
inner join seq_region b on a.seq_region_id = b.seq_region_id
inner join coord_system c on b.coord_system_id = c.coord_system_id
where c.version = "GRCh37";
In this
query, the seq_region table contains a field called name that identifies the contig to which the coordinates refer, in
this case the chromosome number.
There are
also extensive cross-references in the ENSEMBL database. To retrieve alternate identifiers for a set
of transcripts, execute the following:
select * from transcript_stable_id a
inner join transcript b on a.transcript_id = b.transcript_id
inner join object_xref c on b.transcript_id = c.ensembl_id
inner join xref d on c.xref_id = d.xref_id
inner join external_db e on d.external_db_id = e.external_db_id
where ensembl_object_type = "Transcript"
limit 20;
ENSEMBL
organizes cross-references (xrefs) for all entity types into a single table
object_xref. This table contains an ensemble_object_type
field that is a “Transcript”, “Gene”, or “Translation”, and an ensemble_id that
matches either a gene_id, transcript_id, or a translation_id. Replace “transcript” in the above query with “gene”
or “translation” to retrieve gene or protein cross-references. A list of all external cross-reference
sources can be found by querying:
Select db_name from external_db;
There is a
great deal of information within the ENSEMBL database that can be accessed
using SQL, which for some types of operations is easier than using the MART or
web interface. Full details of the
ENSEMBL schema can be found here (http://useast.ensembl.org/info/docs/api/core/core_schema.html)
Just a word of caution from my experience: Ensembl does not guarantee the stability of their database schema from release to release.
ReplyDeleteIn case you're trying build a data extraction pipeline, it may be safer to rely on the API or Biomart webservice.
Julien