Monday, August 12, 2013

Understanding the ENSEMBL Schema

ENSEMBL is a frequently used resource for various genomics and transcriptomics tasks.  The ENSEMBL website and MART tools provide easy access to their rich database, but ENSEMBL also provides flat-file downloads of their entire database and a public MySQL portal.  You can access this using the MySQL Workbench using the following:
Host:     useastdb.ensembl.org
User:     anonymous
Once inside, you can get a sense for what the ENSEMBL schema (or data model) is like.  First, it’s important to understand the ENSEMBL ID system.  Most of the primary entities in the ENSEMBL database (genes, exons, transcripts, proteins) have a formal, stable identifier (beginning with ENSG, ENSE, ENST, and ENSP respectively) that does not change from build to build.  These entries can be found in the gene_stable_id tables.  All of these entities also have an internal identifier (an integer).  Once you have an internal ID for the entity of interest, details of the entity can be found in the genes, exons, transcripts, and translations (proteins) table. For example, the following query will retrieve a list of all transcripts and their exons for a given gene.
SELECT * FROM gene_stable_id a
inner join gene b on a.gene_id = b.gene_id
inner join transcript c on b.gene_id = c.gene_id
inner join exon_transcript d on c.transcript_id = d.transcript_id
inner join exon e on d.exon_id = e.exon_id
inner join transcript_stable_id f on c.transcript_id = f.transcript_id
inner join exon_stable_id g on e.exon_id = g.exon_id
The exon_transcript table contains a mapping of each exon to any transcripts containing it, and also contains a rank to indicate which exon it is relative to a given transcript.  To retrieve exons for a list of genes by their ENSEMBL IDs, these could be loaded into a table and joined to the gene_stable_id table in the query above.  To pull the build 37 chromosome and coordinates for an exon, use the following:
Select a.exon_id, b.name, a.seq_region_start, a.seq_region_end from exon a
inner join seq_region b on a.seq_region_id = b.seq_region_id
inner join coord_system c on b.coord_system_id = c.coord_system_id
where c.version = "GRCh37";
In this query, the seq_region table contains a field called name that identifies the contig to which the coordinates refer, in this case the chromosome number. 

There are also extensive cross-references in the ENSEMBL database.  To retrieve alternate identifiers for a set of transcripts, execute the following: 
select * from transcript_stable_id a
inner join transcript b on a.transcript_id = b.transcript_id
inner join object_xref c on b.transcript_id = c.ensembl_id
inner join xref d on c.xref_id = d.xref_id
inner join external_db e on d.external_db_id = e.external_db_id
where ensembl_object_type = "Transcript"
limit 20;
ENSEMBL organizes cross-references (xrefs) for all entity types into a single table object_xref.  This table contains an ensemble_object_type field that is a “Transcript”, “Gene”, or “Translation”, and an ensemble_id that matches either a gene_id, transcript_id, or a translation_id.  Replace “transcript” in the above query with “gene” or “translation” to retrieve gene or protein cross-references.  A list of all external cross-reference sources can be found by querying: 
Select db_name from external_db;
There is a great deal of information within the ENSEMBL database that can be accessed using SQL, which for some types of operations is easier than using the MART or web interface.  Full details of the ENSEMBL schema can be found here (http://useast.ensembl.org/info/docs/api/core/core_schema.html)


1 comment:

  1. Just a word of caution from my experience: Ensembl does not guarantee the stability of their database schema from release to release.
    In case you're trying build a data extraction pipeline, it may be safer to rely on the API or Biomart webservice.
    Julien

    ReplyDelete

Note: Only a member of this blog may post a comment.

Creative Commons License
Getting Genetics Done by Stephen Turner is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported License.