
Wednesday, November 20, 2013

Using Database Joins to Compare Result Sets

One of the most powerful tools you can learn to use in genomics research is a relational database system, such as MySQL.  These systems are fairly easy to set up and use, and give users the ability to organize and manipulate data and statistical results with simple commands.  As a graduate student (during the height of GWAS), I found that this single skill quickly turned me into an “expert”.  By storing the SNP lists for common GWAS platforms and some simple annotations from the UCSC and ENSEMBL databases, I could quickly provide lists of SNPs within a gene or collection of genes, or pull a list of SNPs that overlap two genotyping platforms.  We even developed database modules that allowed us to easily define LD blocks within a database query (called LD-Spline).

Once you learn the basics of defining tables and loading data, you can start to join tables together, matching them on a common field.  This is where the true power of a database system lies.  Suppose you have two sets of results from a PLINK analysis, one from a discovery dataset and another from a replication.  Rather than clumsily matching two sets of results within a spreadsheet application, a few simple queries within MySQL will tell you which SNPs are in common between the two sets, which were not found in the replication set, which SNPs were significant in the first set but not the second, etc.  

The concept that makes these operations work is the idea of a primary key.  A primary key is some field of a dataset that uniquely identifies each row of the table/dataset.  In the above example of PLINK results, a good primary key might be the RS number of the SNP.  You can also uniquely identify rows based on two columns, a concept known as a composite key – for example, the chromosome AND position of a SNP.  Establishing a primary key allows MySQL to keep data stored in a sorted order and allows the matching operations for table joins to be performed much faster. 

Having this sorted order from a primary key prevents MySQL from having to scan an entire table to find a specific value.  Much like the index of a book, a primary key lets MySQL find a value within a table very quickly.  If a table is small, having a primary key is not as critical; the computer can quickly scan the entire contents of the table for any query.  If the table is large, however, a full scan of the entire table could be a costly operation, and the number of table scans required increases when doing a join.  For example, if we join tables for our discovery and replication result sets, the database system will take the RS number for each entry from the discovery table and attempt to find a matching RS number in the replication table.  If the replication table has the RS number as a primary key, the database system can very quickly find this entry. There is a fantastic post on the various types of database joins here.

Let's start by creating our database tables.  A typical PLINK association output contains 12 columns (CHR, SNP, BP, A1, TEST, NMISS, OR, SE, L95, U95, STAT, P).  In these tables, we've established the SNP column as the primary key.  Recall that the primary key must uniquely identify each row of the table, which is a problem if there are multiple rows per SNP -- PLINK will sometimes report multiple TEST rows per SNP.  If this is the case, we need to either establish a composite key using PRIMARY KEY (`snp`,`test`), or simply eliminate the extra rows from the data file using an AWK command.
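For example, a minimal sketch of such a filter, assuming the TEST column is the fifth field and we only want the additive (ADD) model rows:

awk 'NR == 1 || $5 == "ADD"' input-file.assoc.logistic > input-additive.assoc.logistic

This keeps the header line plus one ADD row per SNP and drops the other TEST rows; the output file name here is just illustrative.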

CREATE TABLE `discovery` (
       `chr` varchar(2),
       `snp` varchar(32),
       `bp` int,
       `a1` varchar(1),
       `test` varchar(3),
       `nmiss` int,
       `or` float,
       `se` float,
       `l95` float,
       `u95` float,
       `stat` float,
       `p` float,
       PRIMARY KEY (`snp`)
);

CREATE TABLE `replication` (
       `chr` varchar(2),
       `snp` varchar(32),
       `bp` int,
       `a1` varchar(1),
       `test` varchar(3),
       `nmiss` int,
       `or` float,
       `se` float,
       `l95` float,
       `u95` float,
       `stat` float,
       `p` float,
       PRIMARY KEY (`snp`)
);

Before loading our data into these tables, a little pre-processing is helpful.  To ensure that results are easy to read on the screen, PLINK developers used leading spaces in the column format for many PLINK outputs.  These make loading the results into a database difficult.  We can resolve this by running a simple SED command:
sed -r -e 's/\s+/\t/g' -e 's/^\t//' input-file.assoc.logistic > discovery.load
This will convert each run of whitespace to a single tab, eliminate the leading tab, and write the results to a new file, discovery.load.  Now let's load this file into our table, and repeat the procedure for our replication file.

LOAD DATA LOCAL INFILE '{PathToFile}/discovery.load' INTO TABLE 
discovery FIELDS TERMINATED BY '\t' IGNORE 1 LINES;
Now we should have two MySQL database tables with the discovery and replication results loaded into them.  We can view their contents with a simple SELECT statement.  Then, finally, we can join these two tables to easily compare the results from the discovery and replication analyses.
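For instance, to spot-check the first few rows of each:

SELECT * FROM discovery LIMIT 10;
SELECT * FROM replication LIMIT 10;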


SELECT * FROM discovery INNER JOIN replication ON 
discovery.snp = replication.snp;
The syntax is simple: select a set of fields -- in this case all of them (represented by the *) -- from the first table (discovery), matching each row from this table to a row in the second table (replication) where the discovery SNP equals the replication SNP.  MySQL also supports table aliases, which can make these queries a bit easier to write.  An alias is simply a label specified after a table name that can be used in the rest of the query in place of the full table name.  For example, in the query below, we use a for the discovery table and b for the replication table.

SELECT * FROM discovery a INNER JOIN replication b ON 
a.snp = b.snp;
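As a sketch of the comparisons mentioned at the start (assuming a conventional 0.05 significance threshold), a LEFT JOIN finds discovery SNPs with no match in the replication table, and a WHERE clause on the p-values finds SNPs that were significant in the discovery analysis but not the replication:

SELECT a.snp FROM discovery a LEFT JOIN replication b ON
a.snp = b.snp WHERE b.snp IS NULL;

SELECT a.snp, a.p AS p_discovery, b.p AS p_replication
FROM discovery a INNER JOIN replication b ON a.snp = b.snp
WHERE a.p < 0.05 AND b.p >= 0.05;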
With practice and additional data, join operations can be used to annotate results by gene or region, and to match these to results from other studies, such as the NHGRI GWAS catalog.


Monday, August 12, 2013

Understanding the ENSEMBL Schema

ENSEMBL is a frequently used resource for various genomics and transcriptomics tasks.  The ENSEMBL website and MART tools provide easy access to their rich database, but ENSEMBL also provides flat-file downloads of their entire database and a public MySQL portal.  You can access the latter from MySQL Workbench using the following connection details:
Host:     useastdb.ensembl.org
User:     anonymous
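Equivalently, from the command line with the stock mysql client:

mysql -h useastdb.ensembl.org -u anonymous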
Once inside, you can get a sense for what the ENSEMBL schema (or data model) is like.  First, it’s important to understand the ENSEMBL ID system.  Most of the primary entities in the ENSEMBL database (genes, exons, transcripts, proteins) have a formal, stable identifier (beginning with ENSG, ENSE, ENST, and ENSP, respectively) that does not change from build to build.  These entries can be found in the corresponding *_stable_id tables (gene_stable_id, exon_stable_id, and so on).  All of these entities also have an internal identifier (an integer).  Once you have an internal ID for the entity of interest, details of the entity can be found in the gene, exon, transcript, and translation (protein) tables. For example, the following query will retrieve a list of all transcripts and their exons for a given gene (here filtering on BRCA2's stable ID as an example).
SELECT * FROM gene_stable_id a
inner join gene b on a.gene_id = b.gene_id
inner join transcript c on b.gene_id = c.gene_id
inner join exon_transcript d on c.transcript_id = d.transcript_id
inner join exon e on d.exon_id = e.exon_id
inner join transcript_stable_id f on c.transcript_id = f.transcript_id
inner join exon_stable_id g on e.exon_id = g.exon_id
where a.stable_id = 'ENSG00000139618';
The exon_transcript table contains a mapping of each exon to any transcripts containing it, and also contains a rank to indicate which exon it is relative to a given transcript.  To retrieve exons for a list of genes by their ENSEMBL IDs, these could be loaded into a table and joined to the gene_stable_id table in the query above.  To pull the build 37 chromosome and coordinates for an exon, use the following:
Select a.exon_id, b.name, a.seq_region_start, a.seq_region_end from exon a
inner join seq_region b on a.seq_region_id = b.seq_region_id
inner join coord_system c on b.coord_system_id = c.coord_system_id
where c.version = "GRCh37";
In this query, the seq_region table contains a field called name that identifies the contig to which the coordinates refer, in this case the chromosome number. 

There are also extensive cross-references in the ENSEMBL database.  To retrieve alternate identifiers for a set of transcripts, execute the following: 
select * from transcript_stable_id a
inner join transcript b on a.transcript_id = b.transcript_id
inner join object_xref c on b.transcript_id = c.ensembl_id
inner join xref d on c.xref_id = d.xref_id
inner join external_db e on d.external_db_id = e.external_db_id
where ensembl_object_type = "Transcript"
limit 20;
ENSEMBL organizes cross-references (xrefs) for all entity types into a single table, object_xref.  This table contains an ensembl_object_type field that is “Transcript”, “Gene”, or “Translation”, and an ensembl_id that matches either a gene_id, transcript_id, or translation_id.  Replace “transcript” in the above query with “gene” or “translation” to retrieve gene or protein cross-references; a sketch of the gene version appears after the next query.  A list of all external cross-reference sources can be found by querying: 
Select db_name from external_db;
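And here is the promised sketch of the gene version of the xref query, following the same pattern with the gene tables:

select * from gene_stable_id a
inner join gene b on a.gene_id = b.gene_id
inner join object_xref c on b.gene_id = c.ensembl_id
inner join xref d on c.xref_id = d.xref_id
inner join external_db e on d.external_db_id = e.external_db_id
where ensembl_object_type = "Gene"
limit 20;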
There is a great deal of information within the ENSEMBL database that can be accessed using SQL, which for some types of operations is easier than using the MART or web interface.  Full details of the ENSEMBL schema can be found at http://useast.ensembl.org/info/docs/api/core/core_schema.html.


Tuesday, March 19, 2013

Software Carpentry Bootcamp at University of Virginia

A couple of weeks ago I, with the help of others here at UVA, organized a Software Carpentry bootcamp, instructed by Steve Crouch, Carlos Anderson, and Ben Morris. The day before the course started, Charlottesville was racked by nearly a foot of snow, widespread power outages, and many cancelled incoming flights. Luckily our instructors arrived just in time, and power was (mostly) restored shortly before the bootcamp started. Despite the conditions, the course was very well-attended.

Software Carpentry's aim is to teach researchers (usually graduate students) basic computing concepts and skills so that they can get more done in less time, and with less pain. They're a volunteer organization funded by Mozilla and the Sloan Foundation, and led this two-day bootcamp completely free of charge to us.

The course started out with a head-first dive into Unix and Bash scripting, followed by a tutorial on automation with Make, concluding the first day with an introduction to Python. The second day covered version control with git, Python code testing, and wrapped up with an introduction to databases and SQL. At the conclusion of the course, participants offered near-universal positive feedback, with the git and Make tutorials being exceptionally popular.

Software Carpentry's approach to teaching these topics is unlike many others that I've seen. Rather than lecturing on for hours, the instructors inject very short (~5 minute) partnered exercises between every ~15 minutes of instruction in 1.5 hour sessions. With two full days of intensive instruction and your computer in front of you, it's all too easy to get distracted by an email, get lost in your everyday responsibilities, and zone out for the rest of the session.  The exercises keep participants paying attention and accountable to their partner.

All of the bootcamp's materials are freely available:

Unix and Bash: https://github.com/redcurry/bash_tutorial
Python Introduction: https://github.com/redcurry/python_tutorial
Git tutorial: https://github.com/redcurry/git_tutorial
Databases & SQL: https://github.com/bendmorris/swc_databases
Everything else: http://users.ecs.soton.ac.uk/stc/SWC/tutorial-materials-virginia.zip

Perhaps more relevant to a broader audience are the online lectures and materials available on the Software Carpentry Website, which include all the above topics, as well as many others.

We capped the course at 50, and had 95 register within a day of opening registration, so we'll likely do this again in the future. I sit in countless meetings where faculty lament how nearly all basic science researchers enter grad school or their postdoc woefully unprepared for this brave new world of data-rich high-throughput science. Self-paced online learning works well for some, but if you're in a department or other organization that could benefit from a free, on-site, intensive introduction to the topics listed above, I highly recommend contacting Software Carpentry and organizing your own bootcamp.

Finally, when organizing an optional section of the course, we let participants vote on whether they preferred learning number crunching with NumPy or SQL/databases; SQL won by a small margin. However, Katherine Holcomb in UVACSE has graciously volunteered to teach a two-hour introduction to NumPy this week, regardless of whether you participated in the bootcamp (although some basic Python knowledge is recommended). This (free) short course is this Thursday, March 21, 2-4pm, in the same place as the bootcamp (Brown Library Classroom in Clark Hall). Sign up here.

Monday, May 9, 2011

Accessing Databases From R

Jeffrey Breen put together a useful slideshow on accessing databases from R. I use RODBC every single day to access my own local MySQL server from R. I've had trouble with RMySQL, so I've always used RODBC instead after setting up my localhost MySQL server as a Windows data source. Once you get accustomed to accessing your data directly with SQL queries rather than dumping files, you'll wonder why you waited so long. After you get a handle on SQL you can take it one step further and use SQL queries on data.frames with the sqldf package for quick and easy data management and group summaries (which happens to be way faster than plyr, aggregate, doBy, and data.table for simple grouping tasks).
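As a minimal sketch (the data source name, credentials, and table here are placeholders for whatever you configured):

library(RODBC)

# connect to the ODBC data source you defined for your MySQL server
ch <- odbcConnect("localMySQL", uid = "root", pwd = "password")

# any SQL query comes back as a data.frame
results <- sqlQuery(ch, "SELECT * FROM discovery WHERE p < 0.05;")

odbcClose(ch)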



Also, if you're new to databases, check out Will's previous post on how to store and organize results from a genetic study using MySQL, or take a look at the w3schools SQL tutorial.

Greater Boston useR Group Files (via @RevoDavid)

Tuesday, May 25, 2010

Use SQL queries to manipulate data frames in R with sqldf package

I've covered a few topics in the past including the plyr package, which is kind of like "GROUP BY" for R, and the merge function for merging datasets. I only recently found the sqldf package for R, and it's already one of the most useful packages I've ever installed. The main function in the package is sqldf(), which takes a quoted string as an argument. You can treat data frames as tables as if they were in a relational database. You can use some of the finer aspects of SQL like the INNER JOIN or the subquery, which are extremely difficult operations to mimic using standard R programming. While this isn't an SQL tutorial, try out some of these commands to see what sqldf can do for you. Read more about the sqldf package here.


> # install the package
> install.packages("sqldf")
> 
> #load it
> library(sqldf)
> 
> # set the random seed
> set.seed(42)
> 
> #generate some data
> df1 = data.frame(id=1:10,class=rep(c("case","ctrl"),5))
> df2 = data.frame(id=1:10,cov=round(runif(10)*10,1))
> 
> #look at the data
> df1
   id class
1   1  case
2   2  ctrl
3   3  case
4   4  ctrl
5   5  case
6   6  ctrl
7   7  case
8   8  ctrl
9   9  case
10 10  ctrl
> df2
   id cov
1   1 9.1
2   2 9.4
3   3 2.9
4   4 8.3
5   5 6.4
6   6 5.2
7   7 7.4
8   8 1.3
9   9 6.6
10 10 7.1
> 
> # do an inner join
> sqldf("select * from df1 join df2 on df1.id=df2.id")
   id class id cov
1   1  case  1 9.1
2   2  ctrl  2 9.4
3   3  case  3 2.9
4   4  ctrl  4 8.3
5   5  case  5 6.4
6   6  ctrl  6 5.2
7   7  case  7 7.4
8   8  ctrl  8 1.3
9   9  case  9 6.6
10 10  ctrl 10 7.1
> 
> # where clauses
> sqldf("select * from df1 join df2 on df1.id=df2.id where class='case'")
  id class id cov
1  1  case  1 9.1
2  3  case  3 2.9
3  5  case  5 6.4
4  7  case  7 7.4
5  9  case  9 6.6
> 
> # lots of sql fun
> sqldf("select df1.id, df2.cov as covariate from df1 join df2 on df1.id=df2.id where class='case' and cov>3 order by cov")
  id covariate
1  5       6.4
2  9       6.6
3  7       7.4
4  1       9.1
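Since this post opened by comparing plyr's grouping to SQL, here's one more sketch continuing the session above, using GROUP BY for group summaries (the means follow from the data printed earlier):

> # group summaries, like GROUP BY in a database
> sqldf("select class, avg(cov) as mean_cov from df1 join df2 on df1.id=df2.id group by class")
   class mean_cov
1   case     6.48
2   ctrl     6.26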

Tuesday, April 27, 2010

Discovering New Disease Genes Using Orthologous Phenotypes in Model Organisms

Check out this paper in PNAS and the corresponding synopsis in the New York Times. The authors take a unique approach to finding genes likely to be associated with human traits using orthologous phenotypes in model organisms, or phenologs. The idea is simple. The authors have a database of ~2,000 disease-associated genes in humans. To this database they added another ~200,000 gene-trait associations in model organisms, including mice, yeast, worm, and plants. They then look for human diseases and model-organism phenotypes whose associated gene sets share overlapping orthologous genes; such related phenotype pairs are the orthologous phenotypes, or phenologs. Genes causing a phenotype in a model organism are then predictive of genes causing the phenolog disease in humans. For example, the authors found genes responsible for angiogenesis using yeast, breast cancer associated genes in C. elegans, and even genes responsible for deafness using plants.

I remember seeing a talk about this at this year's Pacific Symposium on Biocomputing. You can learn more about the methodology at phenologs.org, where you can also download all the original data used in the paper and build your own phenolog database, which could be very useful for disease gene prediction or prioritization of GWAS hits for follow-up. 

PNAS: Systematic discovery of nonobvious human disease models through orthologous phenotypes

New York Times: The Search for Genes Leads to Unexpected Places

phenologs.org: Systematic discovery of non-obvious disease models and candidate genes

Thursday, February 11, 2010

Using MySQL to Store and Organize Results from Genetic Analyses

This is the first in a series of posts on how to use MySQL with genetic data analysis. MySQL is a very popular, freely available database management system that is easily installed on desktop computers (or Linux servers). The "SQL" in MySQL stands for Structured Query Language, which is, by my humble estimation, the most standardized way to store and access information in the known universe. SQL is an easy language to learn, and is fundamentally based on set operations such as unions and intersections.

Over the course of this tutorial, I'll illustrate how to use MySQL to load, store, access, and sort SNP statistics computed by PLINK. To begin, you will need to download and install MySQL on your local machine, and you'll also need to install the MySQL GUI tools.

You can think of MySQL as a massive spreadsheet application like Excel: the basic unit within the database is a table, consisting of rows (sometimes called tuples) and columns (generally called fields). A collection of tables is called a schema (the plural form is schemata); schemata are typically used for organizational purposes.

After installing MySQL Server and the MySQL GUI Tools, start the MySQL Query Browser. The Query Browser will prompt you for connection information - this information will depend on where you installed the MySQL Server application. Assuming that you installed the Server on your current computer, you would enter "localhost" as the host. Also, when installing MySQL Server, you were required to enter a password for the "root" account (which is the primary administrative account). Enter "root" as the username and the password you specified during setup.


Now you should see the MySQL Query Browser, which has three main panes: the SQL query area, the schemata list, and the results pane. The schemata list is useful for browsing what the database contains, and allows you to see the field names and tables for each schema. The query editor allows you to enter and execute an SQL statement, and the results of that statement are returned in the results pane, which is essentially a read-only spreadsheet.

Let's start by creating a new schema to hold the tables we will make in this tutorial. In the Query Browser, type:

CREATE DATABASE Sandbox;
Notice the semi-colon at the end of the statement - this tells the SQL engine that it has reached the end of an SQL statement. You can execute this statement in two ways: hit Control-Enter, or click the button with the green lightning bolt. This statement just created a new schema in our database called Sandbox. You can see the new schema in the schemata browser if you right-click inside the schemata tab and choose refresh from the menu. The active schema in the MySQL Query Browser is always shown in bold. You can select a new active schema simply by double-clicking it in the schema browser.
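You can also switch the active schema with a statement rather than the mouse:

USE Sandbox;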

Now let's create a table in the Sandbox schema. If you haven't already, double-click the Sandbox schema in the schemata browser to make it the active schema, then execute the following statement:
CREATE TABLE Tutorial (name varchar(30), address varchar(150), zipcode int);

This statement created a table called Tutorial in the Sandbox schema that has 3 fields: a character-based field with a max of 30 characters called "name", a character-based field with a max of 150 characters called "address", and an integer-based field called "zipcode".

Now let's load some data into this table. Most genetic analysis applications will output a text file with analysis results -- with this in mind, we'll create a text file containing several names, addresses, and zip codes to load into our new table:

Phillip J. Fry,Robot Arms Apartments #455,65774
Zapp Brannigan,Starship Nimbus Captain's Quarters,45542
Doctor Zoidberg,335 Medical Corporation Way,72552
Hubert J. Farnsworth,Planet Express Delivery Company,88754
Copy and paste this text into a text file called "futurama_addresses.txt". Now execute the following statement:

LOAD DATA LOCAL INFILE "C:/futurama_addresses.txt" INTO TABLE Tutorial FIELDS TERMINATED BY ',' (name, address, zipcode);

The keyword LOCAL tells the MySQL browser to use files on the computer currently running the MySQL client (in this case the MySQL query browser) rather than the MySQL Server. The FIELDS TERMINATED BY ',' indicates that the file is comma-delimited. If the file were a tab-separated file, the statement would be FIELDS TERMINATED BY '\t'. The last portion inside parentheses tells the browser the order the fields appear in the text file, so if the zipcode were listed first in the text file, the statement would be (zipcode, name, address). Also, note the use of a forward-slash in the file location -- this is because MySQL designers use proper UNIX file locations rather than the heretic DOS-style format with a back-slash.

Now that the data is loaded, let's confirm that it is there and properly formatted in the database. Execute the following statement:

SELECT * FROM Tutorial;

This will return all rows from the Tutorial table to the results pane of the query browser. You should see all four rows and all three columns of our original text file, now neatly organized in our database.
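You can also pick specific columns and filter rows with a WHERE clause; with our toy data, the following returns the three entries with zip codes above 60000:

SELECT name, zipcode FROM Tutorial WHERE zipcode > 60000;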

In the next post, I'll show you how to do more sophisticated SELECT statements. Please comment if anything is unclear and I'll do my best to clarify.