Thursday, February 11, 2010

Using MySQL to Store and Organize Results from Genetic Analyses

This is the first in a series of posts on how to use MySQL with genetic data analysis. MySQL is a very popular, freely available database management system that is easily installed on desktop computers (or Linux servers). The "SQL" in MySQL stands for Structured Query Language, which is by my humble estimation the most standardized way to store and access information in the known universe. SQL is an easy language to learn, and is fundamentally based on set operations such as unions and intersects.

Over the course of this tutorial, I'll illustrate how to use MySQL to load, store, access, and sort SNP statistics computed by PLINK. To begin, you will need to download and install MySQL on your local machine, and you'll also need to install the MySQL GUI tools.

You can think of MySQL as a massive spreadsheet application like Excel - The basic unit within the database is a table, consisting of rows (sometimes called tuples) and columns (generally called fields). A collection of tables is called a schema (the plural form is schemata) - schemata are typically used for organizational purposes.

After installing MySQL Server and the MySQL GUI Tools, start the MySQL Query Browser. The Query Browser will prompt you for connection information - this information will depend on where you installed the MySQL Server application. Assuming that you installed the Server on your current computer, you would enter "localhost" as the host. Also, when installing MySQL Server, you were required to enter a password for the "root" account (which is the primary administrative account). Enter "root" as the username and the password you specified during setup.


Now you should see the MySQL Query Browser, which has three main panes: the SQL query area, the schemata list, and the results pane. They schemata list is useful for browsing what the database contains, and allows you to see the field names and tables for each schema. The query editor allows you to enter and execute an SQL statement, and the results of that statement are returned in the results pane, which is essentially a read-only spreadsheet.

Lets start by creating a new schema to hold the tables we will make in this tutorial. In the Query Browser, type:

CREATE DATABASE Sandbox;
Notice the semi-colon at the end of the statement - this tells the SQL engine that it has reached the end of an SQL statement. You can execute this statement two ways: 1. Hit Control-Enter, or 2. Click the button with the green lightning bolt. This statement just created a new schema in our database called Sandbox. You can see the new schema in the schemata browser if you right-click inside the schemata tab and choose refresh from the menu. The active schema in the MySQL Query Browser is always shown as bold. You can select a new active schema simply by double-clicking it in the schema browser.

Now let's create a table in the Sandbox schema. If you haven't already, double-click the Sandbox schema in the schemata browser to make it the active schema, then execute the following statement:
CREATE TABLE Tutorial (name varchar(30), address varchar(150), zipcode int);

This statement created a table called Tutorial in the Sandbox schema that has 3 fields: a character-based field with a max of 30 characters called "name", a character-based field with a max of 150 characters called "address", and an integer-based field called "zipcode".

Now let's load some data into this table. Most genetic analysis applications will output a text file with analysis results -- with this in mind, we'll create a text file containing several names, addresses, and zip codes to load into our new table:

Phillip J. Fry,Robot Arms Apartments #455,65774
Zapp Brannigan,Starship Nimbus Captain's Quarters,45542
Doctor Zoidberg,335 Medical Corporation Way,72552
Hubert J. Farnsworth,Planet Express Delivery Company,88754
Copy and paste this text into a text file called "futurama_addresses.txt". Now execute the following statement:

LOAD DATA LOCAL INFILE "C:/futurama_addresses.txt" INTO TABLE Tutorial FIELDS TERMINATED BY ',' (name, address, zipcode);

The keyword LOCAL tells the MySQL browser to use files on the computer currently running the MySQL client (in this case the MySQL query browser) rather than the MySQL Server. The FIELDS TERMINATED BY ',' indicates that the file is comma-delimited. If the file were a tab-separated file, the statement would be FIELDS TERMINATED BY '\t'. The last portion inside parentheses tells the browser the order the fields appear in the text file, so if the zipcode were listed first in the text file, the statement would be (zipcode, name, address). Also, note the use of a forward-slash in the file location -- this is because MySQL designers use proper UNIX file locations rather than the heretic DOS-style format with a back-slash.

Now that the data is loaded, lets confirm that it is there and properly formatted in the database. Execute the following statement:

SELECT * FROM Tutorial;

This will return all rows from the Tutorial table to the results pane of the query browser. You should see all four rows and all three columns of our original text file, now neatly organized in our database.

In the next post, I'll show you how to do more sophisticated SELECT statements. Please comment if anything is unclear and I'll do my best to clarify.

3 comments:

  1. I think that the MySQL GUItools are being streamlined into MySQL Workbench.

    http://wb.mysql.com/

    Workbench 5.2.15-Beta works pretty well, but you might want to hold off until it's out of beta. I've gotten more than one crash trying it out. But once those kinks have been worked out I think it will be a nice integration.

    ReplyDelete
  2. Will, thank you for writing this short tutorial. It's been my starting point for trying out mySQL for genetic database. A lot of things right now seems to be very different from the time you wrote this. And, it took some guesses to match what is shown here with what currently is in mySQL workbench. The GUI seems nicer now, but they still keep that scripting window available. Will there still be any follow up posts after this? I still think it's useful.

    ReplyDelete
  3. Bhoom - not sure if Will's planning on following this up or not, but here are some other resources for you. The MySQL documentation is excellent, and they have some documentation about Workbench. you're right, you probably want to edit the SQL script in Workbench (http://dev.mysql.com/doc/workbench/en/wb-getting-started-tutorial.html)

    Also, I looked around and there's a serious dearth of resources for how to intelligently use a relational database to store genetic data. Will and I are writing a book chapter that covers this and we'd be happy to share once it's published, but for now, check out these slides: http://osastatistician.com/Stat6910/Dataman.pdf

    ReplyDelete

Creative Commons License
Getting Genetics Done by Stephen Turner is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported License.