Getting Genetics Done: Merging data from different files using R

Thursday, April 30, 2009

Merging data from different files using R

A reader asked yesterday how you would merge data from two different files. For example, let's say you have a ped file with genotypes for individuals, and another file that had some other data for some of the individuals in the pedfile, like clinical or environmental covariates. How would you go about automatically putting the clinical data from file 2 into the appropriate rows in file 1? Without using a database, the easiest way is probably using the "merge" function in R that will do the trick with a one-line command. Here's a short tutorial to get you started.

First, start up R on the cheeses simply by typing in the uppercase letter R and hit enter. Alternatively you could use R if it's installed on your computer. Now type in this command to view a small dataset I made up. It's not a real pedfile, but it will do the trick. Make sure to put it all on one line:

read.table("http://people.vanderbilt.edu/~stephen.turner/ggd/ped.txt")

You should see the dataset displayed:

indiv_id SNP1 SNP2
1 1 1 1
2 2 1 1
3 3 0 0
4 4 1 0
5 5 1 1
6 6 1 0
7 7 1 1
8 8 0 1
9 9 1 1
10 10 1 1

We have 10 individuals numbered 1-10 in a column called "indiv_id" that have genotype information at 2 loci. Now let's view another dataset that has fake clinical covariates for a few of those individuals:

read.table("http://people.vanderbilt.edu/~stephen.turner/ggd/meta.txt")

Again, you will see the dataset displayed.

indiv_id cov1 cov2
1 1 1.14 74.6
2 3 4.50 79.4
3 4 0.80 48.2
4 6 1.39 68.1
5 7 3.89 84.8
6 8 0.42 86.1

In this dataset, we have clinical information on the individuals in the ped file above, but only on some of the individuals. Now, let's merge what information we have here with the ped file above.

But first, since all we did above was view the datasets, let's run those commands again, but this time storing the tables in a "data frame" (R parlance for dataset). Type in these two commands. The " <- " is an assignment. It means create a data frame called ped from the following table from the internet, and the same with a data frame called meta.

ped <- read.table("http://people.vanderbilt.edu/~stephen.turner/ped.txt")

meta <- read.table("http://people.vanderbilt.edu/~stephen.turner/meta.txt")

Now if you just type in the name of the data frame, e.g. ped, and hit enter, you will see the data frame displayed. Now it's time for the real magic. Type the following command, and I'll explain what it does below:

merge(ped,meta,by="indiv_id",all=TRUE)

The output should look like this:

indiv_id SNP1 SNP2 cov1 cov2
1 1 1 1 1.14 74.6
2 2 1 1 NA NA
3 3 0 0 4.50 79.4
4 4 1 0 0.80 48.2
5 5 1 1 NA NA
6 6 1 0 1.39 68.1
7 7 1 1 3.89 84.8
8 8 0 1 0.42 86.1
9 9 1 1 NA NA
10 10 1 1 NA NA

This command uses R's "merge" function to combine the data frames you created above, ped and meta, into a single data frame. It does this by joining the two data frames on a common variable called "indiv_id". The all=TRUE option tells R to join the two data frames, even if one has fewer entries than another. You'll see the NA's listed for a few individuals. NA is R's way of coding missing data. If you didn't include the all=TRUE, observations with any missing data for any variable would be deleted!

Finally, if you want to save this data to a file, re-run the command above, assigning it to a variable, then use the write.table command to write it to your home directory.

superpedfile <- merge(ped,meta,by="indiv_id",all=TRUE)

write.table(superpedfile,file="~/superped.txt")

To exit R, type quit(). To do this on your own data, see the help on read.table by typing ?read.table at the R command line. If anybody knows of a better way to do this post it in the comments!

20 comments:

Stephen TurnerApril 30, 2009 at 9:38 AM
I forgot to mention that R's way of encoding missing data, with "NA" is usually not acceptable to other programs. Type in ?write.table at the R command line to get help on this. You'll need to use the na="?" option, where the ? here stands for the character you want to use to represent missing data.
ReplyDelete
Replies
AnonymousMay 19, 2009 at 7:41 AM
This is by far the simplest explanation I have seen about how to use merge, and it did exactly what I wanted. Thank you!
ReplyDelete
Replies
AnonymousJune 22, 2009 at 2:55 PM
This is great trick, it solved my problem paritally. What if the SNP details for 10 individuals are in rows rather than as they are in the first file above? How would you merge the two files & display SNP information for the individuals where SNPs are arranged in rows as in the output?

Thanks
ReplyDelete
Replies
Stephen TurnerJune 22, 2009 at 3:04 PM
Look up the "reshape" command, that might help.

http://www.google.com/search?q=r%20reshape
ReplyDelete
Replies
AnonymousJune 24, 2009 at 4:05 AM
Thanks for this explanation! One question: How do I define what variable to merge on if my columns are not labeled? Can I use column number?

Thanks
ReplyDelete
Replies
Stephen TurnerJune 24, 2009 at 6:59 AM
They usually have names, even if they're just V1, V2, V3... Try the command names(yourdataframe) to get the column names. Also look at the help for merge by typing ?merge. You may have to use the by.x and by.y commands if the columns aren't named the same in both dataframes.
ReplyDelete
Replies
AnonymousJanuary 29, 2010 at 9:10 PM
Hi,

If I have two RData files, how can I merge them? Thanks!
ReplyDelete
Replies
AnonymousFebruary 26, 2010 at 6:22 AM
Hi,
Is it possible to merge two files by a column 1 (ID) but replace lets say column 6 of one file with column 2 of another file. I was using plink plugin to extract my .ped file from genomestudio but for some reason column 6 is where the phenotype status should be but everytime I just have values of 0 instead of 1/0 for case control status and I wanted to know if I could merge the phenotype in the column 6 later on in R or something to have a complete .ped file before i can use plink. Any help would be greatly appreciated! Otherwise any help on how I can get column 6 as the phentoype in my .ped file from the plugin would also be helpful, i've tried but it's always missing.
ReplyDelete
Replies
Stephen TurnerFebruary 26, 2010 at 8:32 AM
Hey - if you are familiar with plink this is much easier than would be in R. First make your pedfile with all 0's for the trait. Then call plink using the --pheno option where you have a separate phenotype file with family id, individual id, and the phenotype. You can read up on the format of this file on the plink website. you can also use multiple phenotypes in analysis with --pheno in combination with --all-pheno. you can also write out a new pedfile with the phenotype you've chosen using --recode or --make-bed. Hope this helps.
ReplyDelete
Replies
AnonymousMarch 10, 2010 at 6:09 AM
Thanks for that will definately give that a go!
ReplyDelete
Replies
AnonymousNovember 15, 2010 at 9:05 AM
I use gPLINK successfully to most analyses but have problems merging to binary filesets due to error report: mis-matching SNPs --check +/- strands?

For which I find no gPLINK command / operation. Is there any?
ReplyDelete
Replies
AnonymousJanuary 5, 2011 at 9:53 AM
Hi,

Thanks, this has been very useful! I have one more question, how do i merge 2 data sets with identical variables adding cases.
ReplyDelete
Replies
CristinaMay 23, 2011 at 7:50 AM
Great easy explanation!!!! Thankyou!!!
ReplyDelete
Replies
AnonymousDecember 15, 2011 at 1:59 AM
Thank you so much ... This is just what i needed
ReplyDelete
Replies
AnonymousFebruary 8, 2012 at 3:36 AM
Thanks for the great explanation. Is it also possible to merge 3 files at once? I tried to do this by including an extra by.z="indiv_id" into the merge function. But this doesn't work. I probably need to do it in 2 steps?
Thanks in advance!
ReplyDelete
Replies
Stephen TurnerFebruary 8, 2012 at 7:43 AM
probably should use a nested merge. E.g:

merge(dat3, merge(dat1,dat2,by="id",all=T), by="id", all=T)
ReplyDelete
Replies
Stephen TurnerFebruary 8, 2012 at 7:44 AM
To add to that, as always, you should create a fake dataset to test this out to make sure it worked as you expected before moving to a larger dataset where problems may be more insidious.
ReplyDelete
Replies
Duy Ngoc DoMay 31, 2012 at 5:12 AM
thanks, it is a simple explanation, easy to understand. However, concerning the big database, for instance, I got records from feeding machine for feed intake and time spending in feeders of pigs. Everyday, I got one Zip file containing information about animal ID, feed consumption, time spending...Data recorded over many years. I am looking for the scripts to merge all information from different zip file.
Thank again
ReplyDelete
Replies
rahaSeptember 8, 2012 at 1:42 PM
nicely written! thanks
ReplyDelete
Replies
bijalDecember 27, 2012 at 10:02 AM
hey anyone guide me how to make .ped file
ReplyDelete
Replies

Add comment

Note: Only a member of this blog may post a comment.

This blog has moved!

Thursday, April 30, 2009

Merging data from different files using R

20 comments: