Tuesday, August 3, 2010

Convert PLINK output to tab or comma delimited CSV using Perl

Last week Will showed you a bash script version of a sed command covered here a while back that would convert PLINK output from the default variable space-delimited format to a more database-loading-friendly tab or comma delimited file. A commenter asked how to do this on windows, so I'll share the way I do this using a perl script which you can use on windows after installing ActivePerl. First copy the code below and save the file as cleanplink.pl somewhere in your path.

#!/usr/bin/perl

# cleanplink.pl
# (c) Stephen D. Turner 2010 http://www.stephenturner.us/
# This is free open-source software.
# See http://gettinggeneticsdone.blogspot.com/p/copyright.html

my $help = "\nUsage: $0 <input whitespace file> <tab or comma>\n\n";
die $help if @ARGV<2;

$delimiter=pop(@ARGV);
die $help unless ($delimiter=~/tab/i|$delimiter=~/comma/i);
@inputfiles=@ARGV;

if ($delimiter =~ /comma/i) {
    foreach (@inputfiles) {

        open (IN,"<$_");
        open (OUT,">$_.csv");
        while (<IN>) {
            chomp;
            $_ =~ s/^\s+//;  #Trim whitespace at beginning
            $_ =~ s/\s+$//;  #Trim whitespace at end
            $_ =~ s/\s+/,/g; #Remaining whitespace into commas
            #$_ =~ s/NA/-9/g;#If you want to recode NA as -9
            print OUT "$_\n";
        }
    }
} elsif ($delimiter =~ /tab/i) {
    foreach (@inputfiles) {
        open (IN,"<$_");
        open (OUT,">$_.tab");
        while (<IN>) {
            chomp;
            $_ =~ s/^\s+//;  #Trim whitespace at beginning
            $_ =~ s/\s+$//;  #Trim whitespace at end
            $_ =~ s/\s+/\t/g;#Remaining whitespace into commas
            #$_ =~ s/NA/-9/g;#If you want to recode NA as -9
            print OUT "$_\n";
        }
    }
} else {
    die $help;
}
Run the program with the first argument(s) as the PLINK output file(s) you want to convert, and the last argument being either "comma" or "tab" without the quotes. It'll create another file in the current directory ending with either .csv or .tab. Look below to see cleanplink.pl in action.

turnersd@provolone:~/tmp$ ls
plink.qassoc
turnersd@provolone:~/tmp$ cat plink.qassoc
 CHR         SNP         BP    NMISS       BETA         SE         R2        T            P
   1   rs3094315     742429     3643    -0.2461     0.2703  0.0002275  -0.9102       0.3628
   1  rs12562034     758311     3644    -0.1806     0.3315  8.149e-05  -0.5448       0.5859
   1   rs3934834     995669     3641    0.04591     0.2822  7.271e-06   0.1627       0.8708
   1   rs9442372    1008567     3645     0.1032     0.2063  6.868e-05   0.5002       0.6169
   1   rs3737728    1011278     3644     0.1496     0.2268  0.0001195   0.6598       0.5094
   1   rs6687776    1020428     3645    -0.5378     0.2818   0.000999   -1.909      0.05639
   1   rs9651273    1021403     3643     0.2002     0.2264  0.0002149   0.8847       0.3764
   1   rs4970405    1038818     3645    -0.4994     0.3404  0.0005903   -1.467       0.1425
   1  rs12726255    1039813     3645    -0.4515     0.2956  0.0006398   -1.527       0.1268
turnersd@provolone:~/tmp$ cleanplink.pl plink.qassoc comma
turnersd@provolone:~/tmp$ ls
plink.qassoc  plink.qassoc.csv
turnersd@provolone:~/tmp$ cat plink.qassoc.csv
CHR,SNP,BP,NMISS,BETA,SE,R2,T,P
1,rs3094315,742429,3643,-0.2461,0.2703,0.0002275,-0.9102,0.3628
1,rs12562034,758311,3644,-0.1806,0.3315,8.149e-05,-0.5448,0.5859
1,rs3934834,995669,3641,0.04591,0.2822,7.271e-06,0.1627,0.8708
1,rs9442372,1008567,3645,0.1032,0.2063,6.868e-05,0.5002,0.6169
1,rs3737728,1011278,3644,0.1496,0.2268,0.0001195,0.6598,0.5094
1,rs6687776,1020428,3645,-0.5378,0.2818,0.000999,-1.909,0.05639
1,rs9651273,1021403,3643,0.2002,0.2264,0.0002149,0.8847,0.3764
1,rs4970405,1038818,3645,-0.4994,0.3404,0.0005903,-1.467,0.1425
1,rs12726255,1039813,3645,-0.4515,0.2956,0.0006398,-1.527,0.1268
turnersd@provolone:~/tmp$ cleanplink.pl plink.qassoc tab
turnersd@provolone:~/tmp$ ls
plink.qassoc  plink.qassoc.csv  plink.qassoc.tab
turnersd@provolone:~/tmp$ cat plink.qassoc.tab
CHR     SNP     BP      NMISS   BETA    SE      R2      T       P
1       rs3094315       742429  3643    -0.2461 0.2703  0.0002275       -0.9102 0.3628
1       rs12562034      758311  3644    -0.1806 0.3315  8.149e-05       -0.5448 0.5859
1       rs3934834       995669  3641    0.04591 0.2822  7.271e-06       0.1627  0.8708
1       rs9442372       1008567 3645    0.1032  0.2063  6.868e-05       0.5002  0.6169
1       rs3737728       1011278 3644    0.1496  0.2268  0.0001195       0.6598  0.5094
1       rs6687776       1020428 3645    -0.5378 0.2818  0.000999        -1.909  0.05639
1       rs9651273       1021403 3643    0.2002  0.2264  0.0002149       0.8847  0.3764
1       rs4970405       1038818 3645    -0.4994 0.3404  0.0005903       -1.467  0.1425
1       rs12726255      1039813 3645    -0.4515 0.2956  0.0006398       -1.527  0.1268

3 comments:

  1. perl -lpe 's/"/""/g; s/^|$/"/g; s/\t/","/g < input > output

    http://stackoverflow.com/questions/2535255/fastest-way-convert-tab-delimited-file-to-csv-in-linux

    That should work on Windows too. If you quote everything you wouldn't need to trim spaces. And for Linux/Mac

    tr \\t , < input > output

    although you will need some modifications to trim spaces as this is not quoted

    ReplyDelete
  2. Extremely useful, many many thanks!!!

    ReplyDelete
  3. Nice! PLINK fixed-width format is awful. Thanks!!!

    ReplyDelete

Note: Only a member of this blog may post a comment.

Creative Commons License
Getting Genetics Done by Stephen Turner is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported License.