Hansong Wang, our biostats professor here at the Hawaii Cancer Center, generously gave me some R code that goes through a SNP annotation file (i.e. a mapfile) and selects SNPs that are at least a certain specified distance apart. You might want to do this if you're picking a subset of SNPs for PCA, for instance. Plink has an LD pruning feature, but if you can't load your data into PLINK, this poor-man's-pruning based on physical distance (not LD) is a quick solution.
Provide the function with a data frame containing containing column names "chrom" and "position," where the SNPs are ordered by chromosome and position. By default the function selects SNPs that are at least 100kb apart, but you can change this with the optional second argument. The function returns the row indices corresponding to the SNPs you want to keep. Then simply subset your dataset selecting only those row indices and all columns.
Tuesday, March 29, 2011
Monday, March 21, 2011
RStudio Keyboard Shortcut Reference PDF
I recently started using RStudio, the amazing new IDE for R. You can view all of RStudio's keyboard shortcuts by going to the help menu, but I made this printable reference for myself and thought I'd share it. I only included the Windows shortcuts, and I cut out all the obvious ones (Ctrl-S for save, Ctrl-O for open, etc) so it would fit neatly on one page.
RStudio Keyboard Shortcuts (PDF)
RStudio Keyboard Shortcuts (PDF)
Friday, March 18, 2011
New GenABEL Website, and more *ABEL software
The *ABEL suite of R packages and software for genetic analysis has grown substantially since the appearance of GenABEL and the previously mentioned ProbABEL R packages. There are now a handful of useful R packages and other software utilities facilitating genome-wide association studies, analysis of imputed data, meta-analysis, efficient data storage, prediction, parallelization, and mixed model methods. The new GenABEL website also has prominent links to extensive documentation in manuals and tutorials. Finally, there's a new forum you can visit if you can't find an answer in the manuals or tutorials.
The GenABEL Project for Statistical Genetics Analysis
The GenABEL Project for Statistical Genetics Analysis
Tuesday, March 15, 2011
Its like Photoshop for Graphs
Thanks to links from Sarah Pendergrass, I stumbled upon this awesome program for graph visualization and analysis called Gephi. It seems rather feature rich, with built in connectors to database systems, extensive graph coloring, layout, and rendering features, and several analysis tools. Gephi is an open-source project.
(http://gephi.org/)
To see an interactive example, check out the Diseaseome (http://diseasome.eu/map.html)
Wednesday, March 9, 2011
Forest plots using R and ggplot2
Abhijit over at Stat Bandit posted some nice code for making forest plots using ggplot2 in R. You see these lots of times in meta-analyses, or as seen in the BioVU demonstration paper. The idea is simple - on the x-axis you have the odds ratio (or whatever stat you want to show), and each line is a different study, gene, SNP, phenotype, etc. You draw a dot for the odds ratio point estimate, and lines extending from that point showing the confidence limits for that odds ratio.
Here's the R code, slightly modified from Abhijit's version:
And here's what you get:
Thanks for sharing the code, Abhiji!
Here's the R code, slightly modified from Abhijit's version:
And here's what you get:
Thanks for sharing the code, Abhiji!
Tags:
ggplot2,
R,
Visualization
Tuesday, March 8, 2011
Splitting a Dataset Revisited: Keeping Covariates Balanced Between Splits
In my previous post I showed you how to randomly split up a dataset into training and testing datasets. (Thanks to all those who emailed me or left comments letting me know that this could be done using other means. As things go with R, it's sometimes easier to write a new function yourself than it is to hunt down the function or package that already exists.)
What if you wanted to split a dataset into training/testing sets but ensure that there are no significant differences between a variable of interest across the two splits?
For example, if we use the splitdf() function from last time to split up the iris dataset, setting the random seed to 44, it turns out the outcome variable of interest, Sepal.Length, differs significantly between the two splits.
What if we wanted to ensure that the means of Sepal.Length, as well as the other continuous variables in the dataset, do not differ between the two splits?
Again, this is probably something that's already available in an existing package, but I quickly wrote another function to do this. It's called splitdf.randomize(), which depends on splitdf() from before. Here, you give splitdf.randomize() your data frame you want to split, and a character vector containing all the columns you want to keep balanced between the two splits. The function is a wrapper for splitdf(). It randomly makes a split and does a t-test on each column you specify. If the p-value on that t-test is less than 0.5 (yes, 0.5, not 0.05), then the loop will restart and try splitting the dataset again. (Currently this only works with continuous variables, but if you wanted to extend this to categorical variables, it wouldn't be hard to throw in a fisher's exact test in the while loop)
For each iteration, the function prints out the p-value for the t-test on each of the variable names you supply. As you can see in this example, it took four iterations to ensure that all of the continuous variables were evenly distributed among the training and testing sets. Here it is in action:
What if you wanted to split a dataset into training/testing sets but ensure that there are no significant differences between a variable of interest across the two splits?
For example, if we use the splitdf() function from last time to split up the iris dataset, setting the random seed to 44, it turns out the outcome variable of interest, Sepal.Length, differs significantly between the two splits.
splitdf <- function(dataframe, seed=NULL) { if (!is.null(seed)) set.seed(seed) index <- 1:nrow(dataframe) trainindex <- sample(index, trunc(length(index)/2)) trainset <- dataframe[trainindex, ] testset <- dataframe[-trainindex, ] list(trainset=trainset,testset=testset) } data(iris) s44 <- splitdf(iris, seed=44) train <- s1$trainset test <- s1$testset t.test(train$Sepal.Length, test$Sepal.Length)
What if we wanted to ensure that the means of Sepal.Length, as well as the other continuous variables in the dataset, do not differ between the two splits?
Again, this is probably something that's already available in an existing package, but I quickly wrote another function to do this. It's called splitdf.randomize(), which depends on splitdf() from before. Here, you give splitdf.randomize() your data frame you want to split, and a character vector containing all the columns you want to keep balanced between the two splits. The function is a wrapper for splitdf(). It randomly makes a split and does a t-test on each column you specify. If the p-value on that t-test is less than 0.5 (yes, 0.5, not 0.05), then the loop will restart and try splitting the dataset again. (Currently this only works with continuous variables, but if you wanted to extend this to categorical variables, it wouldn't be hard to throw in a fisher's exact test in the while loop)
For each iteration, the function prints out the p-value for the t-test on each of the variable names you supply. As you can see in this example, it took four iterations to ensure that all of the continuous variables were evenly distributed among the training and testing sets. Here it is in action:
Tags:
Machine Learning,
R,
Statistics
Thursday, March 3, 2011
Genetic Alliance Celebrates its 25th Year
Genetic Alliance is a nonprofit health advocacy organization that improves health through the authentic engagement of communities and individuals. This year, they are celebrating their 25th anniversary, and they're hosting a variety of events throughout the year, including monthly salons around the country and the 25th Anniversary Annual Conference in June. If you cannot attend an event in person, you can still learn all about genetics, health, and advocacy with their webinar series. I've watched a few of these webinars in the past, including one on the Myriad gene patent case featuring John Conley from Genomics Law Report. In addition, they are honoring innovators in the genetics community and post new videos weekly.
Read more to find out how to get involved: http://www.geneticalliance.org/25anniversary.
Tags:
Announcements,
Policy
Subscribe to:
Posts (Atom)