Tuesday, April 12, 2011

Using R + Bioconductor to Get Flanking Sequence Given Genomic Coordinates

I'm working on a project using next-gen sequencing to fine-map a genetic association in a gene region. Now that I've sequenced the region in a small sample, I'm picking SNPs to genotype in a larger sample. When designing the genotyping assay the lab will need flanking sequence.

This is easy to get for SNPs in dbSNP, but what about novel SNPs? Specifically, given a list of genomic positions where about half of them are potentially novel SNPs in the population I'm studying, how can I quickly and automatically fetch flanking sequence from the reference sequence? I asked how to do this on previously mentioned Biostar, and got several answers within minutes.

Fortunately this can easily be done using R and Bioconductor. For several reasons BioMart wouldn't do what I needed it to do, and I don't know the UCSC/Ensembl schemas well enough to use a query or the Perl API.

This R code below (modified from Adrian's answer on BioStar) does the trick:


  1. Looks like it will work. You need to make sure that the genomic coordinates between the snp mapping from NGS match those in the reference assembly and the mapping was actually valid. This can sometimes be a problem, but I am sure you verified that before even asking the question.

  2. Hi Will,

    I tried the BSgenome approach to import the horse genome but apparently it isn't there just yet in the BSgenome genomes database. Is there a way I can import ucsc horse genomic track given the coordinates.!?

    Your help with this is highly appreciated. Thank you!:-)


Note: Only a member of this blog may post a comment.

Creative Commons License
Getting Genetics Done by Stephen Turner is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported License.