For a few years now, my EvoSTAR colleague, Bill Langdon, has been exploring the degree to which Mycoplasma bacteria have contaminated experimental systems and even "infected" online databases with the contents of their genomes. He and his colleagues have previously shown that Mycoplasma genome sequences have been mislabeled as human sequences in several online resources (GenBank, dbEST, and RefSeq).
Early microarray designs were based largely on ESTs from these resources, and as a result, the Affymetrix HG-U133 plus 2.0 array contains probes for Mycoplasma sequences. Details for these probes can be found here. Exploiting these probes, Bill and colleagues have also examined the Gene Expression Omnibus for evidence of Mycoplasma contamination, and found around 30 studies (roughly 1% of GEO) that show high expression for this probe, the vast majority of which were from cell cultures.
Bill has playfully described Mycoplasma, with their proclivity to infect human experimental cell lines, as having evolved the ability to transmit their genes into online databases.
Continuing this pursuit, Bill recently published an article in BMC BioData Mining illustrating Mycoplasma contamination of the 1000 Genomes Project. It is unclear what the implications of this contamination are for the integrity of 1000 Genomes data, as the majority of identified Mycoplasma reads do not map to the human reference genome. This work should, however, serve as a bellwether to those performing experiments, or using experimental data from treated cell lines. In these situations, any contamination might severely taint experimental results.
Showing posts with label News. Show all posts
Tuesday, May 6, 2014
Tuesday, April 27, 2010
Discovering New Disease Genes Using Orthologous Phenotypes in Model Organisms
Check out this paper in PNAS and the corresponding synopsis in the New York Times. The authors take a unique approach to finding genes likely to be associated with human traits using orthologous phenotypes in model organisms, or phenologs. The idea is simple. The authors have a database of ~2000 disease-associated genes in humans. To this database they added another ~200,000 gene-trait associations in model organisms including mice, yeast, worm, and plants. Then they look for overlapping sets of orthologous genes from these organisms to identify phenotypes in the model organisms. The related genes causing orthologous phenotypes, or phenologs, are predictive of genes causing disease in humans. For example, the authors found genes responsible for angiogenesis using yeast, breast cancer-associated genes in C. elegans, and even genes responsible for deafness using plants.
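To make the "overlapping sets of orthologous genes" idea concrete, here is a minimal sketch in R of one common way to score such an overlap: a hypergeometric test. Whether this matches the authors' exact procedure is an assumption on my part, and every number and variable name below is made up for illustration, not taken from the paper.

```r
# All counts below are hypothetical; they are NOT from the paper.
n_orthologs <- 5000  # genes with an ortholog in both species
n_human     <- 40    # of those, genes linked to the human disease
n_model     <- 60    # of those, genes linked to the model-organism phenotype
n_overlap   <- 6     # genes that appear in both sets

# Overlap we would expect by chance if the two gene sets were unrelated
expected_overlap <- n_human * n_model / n_orthologs

# Probability of seeing an overlap of n_overlap or more by chance
# (upper tail of the hypergeometric distribution)
p_overlap <- phyper(n_overlap - 1, n_human, n_orthologs - n_human, n_model,
                    lower.tail = FALSE)

expected_overlap
p_overlap
```

An overlap far larger than the chance expectation (and a correspondingly small tail probability) is the kind of signal that would flag a disease/phenotype pair as a candidate phenolog.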
I remember seeing a talk about this at this year's Pacific Symposium on Biocomputing. You can learn more about the methodology at phenologs.org, where you can also download all the original data used in the paper and build your own phenolog database, which could be very useful for disease gene prediction or prioritization of GWAS hits for follow-up.
PNAS: Systematic discovery of nonobvious human disease models through orthologous phenotypes
New York Times: The Search for Genes Leads to Unexpected Places
phenologs.org: Systematic discovery of non-obvious disease models and candidate genes
Tags:
News,
Recommended Reading,
SQL
Tuesday, March 30, 2010
Federal Courts Invalidate Myriad's Breast Cancer Gene Patents
A District Court handed down a summary judgment invalidating most of Myriad's claims to both the BRCA1 DNA sequence and the method of testing for early-onset familial breast and ovarian cancer. See Genetic Future and Genomics Law Report for analysis.
Tags:
News,
Noteworthy blogs,
Policy
Thursday, March 18, 2010
Francis Collins: Computational biologists are "breakthrough artists"
Just caught this on the OpenHelix Blog. In an interview with Charlie Rose, NIH director Francis Collins said computational biologists will be the "breakthrough artists" of the future.
CHARLIE ROSE: You have said if you were starting over you would be a computational biologist.
FRANCIS COLLINS: I did say that. I still say that. Computational biologists are having a really good time and it’s going to get better.
CHARLIE ROSE: Their day is coming?
FRANCIS COLLINS: Their day is here, but it’s going to be even more here in a few years. So what do they do? They are people who are jointly trained in studying biology in all of its complexes, but they’re also very capable at computation analysis of huge data sets, because — in part because of NIH and the ethic that was adopted by the genome project, huge amounts of data are being made publicly accessible everyday about all kinds of disease questions.
CHARLIE ROSE: So they’re going to be the break through artists of the future?
FRANCIS COLLINS: They’re going to be the breakthrough artists...
Tags:
News
Monday, November 23, 2009
NYT: SAS threatened by R
The New York Times had an interesting piece yesterday about how SAS is facing several business threats from companies like the recently IBM-acquired SPSS, and from burgeoning interest in open-source software like R. The NYT ran an entire article about R earlier this year, and this article discusses how SAS has been revamping their technology to work seamlessly with R code in response to R's growing popularity in academia and other research labs.
NYT: At a Software Powerhouse, the Good Life Is Under Siege
NYT Slideshow: At SAS, Taking Care of Employees Is Good Business
Tags:
News,
R,
Recommended Reading
Tuesday, August 25, 2009
Great Quote from Our Own Doug Mortlock
http://www.harvardscience.harvard.edu/blog/how-dachshund-lost-its-legs
'It’s stunning to see a genetic modification like this,' developmental geneticist Douglas Mortlock of Vanderbilt University in Nashville, Tenn., says of the new study, published online July 16 in Science. 'This is the gene that makes wiener dogs short-legged.'
Tags:
News
Thursday, August 13, 2009
Beautiful Info-graphic
While not directly related to genetics, this is an excellent example of well-designed data representation. The New York Times reports the results of a survey of average time spent on various activities through the day by different groups of people.

The graphic is essentially a stacked density plot with time (24 hours) on the X-axis. Clicking on a different group of individuals provides a very smooth transition to the new density distribution, allowing an animated visual comparison. In some ways, this animated version provides an easier comparison than showing multiple versions of the same figure. Furthermore, there is just something compelling about this figure that begs you to examine it more closely...
http://www.nytimes.com/interactive/2009/07/31/business/20080801-metrics-graphic.html?scp=3&sq=infographic&st=cse

Tags:
News,
Visualization
Next-Gen Sequencing
Logan recently emailed me an article in the New York Times about single-molecule DNA sequencing and I realized I knew next to nothing about the new and emerging technology that will change the way we do association studies (that is, if we're still even trying to find genetic associations in the first place). The Wellcome Trust posted a news feature a few weeks back giving brief explanations and short videos on DNA sequencing, starting with the old Sanger method, then the second generation 454 and Illumina (Solexa) technologies. They also give a quick overview of and link to some of the 3rd generation technologies in the pipeline, including Pac Bio, Oxford Nanopore, and Complete Genomics.
Wellcome Trust feature on next-gen sequencing
Tags:
News,
Sequencing
Thursday, August 6, 2009
For Today’s Graduate, Just One Word: Statistics
That's the title of a good article published yesterday in the New York Times about how statisticians are in huge demand in the job market, with statistics becoming "the sexy job in the next 10 years" as Google's chief economist puts it. Now I just need to find one of these don't drink and derive t-shirts...
For Today’s Graduate, Just One Word: Statistics
Tags:
News
Monday, July 13, 2009
23andMe Research Revolution
Still, it's worth keeping an eye on the new "democratizing research" approach that 23andMe is taking. Current 23andMe customers are invited to participate in any of these research projects, and their enrollment counts as a "vote" towards starting a research project on that particular phenotype. They're also offering a basic version of their personal genome services for $99 for new customers interested in participating. Daniel MacArthur at Genetic Future wrote an interesting piece on 23andMe's latest venture - an article definitely worth reading, and the comments also. In it he suggests that
...academics need to take heed of the model the company is pursuing. It's likely that over the next few years the current model for returning research data to participants - i.e. don't - will become increasingly unpopular with potential research subjects, and indeed I'd argue that this model has always bordered on the unethical. Finding realistic ways of presenting large-scale genetic data to research participants is something that academic researchers will need to sort out soon, one way or another - and those that do it well, I suspect, will find it much easier to recruit and maintain their research cohorts.
Keep an eye out here for more news on the personal genomics scene.
Tags:
News
Thursday, July 9, 2009
Obama names Francis Collins NIH Director
In his statement, Obama noted: "The National Institutes of Health stands as a model when it comes to science and research. My administration is committed to promoting scientific integrity and pioneering scientific research and I am confident that Dr. Francis Collins will lead the NIH to achieve these goals. Dr. Collins is one of the top scientists in the world, and his groundbreaking work has changed the very ways we consider our health and examine disease. I look forward to working with him in the months and years ahead."
Tags:
News
Tuesday, June 16, 2009
NYT: In Simulation Work, the Demand Is Real
The New York Times published this interesting article on how the ability to design and perform computer simulations is a highly marketable skill for careers across many disciplines.
In methodology development we use simulation nearly every day. We've developed our own specialized genetic data simulation software, genomeSIMLA, that's freely available here by request for PC, Mac, and Linux.
But if you have R on your computer (get it free here), here's how to do a really simple Monte Carlo simulation to determine the power of a one-sample t-test.
First, fire up R and type this command:
rnorm(100)
That command generates 100 random numbers drawn from a standard normal distribution, mean=0, sd=1. Now type this:
rnorm(100,mean=2,sd=7)
That also draws 100 random numbers from a normal distribution, but this time the mean is 2 and the standard deviation is 7. You can also get the same results by just typing this:
rnorm(100,2,7)
Now, let's do a one-sample t-test:
t.test(rnorm(100,2,7))
That command performs a one-sample t-test on the 100 values drawn from a normal distribution with mean=2 and sd=7. Remember, the null hypothesis of a one-sample t-test is usually "the true mean is equal to zero". So if the p-value is less than .05, we would typically reject this null, and say that the mean is significantly different from zero.
Now, we knew that the mean was different from zero, because we said draw from a distribution with mean=2. But if this was the case and we only drew 100 samples, how likely is it that we would detect a difference? That's the power of the test - given that the null is false, how likely is it that we reject the null hypothesis?
One way we can answer this question is with a simulation.
First, let's type the same command, but just get ONLY the p-value from the t-test:
t.test(rnorm(100,2,7))$p.value
Was it less than .05? Try typing it again (you can hit the up arrow key to bring up the last command in R, just like on the Linux command line). It will be different because we have a different set of 100 observations. Type it in over and over again. Sometimes it will be less than .05, other times it won't be. Let's do this 1000 times, and see how often it is less than .05. Let's use the replicate command:
replicate(1000,t.test(rnorm(100,2,7))$p.value)
That simulates doing the t-test 1000 times, and gives you the p-value from each one.
Now, let's do a logical test to see which of those are less than .05:
replicate(1000,t.test(rnorm(100,2,7))$p.value)<0.05
If you typed that in you'll see lots of TRUEs and FALSEs. TRUE means that the p-value of the t-test on that particular sample was less than .05. Now, when you do arithmetic on them, R treats TRUE as 1 and FALSE as 0. So if we take the average of all 1000 of these, that will tell us the proportion of times out of 1000 trials that the p-value of the one-sample t-test was less than .05:
mean(replicate(1000,t.test(rnorm(100,2,7))$p.value)<0.05)
When I did this the power was right around 80%. If you do this again it will be slightly different because remember we are sampling randomly so the results will vary slightly!
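If you want the same answer every time, and some idea of how much the estimate wobbles from run to run, here is a minimal sketch. The seed value and the variable names (n_sims, p_values, power_hat, mc_se) are my own arbitrary choices, and the standard error uses the usual binomial approximation for a proportion.

```r
# Fix the random seed so the same random numbers are drawn on every run
# (the value 42 is arbitrary)
set.seed(42)

n_sims <- 1000  # number of simulated experiments

# One p-value per simulated experiment: 100 draws from Normal(mean=2, sd=7)
p_values <- replicate(n_sims, t.test(rnorm(100, 2, 7))$p.value)

# Estimated power: proportion of simulated experiments with p < 0.05
power_hat <- mean(p_values < 0.05)

# Monte Carlo standard error of that proportion (binomial approximation)
mc_se <- sqrt(power_hat * (1 - power_hat) / n_sims)

power_hat
mc_se
```

With 1000 replicates and power near 80%, that standard error works out to a bit over one percentage point, which is why repeated runs agree to within a few percent. Increase n_sims if you need a tighter estimate.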
Congratulations, you just did your first simulation / power study! Of course, because the distribution of the t-statistic is known both when the null is true and when it is false, we can mathematically determine the power of a t-test without doing simulation studies:
power.t.test(n=100,delta=2,sd=7,sig.level=.05,type="one.sample")
But if we had developed our own method or algorithm we probably wouldn't have a mathematical formula to calculate power, which is why we rely on simulation studies. Be sure to check out my other posts on power calculation software, choosing the correct analyses, and code to run analyses in R and other software.
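As a rough sketch of what that looks like in practice, the simulation can be wrapped in a small function that accepts any test returning a p-value. The helper name sim_power and its arguments are hypothetical, and the Wilcoxon signed-rank test is only a stand-in for "our own method" here.

```r
# Hypothetical helper: estimate power by simulation for any function that
# takes a numeric sample and returns a p-value.
sim_power <- function(test_fun, n = 100, mu = 2, sigma = 7,
                      n_sims = 1000, alpha = 0.05) {
  p_values <- replicate(n_sims, test_fun(rnorm(n, mu, sigma)))
  mean(p_values < alpha)  # proportion of simulated experiments that reject
}

set.seed(1)  # arbitrary seed, only so the run is reproducible

# Sanity check on the t-test, where an analytic answer is available
power_t_sim   <- sim_power(function(x) t.test(x)$p.value)
power_t_exact <- power.t.test(n = 100, delta = 2, sd = 7,
                              sig.level = 0.05, type = "one.sample")$power

# Stand-in for a method without a handy power formula
power_w_sim <- sim_power(function(x) wilcox.test(x)$p.value)

c(t_simulated = power_t_sim, t_analytic = power_t_exact,
  wilcoxon_simulated = power_w_sim)
```

Checking the simulated t-test power against power.t.test first is a cheap way to confirm the simulation machinery works before trusting it on a method with no analytic answer.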
Tags:
News,
R,
Recommended Reading,
Statistics,
Tutorials
Thursday, June 11, 2009
Get your genome sequenced by Illumina for $48k
This week Illumina launched their own personal genome sequencing service. For $48,000 they'll send you the sequence of your entire genome on a Mac computer that you can keep. According to their website, all the sequencing is done in a CLIA-certified clinical lab. One thing different about this than other consumer genetics services is that they require you to consult your doctor before signing up, and have them request sequencing for you, like writing a prescription. Then they send the sequence back to your doctor to discuss your results.
Now, even as a geneticist I'm not sure what to tell a layperson to do with all 3 billion of their bases sequenced, so what is a general practitioner to do when their patient seeks medical advice based on a service like this? Share your thoughts in the comments.
http://www.everygenome.com/
Tags:
News,
Sequencing
Thursday, May 28, 2009
Statistics and sex appeal
Google's chief economist was recently quoted as saying "The sexy job in the next ten years will be statisticians… The ability to take data-to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it-that’s going to be a hugely important skill." I'll leave you for the weekend with this ego-boosting article relating how our skill set as statisticians is a hot commodity in the real world.
Dataspora Blog: The three sexy skills of data geeks
Tags:
News,
Noteworthy blogs,
Statistics
Wednesday, April 22, 2009
Sequencing is Not (Yet) the Silver Bullet
Amidst the fallout of an academic discussion over the worth of GWA studies followed by several gloomy and scathing articles in the popular press, came this paper in Nature Genetics. In summary, the investigators sequenced all the coding DNA on the X-chromosome in families affected with an evidently X-linked mental retardation phenotype. They found nearly 1000 changes that would either alter the amino acid sequence, introduce a frameshift or stop codon, or interfere with splicing, all illustrating the huge heterogeneity problem and the analytical conundrum to face when there are too many potential susceptibility/causal changes. What's more, nearly all of the protein-truncating variants they found were unique to individual families, and many of these were found in both affected and unaffected males! I believe this highlights the fact that an argument pushing for whole-genome sequencing should not omit a discussion of the analytical and interpretation issues we will have to deal with when the time comes.
Nature Genetics: A systematic, large-scale resequencing screen of X-chromosome coding exons in mental retardation
Abstract: Large-scale systematic resequencing has been proposed as the key future strategy for the discovery of rare, disease-causing sequence variants across the spectrum of human complex disease. We have sequenced the coding exons of the X chromosome in 208 families with X-linked mental retardation (XLMR), the largest direct screen for constitutional disease-causing mutations thus far reported. The screen has discovered nine genes implicated in XLMR, including SYP, ZNF711 and CASK reported here, confirming the power of this strategy. The study has, however, also highlighted issues confronting whole-genome sequencing screens, including the observation that loss of function of 1% or more of X-chromosome genes is compatible with apparently normal existence.
Tags:
News,
Recommended Reading,
Sequencing
Friday, April 17, 2009
Five articles on the success/failure of GWAS
Here are four interesting and provocative articles in New England Journal of Medicine:
David Goldstein: Common Genetic Variation and Human Traits
Joel Hirschhorn: Genomewide Association Studies — Illuminating Biologic Pathways
Peter Kraft and David Hunter: Genetic Risk Prediction — Are We There Yet?
John Hardy & Andrew Singleton: Genomewide Association Studies and Human Disease
And after all their hype over the last year or so, a rather pessimistic article in the NYT.
Share your thoughts in the comments!
Tags:
GWAS,
News,
Recommended Reading
