Margin-based Prediction of Polymorphic Regions (mPPR)
Supplementary website for our paper "Detecting Polymorphic Regions in Arabidopsis thaliana with Resequencing Microarrays", published in Genome Research.
To cite the polymorphic region data, please refer to:
Georg Zeller, Richard M Clark, Korbinian Schneeberger, Anja Bohlen, Detlef Weigel and Gunnar Rätsch (2008) Detecting Polymorphic Regions in the Arabidopsis thaliana Genome with Resequencing Microarrays. Genome Res. 2008 18: 918-929.
In a previous project, we identified single nucleotide polymorphisms (SNPs) as the most common form of natural sequence variation in the model plant Arabidopsis thaliana using whole-genome resequencing with high-density oligonucleotide arrays [1]. On these arrays, hybridization signals from nearly one billion features were measured for each of 20 wild strains (accessions) of A. thaliana, including the reference accession Col-0 (with a known genome sequence of about 125 Mb).
Recall for SNPs is typically high in regions of low to moderate polymorphism density. In regions of clustered SNPs, however, which are often accompanied by indels, neighboring polymorphisms (at a distance <25 bp) disrupt the hybridization signal used for SNP detection. Thus, a region with very few SNP calls can indicate either high similarity to the reference or densely clustered polymorphisms that went mostly undetected [1].
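To make the 25 bp neighborhood effect concrete, the following sketch flags polymorphism positions that have another polymorphism closer than 25 bp, that is, positions where the array signal is likely to be disturbed. The function name `flag_clustered`, the `window` parameter, and the example coordinates are illustrative assumptions and are not taken from the mPPR software.

```python
def flag_clustered(positions, window=25):
    """Return the set of positions that have a neighbor closer than `window` bp.

    `positions` are coordinates of known polymorphisms on one chromosome
    (hypothetical input; any iterable of integers works).
    """
    pos = sorted(set(positions))
    clustered = set()
    # After sorting, the nearest neighbor of a position is always adjacent
    # in the list, so comparing consecutive pairs is sufficient.
    for left, right in zip(pos, pos[1:]):
        if right - left < window:
            clustered.update((left, right))
    return clustered


# Made-up coordinates: 100/110 and 200/224 are closer than 25 bp,
# while 300/325 are exactly 25 bp apart and therefore not flagged.
example = [100, 110, 200, 224, 300, 325]
print(sorted(flag_clustered(example)))  # prints [100, 110, 200, 224]
```

Positions returned by such a filter are the ones for which few or no SNP calls should not be read as evidence of similarity to the reference.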
We developed a machine learning method, margin-based prediction of polymorphic regions (mPPR), to reliably recognize the pattern of suppressed intensity that results from clustered polymorphisms or deletions. Our label sequence learning algorithm is an extension of Hidden Markov Support Vector Machines (HM SVMs) [2], which are conceptually similar to Hidden Markov Models but are trained with discriminative learning techniques inspired by SVMs.
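As with Hidden Markov Models, predicting a label sequence with such a model amounts to finding the highest-scoring state path by dynamic programming (Viterbi decoding). The sketch below shows only this decoding step for two states; it is not the mPPR implementation, and it does not cover the discriminative training. The state names, the toy scoring of intensities, and the switch penalty of 0.5 are assumptions chosen for illustration.

```python
def viterbi(emission, transition):
    """Return the highest-scoring state path.

    emission[t][s]   : score of state s at position t
    transition[r][s] : score of moving from state r to state s
    Scores are additive (not probabilities), as in a linear scoring model.
    """
    n_states = len(transition)
    score = list(emission[0])
    back = []  # back[t-1][s] = best predecessor of state s at position t
    for t in range(1, len(emission)):
        new_score, pointers = [], []
        for s in range(n_states):
            best_prev = max(range(n_states),
                            key=lambda r: score[r] + transition[r][s])
            pointers.append(best_prev)
            new_score.append(score[best_prev] + transition[best_prev][s]
                             + emission[t][s])
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


# Hypothetical normalized hybridization intensities in [0, 1].
intensities = [0.9, 0.8, 0.2, 0.6, 0.1, 0.2, 0.9, 0.85]
# State 0 = conserved (rewarded by high intensity),
# state 1 = polymorphic (rewarded by suppressed intensity).
emission = [[x - 0.5, 0.5 - x] for x in intensities]
# Assumed penalty of 0.5 for switching between states.
transition = [[0.0, -0.5], [-0.5, 0.0]]

print(viterbi(emission, transition))  # prints [0, 0, 1, 1, 1, 1, 0, 0]
```

Note how the switch penalty keeps position 3 (intensity 0.6) inside the predicted polymorphic region: a single moderately bright probe is not enough to split a stretch of suppressed signal.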
Genome-wide, we detected between 240,000 and 361,000 polymorphic regions (PRs) per accession, comprising between 5.3% and 8.5% of the genome. For these predictions, we estimated a false discovery rate of <10% and a sensitivity of 55%.
| Accession | Number of PRs | % genome in PRs |
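For readers less familiar with the two quality measures quoted above, the following sketch computes them from counts of true positives, false positives, and false negatives. The counts in the example are invented purely to illustrate the arithmetic and are not the evaluation data from the paper; the function name is likewise hypothetical.

```python
def fdr_and_sensitivity(tp, fp, fn):
    """Return (false discovery rate, sensitivity).

    FDR         = FP / (TP + FP): fraction of predictions that are wrong.
    Sensitivity = TP / (TP + FN): fraction of true cases that are found.
    """
    if tp + fp == 0 or tp + fn == 0:
        raise ValueError("need at least one prediction and one true case")
    return fp / (tp + fp), tp / (tp + fn)


# Invented counts, for illustration only.
fdr, sensitivity = fdr_and_sensitivity(tp=55, fp=5, fn=45)
print(f"FDR = {fdr:.1%}, sensitivity = {sensitivity:.0%}")
# prints FDR = 8.3%, sensitivity = 55%
```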
The software is available in two versions:
- A version that has been used to produce the results in the paper. It can be downloaded here: http://www.fml.tuebingen.mpg.de/raetsch/projects/mppr/mppr-0.1.tar.gz. This version is available for academic use only.
- An open source toolbox called HMSVM with an improved and easier-to-use implementation of the algorithm is available at: http://www.fml.tuebingen.mpg.de/raetsch/projects/mppr/hmsvm-0.1.tar.gz.
[1] Clark, Schweikert, Toomajian, Ossowski, Zeller, Shinn, Warthmann, Hu, Fu, Hinds, Chen, Frazer, Huson, Schoelkopf, Nordborg, Raetsch, Ecker, Weigel. Common sequence polymorphisms shaping genetic diversity in Arabidopsis thaliana. Science, 317(5836):338-342, 2007.
[2] Tsochantaridis, Joachims, Hofmann, and Altun. Large margin methods for structured and interdependent output variables. J. Mach. Learn. Res., 6:1453-1484, 2005.