Transcript Normalization and Segmentation of Tiling Array Data
Supplementary Website for our paper on "Transcript Normalization and Segmentation of Tiling Array Data" presented at the Pacific Symposium on Biocomputing 2008
For the analysis of transcriptional tiling arrays we have developed two methods based on state-of-the-art machine learning algorithms. First, we present a novel transcript normalization technique to alleviate the effect of oligonucleotide probe sequences on hybridization intensity. It is specifically designed to decrease the variability observed for individual probes complementary to the same transcript. Applying this normalization technique to Arabidopsis tiling arrays, we are able to reduce sequence biases and also significantly improve the separation in signal intensity between exonic and intronic/intergenic probes. Our second contribution is a method for transcript mapping. It extends an algorithm proposed for yeast tiling arrays to the more challenging task of spliced transcript identification. Comparing raw and normalized intensities as input, our method achieves its highest prediction accuracy when segmentation is performed on transcript-normalized tiling array data.
To cite the transcript normalization or segmentation method, please refer to:
Zeller G., Henz S.R., Laubinger S., Weigel D., and Rätsch G. (2008) Transcript Normalization and Segmentation of Tiling Array Data, Pacific Symposium on Biocomputing 13:527-538.
For all types of DNA oligonucleotide microarrays, strong effects of probe sequences on hybridization intensity have been observed (e.g. Naef and Magnasco, 2003; Royce et al., 2007). To reduce the within-gene variability of whole-genome transcriptome tiling array measurements, we have developed a novel transcript normalization technique. Assuming that hybridization intensity should ideally be constant across all probes interrogating the same transcript, we related the deviations between observed intensities and ideal transcript intensities to the sequences of the oligonucleotide probes (see Figure 1 for an illustration).
We modeled probe sequences as 0/1 vectors that indicate position-specific occurrences of mono-, di- and trimer substrings and performed regularized linear regression (ridge regression) to predict the deviation from median transcript intensities. After correcting with these estimates, the variation of individual probes from constant transcript intensities decreased to approximately one half, and sequence effects such as GC bias were reduced to an extent comparable to other methods (Royce et al., 2007; Figure 2). In contrast to the approach by Royce and colleagues (2007), transcript normalization resulted in improved separation between the intensities of exonic probes on the one hand and intronic and intergenic probes on the other hand (Figure 3), especially for genes with low to moderate expression levels (Figure 4).
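The encoding and regression step described above can be sketched roughly as follows. This is a minimal illustration and not the released implementation: the probe sequences, intensities, regularization constant `lam`, and all function names are made up for the example, and the ridge weights are obtained through the dual form with a small hand-written solver to keep the sketch dependency-free.

```python
from itertools import product
from statistics import median

ALPHABET = "ACGT"

def build_feature_index(probe_len, max_k=3):
    """Map every (k-mer, start position) pair to a column index."""
    index = {}
    for k in range(1, max_k + 1):
        for pos in range(probe_len - k + 1):
            for kmer in product(ALPHABET, repeat=k):
                index[("".join(kmer), pos)] = len(index)
    return index

def encode_probe(seq, index, max_k=3):
    """0/1 vector marking which mono-, di- and trimers occur at which position."""
    vec = [0.0] * len(index)
    for k in range(1, max_k + 1):
        for pos in range(len(seq) - k + 1):
            vec[index[(seq[pos:pos + k], pos)]] = 1.0
    return vec

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def ridge_fit(X, y, lam=1.0):
    """Ridge weights via the dual form: w = X^T (X X^T + lam * I)^{-1} y."""
    n = len(X)
    K = [[sum(a * b for a, b in zip(X[i], X[j])) + (lam if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    alpha = solve(K, y)
    return [sum(alpha[i] * X[i][d] for i in range(n)) for d in range(len(X[0]))]

def predict(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

# Toy data (made up): transcript id -> list of (probe sequence, log2 intensity)
transcripts = {
    "tx1": [("ACGTAC", 8.0), ("GGGCCC", 10.0), ("ATATAT", 7.0)],
    "tx2": [("GCGCGC", 6.5), ("TTATAA", 4.0), ("ACGTAC", 5.0)],
}

index = build_feature_index(probe_len=6)
X, y = [], []
for probes in transcripts.values():
    med = median(v for _, v in probes)   # stand-in for the ideal transcript intensity
    for seq, v in probes:
        X.append(encode_probe(seq, index))
        y.append(v - med)                # regression target: deviation from the median

w = ridge_fit(X, y, lam=0.1)
# Deviation that remains after subtracting the sequence-based estimate
residuals = [yi - predict(w, xi) for xi, yi in zip(X, y)]
print([round(r, 3) for r in residuals])
```

On real arrays the probes are longer and far more numerous, so the primal system (one unknown per feature) solved with a numerical library would be the more natural choice; the dual form is used here only because the toy example has fewer probes than features.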
The confirmation of annotated genes as well as the detection of novel transcripts requires a segmentation of transcriptome tiling array measurements into exonic and intronic or untranscribed regions. We approached this task with an HM-SVM-based method called mSTAD (margin-based segmentation of tiling array data) that conceptually extends a recently published transcript mapping algorithm applied to yeast tiling array data (Huber et al., 2006). HM-SVMs (Tsochantaridis et al., 2005) combine the benefits of Hidden Markov Models, namely a state model that defines the allowed segmentations (Figure 5), with those of large-margin classifiers such as Support Vector Machines.
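How a state model restricts the set of allowed segmentations can be illustrated with a small Viterbi-style decoder. The sketch below is not mSTAD: the three states, the transition scores, the threshold, and the per-probe scoring function are hypothetical placeholders, whereas in an HM-SVM the scoring functions would be learned with large-margin training.

```python
NEG_INF = float("-inf")
STATES = ["intergenic", "exon", "intron"]

# Allowed transitions with hypothetical scores; any pair not listed is forbidden.
# In particular an intron can only be entered from, and left into, an exon.
TRANS = {
    ("intergenic", "intergenic"): 0.0,
    ("intergenic", "exon"): -3.0,
    ("exon", "exon"): 0.0,
    ("exon", "intron"): -1.0,
    ("exon", "intergenic"): -3.0,
    ("intron", "intron"): 0.0,
    ("intron", "exon"): -1.0,
}

def probe_score(state, intensity, threshold=5.0):
    """Hypothetical per-probe score: exons prefer high, other states low intensity."""
    if state == "exon":
        return intensity - threshold
    return threshold - intensity

def segment(intensities, start="intergenic", end="intergenic"):
    """Return the highest-scoring allowed state sequence (Viterbi decoding)."""
    if not intensities:
        return []
    # best[t][s]: best score of any allowed path that ends in state s at probe t
    best = [{s: NEG_INF for s in STATES} for _ in intensities]
    back = [{} for _ in intensities]
    best[0][start] = probe_score(start, intensities[0])
    for t in range(1, len(intensities)):
        for s in STATES:
            for p in STATES:
                if (p, s) not in TRANS or best[t - 1][p] == NEG_INF:
                    continue
                cand = best[t - 1][p] + TRANS[(p, s)] + probe_score(s, intensities[t])
                if cand > best[t][s]:
                    best[t][s] = cand
                    back[t][s] = p
    # Trace back from the required end state
    path = [end]
    for t in range(len(intensities) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]

labels = segment([1, 1, 9, 9, 2, 2, 8, 9, 1, 1])
print(labels)
```

Because the low-intensity stretch between the two high-intensity stretches can be explained by a cheap exon-intron-exon path, the decoder labels it as an intron rather than as an intergenic gap, which is the kind of distinction a plain per-probe threshold cannot make.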
mSTAD identified exonic probes more accurately than naive thresholding techniques or the widely used transfrag method (Kampa et al., 2004) implemented in the Affymetrix tiling array software (Figure 6).
If you are interested in finding out how your favorite Arabidopsis gene can be detected in a tiling array atlas of transcription surveying several tissues and developmental stages (Laubinger et al., 2008), have a look at the At-TAX Genome Browser.
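For comparison, a threshold-based baseline in the spirit of the transfrag idea can be sketched in a few lines. This is a simplified illustration only: the parameter names `threshold`, `maxgap` and `minrun`, the exact joining rule, and the toy values are assumptions, and the signal smoothing used by real implementations is omitted.

```python
def transfrags(positions, intensities, threshold, maxgap, minrun):
    """Join above-threshold probes into intervals (simplified transfrag-style rule).

    Probes at or above `threshold` are merged when they lie at most `maxgap`
    bases apart; merged intervals spanning fewer than `minrun` bases are dropped.
    """
    frags = []
    start = last = None
    for pos, val in zip(positions, intensities):
        if val < threshold:
            continue
        if start is not None and pos - last <= maxgap:
            last = pos                      # extend the current interval
        else:
            if start is not None and last - start >= minrun:
                frags.append((start, last))
            start = last = pos              # open a new interval
    if start is not None and last - start >= minrun:
        frags.append((start, last))
    return frags

# Toy example (made up): probe start positions and intensities
positions = [0, 35, 70, 105, 140, 175, 210, 245]
intensities = [9, 8, 2, 9, 1, 1, 1, 9]
frags = transfrags(positions, intensities, threshold=5, maxgap=80, minrun=70)
print(frags)
```

Such a rule has no notion of exon-intron structure, so a single weak probe inside an exon or a bright probe inside an intron directly changes the reported intervals.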
The transcript-normalization and the mSTAD segmentation method are free software. Source code and supplementary data can be downloaded from our FTP server.
mSTAD has been applied to tiling array data from several tissues and developmental stages of the model plant Arabidopsis thaliana. Expression data, predicted transcripts, visualizations and additional resources are available at the At-TAX project page.
Using a similar HM-SVM-based approach, we identified polymorphic regions defining SNPs, SNP clusters, and small to large indels from resequencing microarrays. Details can be found on the mPPR project page.
Naef and Magnasco. Solving the riddle of the bright mismatches: the physics of hybridization. Phys. Rev. E 68, 011906, 2003.
Royce, Rozowsky and Gerstein. Assessing the need for sequence-based normalization in tiling microarray experiments. Bioinf. 23(8):988-997, 2007.
Tsochantaridis, Joachims, Hofmann and Altun. Large margin methods for structured and interdependent output variables. J. Mach. Learn. Res. 6:1453-1484, 2005.
Huber, Toedling and Steinmetz. Transcript mapping with high-density oligonucleotide tiling arrays. Bioinf. 22(16):1963-1970, 2006.
Kampa, Cheng, Kapranov, Yamanaka, Brubaker, et al. Novel RNAs identified from an in-depth analysis of the transcriptome of human chromosomes 21 and 22. Genome Res. 14(3):331-342, 2004.
Laubinger, Zeller, Henz, Sachsenberg, Widmer, et al. At-TAX: a whole genome tiling array resource for developmental expression analysis and transcript identification in Arabidopsis thaliana. Genome Biol. 9:R112, 2008.