Bertinoro Course on "Machine Learning for biological Sequence Analysis"
Course on "Machine Learning for biological Sequence Analysis" by Gunnar Rätsch on the 19th of March 2008 in Bertinoro, Italy.
The slides can be downloaded here.
Machine learning is the study of algorithms which generalize knowledge gained from empirical data. In this tutorial I will focus on supervised learning for biological sequence analysis, where a typical task is to predict properties of a sequence. Examples include protein homology detection, gene finding, prediction of protein functions, etc. I will start with a broad introduction to machine learning, including classification, regression, semi-supervised and unsupervised learning, generalization performance and model selection. In the second part I will focus on Support Vector Machines (SVMs), the most popular example of binary classification algorithms. They utilize so-called kernels that formalize the similarity between examples and allow the design of efficient and mathematically elegant algorithms. In the third part I will introduce a few powerful kernel functions for sequence analysis in detail with practical examples. Finally, I will discuss several applications of these techniques in computational biology.
Machine learning is the study of algorithms which generalize knowledge gained from empirical data. We will focus on the supervised learning paradigm, where the algorithm is provided with training examples as well as an expert opinion of the correct answer. The algorithm’s task is to find the best decision function for future examples.
- Un- and Semi-supervised learning
- Generalization and Model selection
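Generalization performance is usually estimated on examples that were not used for training. The following is a minimal sketch of k-fold cross-validation, not taken from the slides: the function names, the toy one-dimensional data and the nearest-mean classifier are all assumptions chosen for illustration. Model selection then amounts to comparing this estimate across candidate models or parameter settings.

```python
def k_fold_indices(n, k):
    """Split range(n) into k contiguous folds of near-equal size."""
    folds, start = [], 0
    for f in range(k):
        size = n // k + (1 if f < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def train_nearest_mean(xs, ys):
    """Toy classifier (hypothetical): assign the label of the closer class mean."""
    pos = [x for x, y in zip(xs, ys) if y == +1]
    neg = [x for x, y in zip(xs, ys) if y == -1]
    mu_pos, mu_neg = sum(pos) / len(pos), sum(neg) / len(neg)
    return lambda x: +1 if abs(x - mu_pos) < abs(x - mu_neg) else -1


def cross_val_accuracy(train_fn, xs, ys, k=5):
    """Average accuracy on held-out folds; each example is tested exactly once."""
    correct = 0
    for held_out in k_fold_indices(len(xs), k):
        held = set(held_out)
        train_x = [x for i, x in enumerate(xs) if i not in held]
        train_y = [y for i, y in enumerate(ys) if i not in held]
        predict = train_fn(train_x, train_y)
        correct += sum(predict(xs[i]) == ys[i] for i in held_out)
    return correct / len(xs)


# Toy data (made up): negatives lie near 0, positives near 1.
toy_x = [0.0, 1.0, 0.2, 0.9, 0.1, 1.1, 0.3, 0.8, 0.05, 0.95]
toy_y = [-1, +1, -1, +1, -1, +1, -1, +1, -1, +1]
accuracy = cross_val_accuracy(train_nearest_mean, toy_x, toy_y, k=5)
```

The contiguous folds keep the sketch deterministic; on real data one would normally shuffle the examples first.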
Support Vector Machines (SVMs) maximize the margin between positive and negative training examples. They are the most popular example of binary classification algorithms (algorithms which predict “yes/no” answers) and build upon the solid foundation of statistical learning and optimization theory. They utilize so-called kernels that formalize the similarity between examples and allow the design of efficient and mathematically elegant algorithms. Moreover, many statistical algorithms can be reformulated using kernels (usually referred to as the “kernel trick”) to allow nonlinear decision functions as well as structured data types.
- Maximal margin algorithm
- Convex optimization problems
- Positive semidefinite kernels
- Beyond 2-class classification
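The kernel trick can be illustrated in a few lines. The sketch below is not from the slides and all names in it are hypothetical. It first checks that a degree-2 polynomial kernel equals an ordinary inner product in an explicit feature space. It then trains a kernel perceptron, used here as a simpler stand-in for an SVM (it finds some separating function, not the maximal-margin one), on the XOR problem, which no linear function of the raw inputs can solve.

```python
import math


def dot(x, z):
    return sum(a * b for a, b in zip(x, z))


def poly_kernel(x, z):
    """Homogeneous polynomial kernel of degree 2: k(x, z) = (x . z)^2."""
    return dot(x, z) ** 2


def phi(x):
    """Explicit feature map for 2-D inputs with phi(x) . phi(z) = (x . z)^2."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)


def inhomogeneous_kernel(x, z):
    """k(x, z) = (1 + x . z)^2; the constant adds bias and linear features."""
    return (1 + dot(x, z)) ** 2


def train_kernel_perceptron(xs, ys, kernel, max_epochs=200):
    """Kernel perceptron: the decision function is f(x) = sum_j alpha_j y_j k(x_j, x)."""
    alpha = [0] * len(xs)

    def f(x):
        return sum(a * yj * kernel(xj, x) for a, xj, yj in zip(alpha, xs, ys))

    for _ in range(max_epochs):
        mistakes = 0
        for i, (x, y) in enumerate(zip(xs, ys)):
            if y * f(x) <= 0:  # misclassified (or on the boundary): add the example
                alpha[i] += 1
                mistakes += 1
        if mistakes == 0:
            break
    return lambda x: 1 if f(x) > 0 else -1


# The kernel value and the inner product of the feature vectors agree.
x, z = (1.0, 2.0), (3.0, 4.0)
kernel_value = poly_kernel(x, z)
feature_value = dot(phi(x), phi(z))

# XOR: not linearly separable in the input space, separable with the kernel.
xor_x = [(0, 0), (0, 1), (1, 0), (1, 1)]
xor_y = [-1, 1, 1, -1]
predict = train_kernel_perceptron(xor_x, xor_y, inhomogeneous_kernel)
xor_predictions = [predict(p) for p in xor_x]
```

The training loop only ever touches the data through `kernel`, which is why the same code would work with a kernel defined on strings instead of vectors.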
In this section, we explain how kernels can be defined on sequences such as DNA or amino acid sequences. These kernels are the modeling tool that allows us to apply the algorithms presented in the previous section to complex data structures arising in computational biology. We illustrate how a practitioner can construct kernels for a particular application by combining known kernels.
- Spectrum kernel and weighted degree kernel
- Guidelines for kernel design
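As a rough illustration of the two kernels named above, here is a minimal, unoptimized sketch. It is not the implementation used in the course: the function names are hypothetical, and the weighting beta_k = 2(d - k + 1) / (d(d + 1)) in the weighted degree kernel is one commonly used choice that is assumed here. The last function shows the simplest way of combining known kernels, since a sum of kernels with non-negative weights is again a kernel.

```python
from collections import Counter


def spectrum_kernel(x, y, k=3):
    """Spectrum kernel: sum over all k-mers u of count_x(u) * count_y(u).

    Position-independent, so the sequences may differ in length.
    """
    counts_x = Counter(x[i:i + k] for i in range(len(x) - k + 1))
    counts_y = Counter(y[i:i + k] for i in range(len(y) - k + 1))
    return sum(c * counts_y[u] for u, c in counts_x.items() if u in counts_y)


def weighted_degree_kernel(x, y, degree=3):
    """Weighted degree kernel: k-mers (k = 1..degree) matching at the same position.

    Assumes equal-length sequences; longer matches receive smaller weights beta_k
    because every long match also contributes many shorter matches.
    """
    if len(x) != len(y):
        raise ValueError("weighted degree kernel needs sequences of equal length")
    total = 0.0
    for k in range(1, degree + 1):
        beta = 2.0 * (degree - k + 1) / (degree * (degree + 1))
        matches = sum(1 for i in range(len(x) - k + 1) if x[i:i + k] == y[i:i + k])
        total += beta * matches
    return total


def combined_kernel(x, y, w_spectrum=1.0, w_wd=1.0):
    """Non-negative weighted sum of the two kernels (weights are illustrative)."""
    return w_spectrum * spectrum_kernel(x, y) + w_wd * weighted_degree_kernel(x, y)


# Small made-up DNA fragments of equal length.
s1, s2 = "ACGTACGT", "ACGTTCGT"
k_spec = spectrum_kernel(s1, s2, k=3)
k_wd = weighted_degree_kernel(s1, s2, degree=3)
k_sum = combined_kernel(s1, s2)
```

The spectrum kernel ignores where a k-mer occurs, while the weighted degree kernel only counts matches at identical positions, so the choice between them (or a combination) depends on whether the signal of interest is localized within the sequence.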
We discuss how several important questions in bioinformatics have been tackled using SVMs and string kernels.
- Transcript Start Recognition
- Tiling Array Normalization
Additionally, I will mention a few software packages that implement the algorithms mentioned in the course.