# MEDFL5275 - Prediction (in Molecular Biology)

## Course description

## Course content

The course focuses on prediction of future and/or unmeasured outcomes based on a variety of high-dimensional molecular data. What do we want to predict? Typically this is the success or failure of a therapy given to a patient (binary or categorical outcome, also called classification); it can be bone mineral density or gene expression in so-called eQTL studies (continuous outcomes); it can be survival after cancer surgery or time to recurrence of a disease (time-to-event outcomes). This course does not cover methods for dividing the patients in a study into subgroups, as is the aim when sub-typing a disease (unsupervised clustering).

Prediction is based on various models which use molecular data as input (genomics, metabolomics, proteomics and epigenetics data, for example) in addition to other individual variables (demographic, clinical, exposure data). What characterizes these data is their huge dimension (say all genes or all SNPs, so a large number p of variables) compared to the smaller number n of individuals in a study. Variables can be discrete, categorical or continuous, and can also be related to more complex structures like ontologies and pathways (networks).

There are many methods for predicting outcomes from data in a p>n setting. This course focuses on:

A. Penalized methods, such as lasso, ridge and elastic net, including parameter tuning using cross-validation.

B. Bayesian methods, based on prior knowledge and exploiting Markov chain Monte Carlo algorithms.

C. Machine learning approaches, including tree-based methods, support vector machines, kernel methods and neural networks/deep learning.

We will study ways of combining different predictions with:

D. Boosting, bagging and other ensemble methods.

Finally, we will discuss how to compare and evaluate prediction methods to determine which one performs best:

E. Performance measures of prediction methods and their estimation using resampling methods (bootstrapping, cross-validation).

Additional themes that will be treated in the course and appear across the five topics above include (i) selection of variables, (ii) interaction, (iii) integration of various data sets at different scales, and (iv) resampling methods.
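As an illustration of the penalized methods in topic A, the elastic net for a continuous outcome can be written as the penalized least-squares problem below. This follows the parametrization documented for the glmnet package; the symbols (outcome $y_i$, covariate vector $x_i$, intercept $\beta_0$, coefficients $\beta$, penalty weight $\lambda \ge 0$, mixing parameter $\alpha \in [0,1]$) are introduced here for illustration and do not appear elsewhere in this course description.

$$
\min_{\beta_0,\,\beta}\; \frac{1}{2n}\sum_{i=1}^{n}\left(y_i - \beta_0 - x_i^{\top}\beta\right)^2 \;+\; \lambda\left[\alpha\,\lVert\beta\rVert_1 + \frac{1-\alpha}{2}\,\lVert\beta\rVert_2^2\right]
$$

Setting $\alpha = 1$ gives the lasso and $\alpha = 0$ gives ridge regression. The penalty weight $\lambda$ is the tuning parameter that is typically chosen by cross-validation.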

The plan of the five days is as follows:

Day 1: Introduction to prediction (versus clustering, estimation and testing). Introduction to the p>n setting. A series of examples of papers based on prediction. Introduction to software: R, Bioconductor, Stan, JAGS and other libraries in R.

Day 2: Penalized approaches (A). Software: glmnet package.

Day 3: Bayesian predictions (B). Software: Stan.

Day 4: Machine learning methods (C). Software: various R/Bioconductor packages.

Day 5: Ensemble methods and performance evaluation (scoring).
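Estimating prediction performance by cross-validation (theme E) can be sketched as follows. This is a minimal illustration under stated assumptions, not course material: it uses Python's standard library rather than the R/Bioconductor packages used in the practicals, and the simulated p>n data, the nearest-centroid classifier and all function names are hypothetical choices made for the example.

```python
import random

def simulate(n=40, p=200, informative=10, shift=1.5, seed=1):
    """Simulate n samples with p features (p > n); only the first
    `informative` features differ in mean between the two classes."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n):
        label = i % 2  # alternate labels so the classes are balanced
        row = [rng.gauss(shift * label if j < informative else 0.0, 1.0)
               for j in range(p)]
        X.append(row)
        y.append(label)
    return X, y

def fit_centroids(X, y):
    """Return the per-class mean vector (centroid) for each label."""
    centroids = {}
    for label in set(y):
        rows = [x for x, lab in zip(X, y) if lab == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids

def predict(centroids, x):
    """Assign x to the class whose centroid is nearest (squared Euclidean distance)."""
    def dist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda label: dist(centroids[label]))

def cross_validated_accuracy(X, y, k=5, seed=0):
    """Estimate accuracy by k-fold cross-validation.
    The model is refit on each training split and scored only on held-out samples."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[f::k] for f in range(k)]  # k disjoint folds covering all samples
    correct = 0
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in idx if i not in held_out]
        model = fit_centroids([X[i] for i in train_idx],
                              [y[i] for i in train_idx])
        correct += sum(predict(model, X[i]) == y[i] for i in test_idx)
    return correct / len(X)

X, y = simulate()
cv_accuracy = cross_validated_accuracy(X, y)
print(f"5-fold cross-validated accuracy: {cv_accuracy:.2f}")
```

The key design point is that every step that learns from the data (here, computing the centroids) happens inside the loop on the training split only. Fitting on all samples and then scoring on the same samples would give an optimistic estimate, which is one of the pitfalls a prediction analysis needs to avoid.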

## Learning outcome

After the course the students should

• know what prediction is in contrast to estimation, testing, and clustering,

• know which steps are involved in a prediction task and which pitfalls need to be avoided,

• be able to identify appropriate methods for a given problem, and to perform prediction tasks using R and Bioconductor packages,

• be able to assess methods they read about and to put them in the wider context,

• be able to assess the performance of prediction results, as they are typically reported in publications.

## Admission

The course is restricted to students at the Medical Student Research Programme at the Faculty of Medicine and the Faculty of Dentistry, UiO.

Students apply in StudentWeb.

The courses MEDFL5275 and IMB9275 have common admission.

## Prerequisites

### Formal prerequisite knowledge

A passed exam in an introductory course in statistics (e.g. MF9130) and in an advanced course in statistics that includes multiple regression.

### Recommended previous knowledge

Basic knowledge of linear algebra and statistics is expected. The practicals will be run using the statistical computing environment R and Bioconductor. We expect students to be familiar with performing data analysis in R/Bioconductor.

## Overlapping courses

5 credits overlap with IMB9275 - Prediction (in Molecular Biology)

## Teaching

The course is given as an intensive one-week course (5 days, Monday-Friday) with lectures in the mornings (three hours, including discussions) and practical hands-on sessions in the afternoons (four hours). During the practical sessions the students will use R/Bioconductor to analyze given datasets using different prediction approaches. On the last day the students will give brief presentations of their prediction results to the class, followed by a summary session. Students will have the opportunity to provide feedback at the end of the course.

The students will receive a reading list before the course and are expected to prepare well. After the course the students will do a project, possibly using their own data, and deliver a written report within a month (home exam). We aim to divide students into small groups whose members have complementary backgrounds (e.g. one biostatistics, one bioinformatics, and one molecular biology student in each group).

Students need to bring their own laptop. We encourage the use of RStudio (http://www.rstudio.com) and of the reproducible research tools knitr (http://yihui.name/knitr) and R Markdown.

You have to participate in at least 80 % of the teaching to be allowed to take the exam. Attendance will be registered.

## Examination

The exam will be a home exam in the form of project work. Students will have to deliver a written report about their project within a month after completing the course.

The exam paper will be provided in English. The exam should be answered in English.

### Grading scale

Grades are awarded on a pass/fail scale.

### Explanations and appeals

### Resit an examination

### Withdrawal from an examination

It is possible to take the exam up to 3 times. If you withdraw from the exam after the deadline or during the exam, this will be counted as an examination attempt.

## Evaluation

The course is subject to continuous evaluation. At regular intervals we also ask students to participate in a more comprehensive evaluation.

This course is organized as part of the National Research School in Bioinformatics, Biostatistics and Systems Biology (NORBIS).