Ing Ind - Inf (Mag.)(ord. 270) - MI (481) COMPUTER SCIENCE AND ENGINEERING - INGEGNERIA INFORMATICA | * | A | ZZZZ | 052498 - APPLIED STATISTICS

Ing Ind - Inf (Mag.)(ord. 270) - MI (487) MATHEMATICAL ENGINEERING - INGEGNERIA MATEMATICA | * | A | ZZZZ | 052498 - APPLIED STATISTICS; 052742 - APPLIED STATISTICS

Course objectives

The course trains students to become professional data scientists by reviewing several statistical models, methods and algorithms for the statistical analysis of multivariate data. The aim is twofold. First, to provide future data scientists with the mathematical knowledge and understanding needed to make sound judgments when selecting methods and algorithms, to produce sensible interpretations of the results they generate, and to grasp their strengths and weaknesses. Second, to enhance the students' ability to apply the statistical approach to data analysis, by emphasizing real-life and engineering applications in which statistical methods and modern computer packages make it possible to extract information from multivariate data through complex statistical analyses.

The course is offered in two versions, worth 8 CFU and 10 CFU respectively. Both versions are named Applied Statistics and are described in the present document. The 8 CFU version covers classical topics of Multivariate Data Analysis (e.g. dimension reduction, inference for regression, supervised and unsupervised classification), with introductions to the modern treatment of multiple inference (False Discovery Rate, FDR) and to nonparametric methods for inference (permutation tests). To these contents, the 10 CFU version adds two modules of Advances in Statistical Learning, drawing on seminars given by experts in the respective fields (e.g. Functional Data Analysis, Spatial Statistics).

Through extensive use of blended learning and flipped-classroom teaching, both during the lectures and the lab sessions, the course will enhance the students' ability to learn autonomously; this will also be supported by new teaching media such as the MOOC on Statistical Learning by Hastie and Tibshirani referenced in the Bibliography.

Particular emphasis is given to the development of a responsible and efficient attitude toward work in a multidisciplinary environment, where the data scientist collaborates on a peer-to-peer basis with other engineers and scientists to model and understand complex phenomena of scientific or industrial interest. Hence, the course requires team work on statistical analyses of real data, either in interaction with industrial partners of the statistical research group of the Department of Mathematics or in collaboration with scientists and engineers who need a data-driven approach to their research problem. The teams' work in progress will be periodically discussed in class, drawing attention not only to the appropriateness of the statistical methods and algorithms, but also to the communication skills of the teams when informing an audience of peers.

The course fits into the overall program curriculum pursuing some of the defined general learning goals. In particular, the course contributes to the development of the following capabilities:

Design models and utilize information from experimental data for inferential estimation, verification and adaptation of the models and for their heuristic developments;

Provide the statistical interpretation and simulation of scenarios for the treatment of data in situations of high complexity;

Use R for the analysis and treatment of complex data.

Expected learning outcomes

Both versions of the course have the following common learning goals:

- At the end of the course students are expected to be able to design - and run with R - data analyses aimed at: reducing the dimensionality of a dataset; solving a classification problem, whether supervised or unsupervised; making multiple inference on several mean vectors or their components; choosing the most appropriate regression model and fitting it while making inference on its components; handling classical (OLS) or more modern approaches to model building and selection (ridge regression and lasso). Drawing on their engineering mindset and on the data analysis skills acquired in the course, students are expected to be able to evaluate the practical and statistical significance of the final result of their data analysis, to quantify its uncertainty analytically or through resampling, e.g. by cross-validation, and to diagnose its potential shortcomings, both when it is used to provide an empirical explanation of the industrial or scientific problem under study and when the main goal of the analysis is to formulate predictions.

- To prepare for responsible and efficient interactions in a working environment, every student is required to take part in a real data analysis project developed by an independently formed team of 2-4 members. The work in progress of the projects will be collectively discussed during general meetings scheduled throughout the course; final analyses and results will be presented in a workshop that will take place at the end of the course.

- Students are expected to be able to communicate and transfer the results and interpretations of their data analysis to a wider audience of experts, using effective graphical or analytical means, suitable for the problem under study, the findings of the analysis and, finally, the audience.

Moreover, the students of the 10 CFU version of the course are expected to have a basic understanding of the more complex issues - and of the methods and algorithms developed to solve them - that characterize the statistical analysis of data when spatial dependence matters or when data do not belong to the finite-dimensional Euclidean setting considered in the classical Multivariate Data Analysis framework, as in the case of functional data.

Topics covered

The topics covered by the 8 CFU version of the course are the following:

1) Exploring a multivariate dataset: descriptive statistics and graphical displays. The geometry of a multivariate sample. Generalized Variance. The metric induced by the covariance matrix.

2) Data representation and dimensional reduction: the analysis of the covariance structure, principal component analysis. Independent component analysis.

3) Inferences about a mean vector: Hotelling's T^2 test. Confidence regions and simultaneous comparisons of component means. The Bonferroni method for multiple comparisons. Familywise Error Rate and False Discovery Rate. Comparisons of several multivariate means. ANOVA and MANOVA. Inference for Linear Models. Beyond Ordinary Least Squares: ridge regression, lasso, regularized least squares. Cross-validation and model selection.

4) Discrimination, classification, clustering. Statistical classification: model, misclassification costs and prior probability. Bayesian supervised classification and the Fisher approach to discriminant analysis. Alternative approaches to classification: support vector machines and CARTs. Similarity measures. Unsupervised classification; hierarchical and nonhierarchical methods. Multidimensional scaling.

In the 10 CFU version of the course, the above topics are complemented with the following two modules of Advances in Statistical Learning:

5) Advances in Statistical Learning: Functional Data Analysis. Data smoothing, dimensional reduction and representation. Functional principal component analysis. Data registration: phase and amplitude variability. Classification of functional data.

6) Advances in Statistical Learning: Spatial Data. Random fields, variogram models and variogram fitting. Spatial prediction and Kriging. Functional data with spatial dependence.

In both versions of the course, methods and algorithms will be illustrated in the lab sessions through applications to real data sets; analyses will be performed in R, an open-source environment for statistical analysis, downloadable at www.r-project.org.

Throughout the course, students are required to work in teams on a real data analysis project whose progress will be presented periodically to the class.

Prerequisites

Basic courses in Probability and Statistics for Engineering are a required prerequisite. A course in Inferential Statistics, at least at the introductory level, is a suggested prerequisite.

Assessment

The exam consists of two parts:

(a) A written exam. The written exam is made up of a few data analysis problems to be individually solved with R; usually four for the students following the 8 CFU version of the course, five - with extra time - for those registered in the 10 CFU version. For the students of the 10 CFU version of the course, the extra problem will be related to the topics treated in the modules characterizing their additional 2 CFU. During the exam, all students are allowed to use a personal computer as well as to consult books, personal notes etc.

In the written report the student must show the ability to conduct a stylized data analysis: selecting the appropriate methods and algorithms - among those introduced in the course - for solving the problems, running the algorithms with R, identifying the significant results, and reporting them with the precision and appropriateness of language that characterize technical and scientific communication.

(b) Team project evaluation. Projects will be collectively evaluated by the teachers of the course and by the students participating in a final workshop at the end of the course.

During the presentation of their projects, teams must prove their ability to conduct and report a real-life statistical data analysis, showing knowledge and understanding of the problem at hand and of the nature of the data, proper judgment in selecting the appropriate methods and algorithms, sensible interpretations of the generated results - grasping not only their strengths but also their weaknesses - and, finally, communication skills when informing an audience of peers.

To pass the exam, the student must pass each part with a score greater than or equal to 18/30. The final score is then obtained as the weighted average of the two scores, with weights equal to 0.6 for the written exam and 0.4 for the project evaluation.
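As an illustration of the grading rule, the following is a minimal sketch in Python (the course itself uses R; the language, the function name final_score and the constant names are assumptions made for this example only). It assumes scores on a 30-point scale and applies no rounding, since the rule above does not specify any.

```python
from typing import Optional

# Values taken from the grading rule; the names are illustrative only.
PASS_MARK = 18.0        # minimum score required on each part (out of 30)
WRITTEN_WEIGHT = 0.6    # weight of the written exam
PROJECT_WEIGHT = 0.4    # weight of the team project evaluation


def final_score(written: float, project: float) -> Optional[float]:
    """Return the weighted final score, or None if either part is failed.

    Assumption: no rounding is applied, because the syllabus does not
    state a rounding convention.
    """
    if written < PASS_MARK or project < PASS_MARK:
        return None  # at least one part is below 18/30: exam not passed
    return WRITTEN_WEIGHT * written + PROJECT_WEIGHT * project
```

For example, final_score(24, 30) combines a written score of 24 and a project score of 30 into 0.6 * 24 + 0.4 * 30, while final_score(17, 30) returns None because the written part is below 18/30.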

Bibliography

Johnson, R.A. and Wichern, D.W., Applied Multivariate Statistical Analysis (sixth edition), Publisher: Prentice Hall, Year: 2007
Statistical Learning MOOC by Hastie and Tibshirani http://www.r-bloggers.com/in-depth-introduction-to-machine-learning-in-15-hours-of-expert-videos/
James, G., Witten, D., Hastie, T. and Tibshirani, R., An Introduction to Statistical Learning: with Applications in R, Publisher: Springer, Year: 2013
Ramsay, J.O. and Silverman, B.W., Functional Data Analysis (second edition), Publisher: Springer, Year: 2005
Cressie, N., Statistics for Spatial Data (revised edition), Publisher: Wiley, Year: 1993
