Summary Sheet
Academic Year 2019/2020
Appointment type: Doctorate
Course: 055067 - DATA MANAGEMENT FOR LARGE-SCALE ANALYTICS
Instructor: Brambilla Marco
Credits (CFU): 5.00
Course type: Single-discipline

Doctoral programme: MI (1387) - DATA ANALYTICS AND DECISION SCIENCES
From (inclusive): A
To (exclusive): ZZZZ
Course: 055067 - DATA MANAGEMENT FOR LARGE-SCALE ANALYTICS

Detailed programme and expected learning outcomes

Large-scale data analytics is everywhere, and researchers from all disciplines are addressing the topic from their own perspective, creating excellent vertical experiments but often losing the wider picture. This course aims to provide the principles, practices and technologies that enable large-scale data analytics, and thus to foster practice and academic debate around data science.

The course is divided into three parts. The first part provides an introduction to big data and a transversal view of the grand challenges to which big data can contribute. The second part presents the main paradigms and techniques for data modeling and management. The third part teaches how to tame volume, variety, velocity, and veracity in practice. The whole course focuses on gaining practical understanding of, and experience with, the technologies, with special attention to Python and R as programming languages and to Spark as the cloud data management infrastructure.

 

Part 1: INTRO. Grand challenges of Data Analytics
- Introduction to large-scale analytics
- Opportunities for social, environmental and economic problems
- Problems of current research in big data and data science
- Data access and quality issues


Part 2: DATA. Data models and their implementations
- Traditional ER and relational data models, SQL
- Transactional and active databases
- NoSQL data models: document, graph, column-based and key-value models
- NoSQL platforms and technologies
- Main memory large-scale databases
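
As a rough illustration of the data models listed in Part 2, the sketch below represents the same small dataset under the relational, document, key-value, column-based and graph models. It is only a minimal sketch: the student and enrolment records are invented for illustration, and plain in-memory Python structures (plus the standard-library sqlite3 module) stand in for real database platforms.

```python
import json
import sqlite3

# Hypothetical sample data, invented for illustration only.
students = [(1, "Ada"), (2, "Alan")]
enrolments = [(1, "055067"), (2, "055067")]

# Relational model: normalized tables queried with SQL (in-memory SQLite).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE enrolment (student_id INTEGER, course TEXT)")
conn.executemany("INSERT INTO student VALUES (?, ?)", students)
conn.executemany("INSERT INTO enrolment VALUES (?, ?)", enrolments)
relational = conn.execute(
    "SELECT s.name FROM student s JOIN enrolment e ON s.id = e.student_id "
    "WHERE e.course = ? ORDER BY s.name",
    ("055067",),
).fetchall()

# Document model: one self-contained nested document per student.
documents = [
    {"_id": sid, "name": name,
     "courses": [c for s, c in enrolments if s == sid]}
    for sid, name in students
]

# Key-value model: opaque serialized values addressed by a single key.
key_value = {f"student:{d['_id']}": json.dumps(d) for d in documents}

# Column-based model: values grouped by column rather than by row.
columns = {
    "id": [sid for sid, _ in students],
    "name": [name for _, name in students],
}

# Graph model: nodes connected by labelled edges.
edges = [(f"student:{s}", "ENROLLED_IN", f"course:{c}") for s, c in enrolments]
```

The relational form needs a join to answer "who is enrolled in this course", while the document form stores the answer next to each student at the cost of duplicating course identifiers. Trade-offs of this kind are what the choice of data model comes down to.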


Part 3: FEATURES. Taming data volume, velocity, variety, and veracity
- Volume: Scaling computation and storage horizontally
- MapReduce: from Apache Hadoop to Apache Spark and Apache Flink
- Velocity: information flow processing principles, approaches and tools
- Hands-on Apache Spark to tame volume and velocity in data analytics
- Veracity: data quality and data wrangling
- Variety: web data extraction and data integration
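
The MapReduce paradigm named in Part 3 can be sketched in a few lines of plain Python. This is a single-process illustration of the map, shuffle and reduce phases under invented input, not the Hadoop, Spark or Flink API. The function names are hypothetical, and a real platform would distribute each phase across many machines.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence on an iterable of text lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


# Hypothetical input, invented for illustration.
lines = ["big data big analytics", "data management"]
counts = word_count(lines)
```

Because each call to map_phase depends on a single line and each call to reduce_phase on a single key, both phases can run in parallel on separate partitions. That independence is what allows computation to scale horizontally.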


Notes on the assessment method

Students will be required to build a research case, identifying business value, data and methods, using the tools to analyze and visualize data, critically analyzing pitfalls, and highlighting their contributions.

The evaluation will be based on a concrete implementation of a case proposed by the instructors, where students will be asked to implement the data management phases discussed in class on a practical example, using cloud-based large-scale data management platforms and technologies.


Teaching period
Start date
End date

Textual calendar of teaching activities

Classes will be held twice a week in 3-hour modules, starting at the beginning of February 2020.

A detailed calendar will follow.


Bibliography
Required reading: Martin Kleppmann, Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems, Publisher: O'Reilly, Edition year: 2017, ISBN: 978-1449373320

Mix of teaching formats
Teaching format            Teaching hours
lecture                    12.0
exercise session           12.0
computer laboratory        0.0
experimental laboratory    0.0
project                    6.0
project laboratory         0.0

Information in English to support internationalisation
Course taught in English
Teaching material/slides available in English
Textbooks/bibliography available in English
Exam can be taken in English
Teaching support available in English

Instructor notes
17/02/2020