Data in the Wild: Wrangling and Visualising Data (Autumn 2024)


Official course description, subject to change:

Basic info last published 15/03-24

Course info

Language:

English

ECTS points:

7.5

Course code:

KSDWWVD1KU

Participants max:

Offered to guest students:

Offered to exchange students:

Offered as a single subject:

Programme

Level:

MSc. Master

Programme:

MSc in Data Science

Staff

Course manager

Veronika Cheplygina

Associate Professor

Teacher

Luca Maria Aiello

Associate Professor, Head of study programme

Course semester

Semester

Autumn 2024

Start

26 August 2024

End

24 January 2025

Abstract

This course introduces students to the foundations of handling heterogeneous data sources through the steps of data collection, annotation, processing, cleaning, integration, transformation, and visualization. Ethical issues and dataset bias are also discussed.

Description

This course teaches how to design, implement and combine a suite of techniques for producing high-quality datasets, starting from the collection of raw data in the wild. The student taking this course will learn how to:

collect and integrate heterogeneous data from various sources (including open data, unstructured data, web scraping, proprietary APIs)
annotate it with appropriate metadata (e.g., using crowdsourcing)
clean it and transform it to satisfy given quality indicators (e.g., deduplication, anonymization, normalization)
iterate over these steps to meet the requirements of specific data science and machine learning problems
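
As an illustration of the cleaning and transformation step listed above, here is a minimal Python sketch using only the standard library. It is not course material: the record fields (`email`, `age`), the helper names, and the salt value are hypothetical, and the salted hash shown is pseudonymization, which is weaker than full anonymization.

```python
import hashlib


def pseudonymize(value: str, salt: str = "example-salt") -> str:
    """Replace an identifier with a truncated salted SHA-256 digest.

    Assumption: this is pseudonymization only; it does not by itself
    guarantee that individuals cannot be re-identified.
    """
    digest = hashlib.sha256((salt + value).encode("utf-8")).hexdigest()
    return digest[:12]


def min_max_normalize(values):
    """Rescale numeric values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are equal, so there is no spread to rescale.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def clean(records):
    """Deduplicate, pseudonymize, and normalize a list of record dicts."""
    seen = set()
    unique = []
    for record in records:
        # Treat records as duplicates when the canonicalised email and age match.
        email = record["email"].strip().lower()
        key = (email, record["age"])
        if key in seen:
            continue
        seen.add(key)
        unique.append({"email": email, "age": record["age"]})

    scaled_ages = min_max_normalize([r["age"] for r in unique])
    return [
        {"user_id": pseudonymize(r["email"]), "age_scaled": age}
        for r, age in zip(unique, scaled_ages)
    ]


# Hypothetical raw records; the second is a near-duplicate of the first.
raw = [
    {"email": "ada@example.org", "age": 30},
    {"email": "Ada@Example.org ", "age": 30},
    {"email": "bob@example.org", "age": 50},
    {"email": "cy@example.org", "age": 40},
]
cleaned = clean(raw)
```

In practice the duplicate key, the normalization scheme, and the privacy technique would each depend on the quality indicators set for the dataset at hand.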

The course will also teach how to identify and address some ethical issues arising from data collection and handling, including possible implications of data biases. In parallel, the student will also gain experience with discussing, presenting, and visualizing key aspects that can document and inform the data processing steps and effectively describe the final datasets produced.

Formal prerequisites

There are no formal prerequisites for this course for students in the associated MSc programme.

Some experience with Python and basic statistics will be helpful.

Intended learning outcomes

After the course, the student should be able to:

Describe different data collection/annotation/visualization methods with regard to their strengths and weaknesses
Apply appropriate data collection/annotation/visualization methods in order to create novel datasets
Find suitable connections between dataset properties, analysis methods, and research questions
Extract insights from the data analysis and present the results with appropriate visualization and written reporting
Discuss the findings with respect to relevant work from the literature, and reflect on their real-world implications

Ordinary exam

Exam type:
D: Submission of written work with following oral, External (7-point scale)
Exam variation:
D1G: Submission for groups with following oral exam based on the submission. Shared responsibility for the report.