The course aims to train the students in conducting a thorough and valid analysis of online data sources with the use of basic programming, statistics and business intelligence tools.
The course is based on the assumption that we live in a world where the amount of data grows rapidly and where the data also exposes more information. To keep up with this development, we need tools and techniques that dramatically speed up our interaction with increasingly complex data sources.
The course aims to provide the students with tools to quickly answer meaningful questions about data by 1) minimising the amount of time it takes to arrive at the answer and 2) maximising the relevance of the answer.
The course is structured in four blocks:
- Automating data collection tasks with Python
- Retrieving large amounts of data from online sources
- Analysing data with basic statistics and business intelligence
- Consolidating analysis validity for real-world problems
The first two blocks establish programming tools for fetching and preprocessing large amounts of data from large, modern data sources, for instance Twitter’s API, WTO data sources and Danmarks Statistik. The last two blocks focus on extracting knowledge from the data while ensuring that the knowledge is meaningful and relevant (valid).
- Knowledge about fundamental Python programming
- Knowledge about database design and interaction
- Knowledge about basic scientific theory
Intended learning outcomes
After the course, the student should be able to:
- Write a Python program that extracts information from common data formats
- Write a Python program that visually presents structured data
- Discuss how to present information and findings using Python
- Explain techniques for processing data in Python, given the size and format of the data
- Explain the difference between databases in memory, on disk and distributed
- Write a Python program that interacts with HTTP APIs using simple authentication methods
- Account for basic statistical measures and regression models
- Explain the difference between precision, recall and accuracy
- Discuss how sample populations relate to real-world populations
- Reason about and describe a falsifiable question that can be addressed with a specific data source
- Provide data-driven answers to falsifiable questions using statistical measures and regression models
- Discuss the validity of analytical conclusions based on the method and data
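As an illustration of the outcome concerning precision, recall and accuracy, the following is a minimal sketch in plain Python. The function name `classification_metrics` and the example labels are assumptions made for this illustration and are not taken from the course material. It counts true and false positives and negatives for binary labels and derives the three measures from those counts.

```python
def classification_metrics(y_true, y_pred):
    """Return (precision, recall, accuracy) for binary labels, where 1 = positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives

    # Precision: share of predicted positives that are actually positive.
    precision = tp / (tp + fp) if tp + fp else 0.0
    # Recall: share of actual positives that were predicted positive.
    recall = tp / (tp + fn) if tp + fn else 0.0
    # Accuracy: share of all predictions that are correct.
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return precision, recall, accuracy


# Hypothetical example data: 4 actual positives, 6 actual negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

precision, recall, accuracy = classification_metrics(y_true, y_pred)
print(f"precision={precision:.2f} recall={recall:.2f} accuracy={accuracy:.2f}")
```

In this made-up example the three measures differ (precision 2/3, recall 1/2, accuracy 7/10), which shows why they are not interchangeable when judging an analysis.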
April 24th, 2020: Exam changed due to the Covid-19 situation and the change to online exams.
The course will mainly consist of lectures, group work, and project work with a focus on active student participation and practical application of data handling techniques.
- Data Science from Scratch, Joel Grus, O’Reilly, 2019
– Book: https://www.amazon.com/Data-Science-Scratch-Principles-Python/dp/1492041130
– Code: https://github.com/joelgrus/data-science-from-scratch/
Ordinary exam
Exam type:
C: Submission of written work
C22: Submission of written work – Take home
Time and date
Ordinary Exam - hand out Tue, 19 May 2020, 09:00 - 19:00
Ordinary Exam - submission Wed, 20 May 2020, 08:00 - 14:00
Reexam - hand out Sun, 16 Aug 2020, 09:00 - 21:00
Reexam - submission Mon, 17 Aug 2020, 08:00 - 14:00