IT-Universitetet i København
 
Course description
Course name (Danish): Introduction to Data Science
Course name (English): Introduction to Data Science
Semester: Spring 2017
Offered under: BSc in Software Development (bswu)
ECTS credits: 7.50
Course language: English
Course homepage: https://learnit.itu.dk
Min. number of participants: 15
Expected number of participants:
Max. number of participants: 25
Formal prerequisites: The course assumes familiarity with programming (for example obtained in the course Introduction to Programming), basic probability theory (for example obtained in the course Discrete Mathematics), and database systems (for example obtained in the course Introduction to Databases).
Learning objectives: Upon completion of the course, students should be able to:
- Describe and debate the need and motivation for data science.
- Explain the CAP theorem and its implications for existing consistency models.
- Enumerate and discuss key partition management techniques, data models and data management systems available for handling large datasets.
- Manipulate and visualize data in R.
- Perform modeling and model validation in R.
- Use supervised and unsupervised learning methods in R.
- Document a reproducible data analysis process.
- Manipulate and process data in Spark. 
Course content: Throughout history, the amount of data produced and stored has been growing exponentially, but it is only in recent years that this exponential growth has really come to the fore. Data science is emerging as a new discipline at the intersection of statistics and computing. It deals with the challenges of processing and deriving insight from data.
The course is organized in two parts: 1) Introduction to statistical programming, and 2) big data management.

Statistical programming is at the heart of data science. It deals with describing, in a programming language, models that can be used to describe data and fitting these models to observational data. In turn, these models can lead to visualizations or be used for predictions. We will cover elementary statistical modeling and its application in the R programming language, for example:
- Data manipulation
- Modeling and hypothesis testing
- Application of learning methods (e.g. regression analysis, clustering)
- Visualization
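
As a rough illustration of what "fitting a model to observational data" and "using it for predictions" means, the following minimal sketch fits a straight line by ordinary least squares. It is not course material: the course works in R, Python is used here only for illustration, and the function names `fit_line` and `predict` as well as the data values are hypothetical.

```python
# Illustrative sketch only; the course itself uses R.
# fit_line, predict, and the sample data are assumptions made for this example.

def fit_line(xs, ys):
    """Fit y = intercept + slope * x by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of cross deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def predict(model, x):
    """Use a fitted (intercept, slope) pair to predict y for a new x."""
    intercept, slope = model
    return intercept + slope * x


# Made-up observations that lie roughly on a line.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

model = fit_line(xs, ys)
print("intercept, slope:", model)
print("prediction at x = 6:", predict(model, 6.0))
```

The same pattern (describe a model, fit it to data, then predict or visualize) carries over to the regression tools covered in the course.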

The methods in the first part of the course are suitable only for data sets that are not too large. According to Wikipedia, big data is an “all-encompassing term for any collection of data sets so large and complex that it becomes difficult to process them using traditional data processing applications.” The management and analysis of such data sets lead to significant technical, administrative and ethical challenges, but also significant opportunities.

The second part of the course will explore big data concepts and techniques. Among the topics covered in the course are:
- The three Vs (Volume, Variety, Velocity) and further Vs: Vision, Verification, Validation, Value.
- Data models, noSQL systems and associated data management techniques.
- The CAP-theorem and its implications.
- Data analysis in Spark. 
Learning activities: 14 weeks of teaching consisting of lectures and exercises

The course consists of lectures and hands-on exercises. There will be 4 mandatory hand-ins, carried out in groups, whose solutions involve all learning objectives. 

Mandatory activities: There will be 4 mandatory hand-ins, 2 for each part, that must be completed to qualify for the exam.
If a hand-in is not approved the first time, the student will be allowed to resubmit.
Deadlines for submission and re-submission of the hand-ins will be published on LearnIT.
Be aware: If the mandatory activities are not approved, the student will receive the grade NA (not approved) at the ordinary exam and will have used an exam attempt.

Exam form and description: A11: Written on-site exam with access to the internet and to written and printed aids (7-point scale, external exam)

tba  

Literature beyond research articles: Thomas Mailund. Introduction to Data Science and Statistical Programming in R. https://leanpub.com/datascience_and_R

Lecture notes on elementary statistical concepts

Optional reading: Jules J. Berman. Principles of Big Data: Preparing, sharing, and analyzing complex information. Morgan Kaufmann, 2013.