Official course description:

Full info last published 31/07-19
Course info
Language:
English
ECTS points:
7.5
Course code:
KSBIDMT1KU
Offered to guest students:
yes
Offered to exchange students:
Offered as a single subject:
yes
Price for EU/EEA citizens (Single Subject):
10625 DKK
Programme
Level:
MSc. Master
Programme:
MSc in Software Design
Staff
Course manager
Postdoc
Course semester
Semester
Autumn 2019
Start
26 August 2019
End
31 January 2020
Exam
Abstract

This course addresses the technical issues that emerge during the big data life cycle, including collection, management, processing, and analytics. We discuss modern approaches to organizing and reasoning about large, fast-growing, and diverse datasets. We cover the principles of big data analysis and illustrate a hands-on approach to big data modeling and management.

Description

Big data is now considered an asset that affects every aspect of our lives. Recent developments in sensor technologies and in approaches that capture online user activity have significantly increased the volume of data that enterprises can retain, manage, and analyze. Managing and analyzing this collected data creates valuable opportunities. However, it also introduces several new challenges, mainly because new systems are required that are capable of processing such large datasets. A few years ago, most data could be extracted and loaded into a centralized database on a single server, where it could be analyzed offline. Today, traditional database systems would fail to manage this data. Analyzing big data, possibly in real time, is a key challenge for many organizations, institutions, and governments that need to understand and adapt quickly to changing conditions. For example, a hospital could combine GPS data about the actual location of its ambulances and helicopters with data about the missions these vehicles are involved in, as well as emergency calls and the current status of various emergency rooms, in order to make decisions in real time when faced with an emergency call (including during large-scale disasters).


Big data management denotes the processes involved in making data from various data sources available for advanced analytics. There is no longer one approach that fits all data management problems. For each problem, IT specialists have to decide on appropriate models and large-scale data analysis systems to handle the relevant data.

This course addresses the technical issues that emerge during the collection, management, processing, and analytics of large-scale data. We introduce modern approaches to organizing and analyzing large, fast-growing, and diverse datasets. We will cover the characteristics and principles of big data analysis and the platforms and tools that are capable of managing big data. Students will be introduced to the technical skills necessary to assess current approaches to big data management and analytics and will acquire hands-on experience using these technologies.


The main objectives of the course are to learn about the following:

  • Parallel and distributed computing platforms.
  • Writing analytics tasks (algorithms and queries) for these scalable platforms.
  • Running analytics tasks on large clusters of machines.
  • Understanding how dividing a large job into parallel tasks can reduce the execution time of that job (i.e., improve its performance).
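
As a rough illustration of the last objective, the sketch below divides a word-count job into chunks, processes the chunks concurrently, and merges the partial results. The function names (`split_into_chunks`, `count_words`, `parallel_word_count`) and the word-count task itself are assumptions chosen for illustration, not course material. Threads from the Python standard library are used only to keep the example self-contained; on a single machine this shows the structure of the computation rather than a guaranteed speedup, whereas platforms such as Spark distribute comparable map and merge steps over a cluster.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(items, num_chunks):
    """Split a list into at most num_chunks contiguous, roughly equal parts."""
    size = max(1, -(-len(items) // num_chunks))  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]


def count_words(lines):
    """Map step: count the words in one chunk of lines."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def parallel_word_count(lines, num_workers=4):
    """Count words per chunk concurrently, then merge the partial counts."""
    chunks = split_into_chunks(lines, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partial_counts = list(pool.map(count_words, chunks))
    total = Counter()
    for partial in partial_counts:  # reduce step: merge partial results
        total.update(partial)
    return total
```

Calling `parallel_word_count(["big data big", "data systems"], num_workers=2)` processes each line in its own task and returns the combined counts. Each chunk is counted independently, so the only coordination needed is the final merge.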

Along the way, students are expected to (1) learn programming languages that enable them to write analytics applications (such as Python and Scala); (2) understand and write distributed applications for distributed computing platforms such as Spark; and (3) acquire experience running their code on public clouds such as AWS.

Formal prerequisites

This course assumes a basic computer science and programming background. It requires that participants have taken the introductory programming courses and a data modeling course (Introduction to Database Design) from the Software Development or Software Design study programmes.
Moreover, the student must always meet the admission requirements of the IT University.

Intended learning outcomes

After the course, the student should be able to:

  • Identify and explain the main principles and theoretical concepts of big data management systems.
  • Analyze and discuss the characteristics and societal issues of data exploration and analysis with large, fast-growing, and diverse datasets.
  • Reflect upon the relative merits of distributed computing platforms in the context of big data management.
  • Use distributed computing platforms to implement end-to-end solutions for real-world analytics problems.
  • Design, conduct, and report results of experiments using the developed applications in a distributed setting over a cluster of machines.
Learning activities

The course will be based on lectures, exercises, and practical projects. The students will implement applications using several data management systems to analyze a range of different datasets in the exercises and projects.

  • Lectures will cover the basic concepts and the theoretical background of several parallel and distributed systems.

  • Exercises will provide the students with hands-on activities that use parallel and distributed systems for managing and analyzing big data.

  • Three practical projects, in which the students will (1) design solutions for applications that analyze big data using the distributed systems they have studied in class and in the exercises, (2) implement these solutions, and (3) design and conduct experiments that evaluate the performance of the developed applications. Students submit reports for each of these projects throughout the course for feedback. The final versions of the project reports form the portfolio, which is submitted as part of the final exam.


Course literature

Selected Chapters and Sections from the following list:
[1] Spark Machine Learning Library (MLlib) Guide. Available at: https://spark.apache.org/docs/latest/ml-guide.html.
[2] Spark RDD Programming Guide. Available at: https://spark.apache.org/docs/latest/rdd-programming-guide.html.
[3] Spark SQL, DataFrames and Datasets Guide. Available at: https://spark.apache.org/docs/latest/sql-programming-guide.html.
[4] D. Abadi. Consistency tradeoffs in modern distributed database system design: CAP is only part of the story. Computer, 45(2):37–42, 2012.
[5] M. Armbrust, T. Das, J. Torres, B. Yavuz, S. Zhu, R. Xin, A. Ghodsi, I. Stoica, and M. Zaharia. Structured streaming: A declarative API for real-time applications in Apache Spark. In Proc. ACM Int. Conf. on Management of Data (SIGMOD), pages 601–613. ACM, 2018.
[6] M. Armbrust, R. S. Xin, C. Lian, Y. Huai, D. Liu, J. K. Bradley, X. Meng, T. Kaftan, M. J. Franklin, A. Ghodsi, et al. Spark SQL: Relational data processing in Spark. In Proc. ACM Int. Conf. on Management of Data (SIGMOD), pages 1383–1394. ACM, 2015.
[7] J. Baker, C. Bond, J. C. Corbett, J. Furman, A. Khorlin, J. Larson, J.-M. Leon, Y. Li, A. Lloyd, and V. Yushprakh. Megastore: Providing scalable, highly available storage for interactive services. 2011.
[8] J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. In Proc. USENIX Conf. on Operating Systems Design and Implementation (OSDI), pages 137–150, 2004.
[9] W. Fan and F. Geerts. Foundations of data quality management. Synthesis Lectures on Data Management, 4(5):1–217, 2012.
[10] S. Ghemawat, H. Gobioff, and S.-T. Leung. The Google file system. In Proc. ACM Symp. on Operating Systems Principles (SOSP), pages 29–43, 2003.

[11] C. S. Horstmann. Scala for the Impatient. Pearson Education, 2012.

[12] C. Kacfah Emani, N. Cullot, and C. Nicolle. Understandable big data. Comput. Sci. Rev., 17(C):70–81, Aug. 2015.
[13] J. Kreps, N. Narkhede, J. Rao, et al. Kafka: A distributed messaging system for log processing. In Proceedings of the NetDB, pages 1–7, 2011.
[14] N. Marz and J. Warren. Big Data: Principles and best practices of scalable real-time data systems. New York; Manning Publications Co., 2015.
[15] X. Meng, J. Bradley, B. Yavuz, E. Sparks, S. Venkataraman, D. Liu, J. Freeman, D. Tsai, M. Amde, S. Owen, et al. MLlib: Machine learning in Apache Spark. The Journal of Machine Learning Research, 17(1):1235–1241, 2016.
[16] M. Odersky, L. Spoon, and B. Venners. Programming in Scala. Artima Inc, 2008.
[17] D. Ongaro and J. Ousterhout. In search of an understandable consensus algorithm. In USENIX Annual Technical Conf. (ATC), pages 305–319, 2014.
[18] M. Scannapieco. Data Quality: Concepts, Methodologies and Techniques. Data-Centric Systems and Applications. Springer, 2006.
[19] S. Sivasubramanian. Amazon DynamoDB: A seamlessly scalable non-relational database service. In Proc. ACM Int. Conf. on Management of Data (SIGMOD), pages 729–730. ACM, 2012.
[20] A. Toshniwal, S. Taneja, A. Shukla, K. Ramasamy, J. M. Patel, S. Kulkarni, J. Jackson, K. Gade, M. Fu, J. Donham, et al. Storm@twitter. In Proc. ACM Int. Conf. on Management of Data (SIGMOD), pages 147–156. ACM, 2014.
[21] A. Verbitski, A. Gupta, D. Saha, M. Brahmadesam, K. Gupta, R. Mittal, S. Krishnamurthy, S. Maurice, T. Kharatishvili, and X. Bao. Amazon Aurora: Design considerations for high throughput cloud-native relational databases. In Proc. ACM Int. Conf. on Management of Data (SIGMOD), pages 1041–1052. ACM, 2017.
[22] M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M. J. Franklin, S. Shenker, and I. Stoica. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In USENIX Conf. on Networked Systems Design and Implementation (NSDI), pages 2–2. USENIX Association, 2012.
[23] M. Zaharia, T. Das, H. Li, T. Hunter, S. Shenker, and I. Stoica. Discretized streams: Fault-tolerant streaming computation at scale. In Proc. ACM Symp. on Operating Systems Principles (SOSP), pages 423–438. ACM, 2013.
[24] A. Y. Zomaya and S. Sakr. Handbook of big data technologies. Springer, 2017.

Ordinary exam
Exam type:
C: Submission of written work, external (7-point grading scale)
Exam variation:
C: Submission of written work
Exam description:

The examination consists of written work. The exam includes (a) an individually written exam report and (b) a project portfolio consisting of the group reports for the practical projects.



Reexam
Exam type:
Z. To be decided, - (-)

Time and date