Big Data Management (Technical)
Abstract
This course addresses the technical issues that emerge during the big data life cycle, including collection, management, processing, and analytics. We discuss modern approaches to organising and reasoning about large, fast-growing, and diverse datasets. We cover the principles of big data analysis and illustrate a hands-on approach to big data modelling and management.
Big data is nowadays considered an asset that affects every aspect of our lives. Recent developments in sensor technologies and in the approaches that capture online user activities have significantly increased the amount of data that enterprises can retain, manage, and analyse. Managing and analysing this collected data can create valuable opportunities. However, it also introduces several new challenges, mainly because new systems are required that are capable of processing such large volumes of data. A few years ago, most data could be extracted and loaded into a centralised database on a single server, where it could be analysed offline. Today, traditional database systems would fail to manage this data. Analysing big data, possibly in real time, is a key challenge for many organisations, institutions, and governments that need to understand and adapt quickly to changing conditions. For example, a hospital could combine GPS data about the actual location of its ambulances and helicopters with data about the missions these vehicles are involved in, as well as emergency calls and the current status of various emergency rooms, in order to make decisions in real time when faced with an emergency call (including during large-scale disasters).
Big data management denotes the processes involved in making data from various data sources available for advanced analytics. There is no longer a single approach that fits all data management problems. For each problem, IT specialists have to decide on appropriate models and large-scale data analysis systems to handle the relevant data.
This course addresses the technical issues that emerge during the collection, management, processing, and analytics of large-scale data. In this course we introduce modern approaches to organising and analysing large, fast-growing, and diverse datasets. We will cover the characteristics and principles of big data analysis and the platforms and tools that are capable of managing big data. Students will be introduced to the technical skills necessary for assessing current approaches to big data management and analytics, and will acquire hands-on experience using these technologies.
The main objectives of the course are to learn about the following:
- Parallel and distributed computing platforms.
- Writing analytics tasks (algorithms and queries) for these scalable platforms.
- Running analytics tasks on large clusters of machines.
- Understanding how dividing a large job into parallel tasks can reduce the execution time of that job (i.e. improve its performance).
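As a small illustration of the last objective, the sketch below divides one job (counting words) into independent tasks, runs one task per chunk of the input, and merges the partial results. It is not course material: the function names (`split_into_chunks`, `count_words`, `parallel_word_count`) are hypothetical and chosen for this example only, and it uses Python's standard library rather than a cluster platform. Threads in standard Python rarely speed up CPU-bound work, so the point here is the split, process, and merge structure; on a platform such as Spark, each chunk would typically be processed on a different machine.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(items, num_chunks):
    """Split a list into num_chunks contiguous slices of near-equal size."""
    size, remainder = divmod(len(items), num_chunks)
    chunks, start = [], 0
    for i in range(num_chunks):
        # The first `remainder` chunks get one extra item each.
        end = start + size + (1 if i < remainder else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def count_words(lines):
    """Task run on a single chunk: count the words in its lines."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def parallel_word_count(lines, num_tasks=4):
    """Run one independent counting task per chunk, then merge the results."""
    chunks = split_into_chunks(lines, num_tasks)
    with ThreadPoolExecutor(max_workers=num_tasks) as pool:
        partial_counts = list(pool.map(count_words, chunks))
    total = Counter()
    for partial in partial_counts:
        total.update(partial)  # Counter.update adds counts together
    return total
```

Because each task only reads its own chunk, the tasks do not need to coordinate until the final merge, which is what makes this kind of job straightforward to distribute.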
Along the way, students are expected to (1) learn programming languages that enable them to write analytics applications (such as Python and Scala); (2) understand and write distributed applications for distributed computing platforms such as Spark; and (3) acquire experience running their code on public clouds such as AWS.
This course assumes a basic computer science and programming background. Participants are required to have taken the introductory programming courses and a data modelling course (Introduction to Database Design) for the Software Development or Software Design study programmes.
Moreover, the student must always meet the admission requirements of the IT University.
Intended learning outcomes
After the course, the student should be able to:
- Identify and explain the main principles and theoretical concepts of big data management systems.
- Analyse and discuss the characteristics and societal issues of data exploration and analysis with large, fast-growing, and diverse datasets.
- Reflect upon the relative merits of distributed computing platforms in the context of big data management.
- Use distributed computing platforms to implement end-to-end solutions for real-world analytics problems.
- Design, conduct, and report results of experiments using the developed applications in a distributed setting over a cluster of machines.
Ordinary exam
Exam type:
C: Submission of written work, External (7-point scale)
C1G: Submission of written work for groups