Data Science in Production
Abstract
This course introduces the classes of tasks that are at the core of most real-world production systems. It teaches advanced solutions to these tasks on complex, large-scale data using state-of-the-art tools.
Description
At the core of most IT production systems there are algorithmic solutions to problems of ranking and matching. Solving these two fundamental tasks enables a wide variety of services: getting the best list of images for a search query, getting a recommendation for the best next song to listen to, finding new friends in online social media, and much more. In this course, we will introduce advanced concepts of Information Retrieval, Recommender Systems, and Computational Advertising, together with DevOps tools to deploy these services at scale.
In particular, the course will cover the following subjects:
- Information retrieval systems
  - Indexing large-scale data
  - Ranking and weighting for relevance
  - Search strategies
  - Learning to Rank
  - Grouping and detection of near duplicates
  - Elasticsearch
- Recommender systems
  - Content-based recommendations
  - Collaborative filtering
  - Dimensionality reduction
  - Matrix factorization for personalization
  - Multi-armed bandits
  - Link recommendation
- Computational Advertising
  - Advertisement auctions and bidding
  - Advertisement matching
  - A/B testing
- DevOps concepts and tools
  - Deployment
  - Orchestration
  - DevOps tools (e.g., Docker, Kubernetes)
- Metrics to evaluate the performance of ranking and matching systems
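To give a flavour of the last topic, the following is a minimal sketch of two common ranking metrics, precision@k and NDCG@k. It is an illustration only and not official course material: the function names, document identifiers, and relevance grades are hypothetical, and the NDCG variant shown uses linear gains with a log2 rank discount, which is one of several conventions in use.

```python
import math


def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k ranked items that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for item in ranked_ids[:k] if item in relevant_ids)
    return hits / k


def dcg_at_k(gains, k):
    """Discounted cumulative gain: the gain at rank i is divided by log2(i + 1)."""
    return sum(g / math.log2(rank + 1)
               for rank, g in enumerate(gains[:k], start=1))


def ndcg_at_k(ranked_ids, graded_relevance, k):
    """DCG of the ranking divided by the DCG of the ideal ordering."""
    gains = [graded_relevance.get(item, 0.0) for item in ranked_ids]
    ideal_gains = sorted(graded_relevance.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal_gains, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0


# Hypothetical example: a system returns four documents for a query;
# judges graded d1 as highly relevant (3), d2 as relevant (2), d3 as marginal (1).
ranking = ["d3", "d1", "d7", "d2"]
relevance = {"d1": 3.0, "d2": 2.0, "d3": 1.0}

p_at_3 = precision_at_k(ranking, set(relevance), k=3)
ndcg_at_3 = ndcg_at_k(ranking, relevance, k=3)
print(f"precision@3 = {p_at_3:.3f}, NDCG@3 = {ndcg_at_3:.3f}")
```

Precision@k only counts how many relevant items appear in the top k, whereas NDCG also rewards placing the most relevant items first, which is why the two can disagree on which of two rankings is better.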
Formal prerequisites
A solid background in Python programming, Linear Algebra, and fundamentals of machine learning is required.
Intended learning outcomes
After the course, the student should be able to:
- Design and implement a recommender system that satisfies given requirements
- Design and implement simple information retrieval systems
- Design and implement methods to extract structured information from linked data
- Discuss possible architectural solutions to address complex problems of ranking and matching
- Recommend the most appropriate techniques and metrics to evaluate the performance of a given production task
- Design and implement software for basic deployment and orchestration of services
Learning activities
The course will consist of lectures and hands-on coding practice, mostly in Python.
The students will be presented with tasks that are typical of IT production systems, and they will be asked to reflect on them and to propose possible solutions. These activities will be similar to those that the students will need to complete for their exam. The students will also have the opportunity to code some of the solutions they come up with.
After each lecture, the students will be invited to carry out complementary activities, including reading texts and watching videos that expand on the concepts discussed during the lecture.
Course literature
Some of the material in the following books will be covered in the course. These books are intended as optional reading and support material. The course is self-contained, and reading these books is not necessary to pass the exam with top marks.
- Mining massive datasets (http://www.mmds.org/)
- Modern information retrieval (https://www.amazon.com/Modern-Information-Retrieval-Concepts-Technology/dp/0321416910)
- Elasticsearch - the definitive guide (https://www.amazon.com/Elasticsearch-Definitive-Distributed-Real-Time-Analytics/dp/1449358543)
- Advanced Elasticsearch 7.0 (https://www.amazon.com/Advanced-Elasticsearch-7-0-practical-distributed/dp/1789957753)
- Recommender systems (https://www.amazon.com/Recommender-Systems-Textbook-Charu-Aggarwal/dp/3319296574)
- Recommender systems handbook (https://link.springer.com/book/10.1007/978-0-387-85820-3)
- Practical recommender systems (https://www.manning.com/books/practical-recommender-systems)
- Introduction to multiarmed bandits (https://www.nowpublishers.com/article/Details/MAL-068)
- Bandit algorithms for website optimization (https://www.amazon.com/Bandit-Algorithms-Website-Optimization-Developing/dp/1449341330)
- Trustworthy online controlled experiments (https://www.amazon.com/Trustworthy-Online-Controlled-Experiments-Practical/dp/1108724264)
- Computational advertising (https://www.amazon.com/Computational-Advertising-Peng-Liu/dp/1032241403)
- More literature will be published on the course page in LearnIT.
Student Activity Budget
Estimated distribution of learning activities for the typical student:
- Preparation for lectures and exercises: 5%
- Lectures: 25%
- Exercises: 35%
- Exam with preparation: 35%
Ordinary exam
Exam type: C: Submission of written work, External (7-point scale)
Exam variation: C11: Submission of written work
The exam will consist of two parts. The first is a series of open questions about the topics taught in class, each to be answered with a short paragraph. The second is a set of coding exercises in the domains of recommender systems, information retrieval, graph mining, computational advertising, and DevOps. The students will be asked to comment their code to justify and explain their choices. The final submission will contain a PDF with the answers to the questions and the software that implements the coding exercises (mostly Python code).
Reexam
Exam type: C: Submission of written work, External (7-point scale)
Exam variation: C11: Submission of written work
Time and date
Ordinary Exam - submission Fri, 2 Jun 2023, 08:00 - 14:00
Reexam - submission Wed, 26 Jul 2023, 08:00 - 14:00