COVID-19 Notes for Spring 2022
Welcome to CS-E4640!
The course is led by Hong-Linh Truong. It covers the main aspects of big data platforms, including platform understanding and design, core services in big data storage, big data ingestion, big data processing, and non-functional aspects like reliability, data governance and quality management. Both the development and the operation of big data platforms are covered. Furthermore, services in big data platform ecosystems will be discussed.
To join this course, students are expected to know the basics of cloud computing systems, database/data management, service design and DevOps in cloud computing. It is an advantage if the student has already completed courses such as Mobile Cloud Computing, Software Architectures, and Concurrent Programming.
Unofficially, a participant can audit the course, as all materials, lectures and hands-on sessions are public. However, to audit the course you must register in the course MS Teams to get information about sessions. Participants who audit the course will not be able to access or submit the assignments.
Imagine that you finish the course and become an "expert" in "Big Data Platforms" from @CSAalto. You work for a company, and one day you get a request to build a big data platform for the company with your team (in this course your team is you, playing different roles). You might get a description like:
“Your team has to build a big data platform for X types of data. Data will be generated/collected from N sources. We expect to have 10+ GBs/day of data to be ingested into our platform. We will have to serve K thousands of requests for different types of analytics – to be determined. Our response time should be in t milliseconds. Our services should not be …”
@PS: and things will be added and changed
You also know that big data is characterized by many "V" properties (volume, velocity, variety, veracity, ...) and that a platform must facilitate different types of interactions for exchanging data and services. You face many questions about the development and operation of big data platforms and their big data pipelines: How do you design a big data platform that is resilient, elastic and responsive, and that allows different customers and applications to be integrated? Which data models should you select? Do you have to support batch or streaming processing? There are also very practical issues: should you use public cloud infrastructures or build your own? Which cloud companies should you rely on: Google, Amazon or Microsoft? Your story is not centered on a "narrow scope" of big data processing, like taking a lot of data, putting it into Hadoop and running ML algorithms (although even work in such a "narrow scope" is not easy to achieve). Instead, you need to deal with the big picture of many tasks in big data platforms, involving designs with microservices and serverless, reactive systems patterns, big data storage and databases, complex data ingestion, and various data processing models and the algorithms atop them, to name just a few.
But of course, with the limited time of a 5-credit course, you cannot be the master of all aspects (BTW, who could be the master of big data, given the complexity of the field?). Thus you need to build your platform atop core concepts, practice your tasks with the four assignments, make the most of your best skills within the big field of "Big Data Platforms", and let your other team members work with you to deliver the "Big Data Platform" under your lead. Build your story!
All dates in the agenda are booked for lectures and tutorials.
- See the current agenda.
- No lecture day: