Please note! The course description is confirmed for two academic years, which means that, in general, the learning outcomes, assessment methods and key content stay unchanged. However, the course syllabus may specify or change how each realization of the course is executed, such as how the contact sessions are organized, how the assessment methods are weighted, or which materials are used.
LEARNING OUTCOMES
After this course, the student is able to
- understand big data and platforms w.r.t. services, stakeholders, interactions and state-of-the-art technologies
- understand key interactions and performance design patterns in big data platforms
- produce designs of big data platforms with key services like data stores, data ingestion, batch and stream processing
- demonstrate design and implementation of big data ingestion, batch processing, stream processing and data governance processes
- assess performance and reliability issues in operating big data platforms
- deliver real-world prototypes of big data platforms with real datasets and technologies in large-scale systems
Credits: 5
Schedule: 13.01.2021 - 07.04.2021
Teacher in charge (valid 01.08.2020-31.07.2022): Linh Truong
Teacher in charge (applies in this implementation): Linh Truong
Contact information for the course (valid 09.12.2020-21.12.2112):
Due to COVID-19, the course will be online. Students can contact the professor in charge and TAs through:
- Microsoft Teams chat/message directly
- Microsoft Teams chat/message in the course space
- Email
CEFR level (applies in this implementation):
Language of instruction and studies (valid 01.08.2020-31.07.2022):
Teaching language: English
Languages of study attainment: English
CONTENT, ASSESSMENT AND WORKLOAD
Content
Valid 01.08.2020-31.07.2022:
The course covers the main aspects of big data platforms, including data platform services and ecosystems, architectures and designs for big data, core services in big data stores, big data ingestion techniques, big data processing models (batch and streaming), and big data governance. Common aspects such as interactions among users, developers and providers, as well as reliability, performance and elasticity of big data platforms, will be studied and implemented. The design, development and operation of big data platforms are all covered.
Applies in this implementation:
Lectures:
- Introduction to Big Data Platforms
- Architecting Big Data Platforms
- Service and Integration Models in Big Data Platforms
- Big Data Storage and Database Services
- Big Data Ingestion
- Hadoop and Its Big Data Ecosystems
- Big Data Processing with MapReduce/Spark Programming Models
- Streaming Processing and Big Data Platforms
- Workflows for Big Data Platforms
Tutorials:
- Some industrial and open source big data platforms for your tech radar
- Hands-on examples with big database services
- Data Ingestion with Apache NiFi
- Hadoop
- Data Processing with Apache Spark
- Stream Processing with Apache Flink
- Data processing with Apache Airflow
Meetups:
- A Taste of Big Data Platforms
- How to succeed on assignments in Big Data Platforms
- Issues in time series data ingestion
- Big Data Platforms and Microservices
Assessment Methods and Criteria
Valid 01.08.2020-31.07.2022:
Assignments and exams (based on Q/A for assignments). Each assignment will include theoretical concepts, big datasets, component designs, software implementation and testing, and extensibility/integration discussions.
Applies in this implementation:
Three assignments will be given.
Workload
Valid 01.08.2020-31.07.2022:
Lectures: 10 (2), Teaching in small groups: 7 (1), Independent work, including self-study and assignments: 88
Note the workload ratios:
- Lecture: Teaching hours: 20, Independent work: 20, Total workload: 40
- Exercise: Teaching hours: 7, Independent work: 0, Total workload: 7
- Assignments: Independent work: 88, Total workload: 88
- Total: 135
Applies in this implementation:
Lecture:
- Teaching hours: 18, Independent work: 18, Total workload: 40
Exercise (hands-on and meetups):
- Teaching hours: 7
- Meetups: 4
Assignments:
- 88 hours
DETAILS
Study Material
Valid 01.08.2020-31.07.2022:
Lecture slides, tutorials, open sources, and assignments
Applies in this implementation:
- Slides can be found in CS-E4640 MyCourses
- Examples, tutorials and previous teaching materials can be found in CS-E4640 GIT
Prerequisites
Valid 01.08.2020-31.07.2022:
This course requires background knowledge of cloud computing, distributed computing, operating systems, and basic databases. To meet this requirement, students must either (1) complete the following courses at Aalto: CS-C3140 Operating Systems and CS-E4150 Cloud Software and Systems, or (2) demonstrate that they understand relevant concepts and technologies such as distributed computing infrastructures, service discovery, virtualization and containers, distributed filesystems and databases. Furthermore, students must be able to program well in one or two of the following common programming languages: Java, JavaScript, GoLang, Python, and Scala.
Prerequisites will be checked through students' completed courses and/or through a pre-assignment/interview with the responsible teachers.
It is an advantage if students have completed courses covering topics in Parallel Computing, NoSQL databases and Service Design, can work with programming languages beyond those mentioned above, and are familiar with working in large-scale computing environments.
FURTHER INFORMATION
Details on the schedule
Applies in this implementation:
- Teacher: Chen Hsin-Yi
- Teacher: Harlicaj Eljon
- Teacher: Nguyen Tri
- Teacher: Pham Phuong
- Teacher: Raj Rohit
- Teacher: Truong Linh