CS-E4640 - Big Data Platforms D, 13.01.2021-07.04.2021
This course space end date is set to 07.04.2021.
Overview
-
We will have 9 lectures of 2 hours each.
You should also spend 18 hours on self-study. Overall, 36 hours should be spent on learning concepts. Further learning time will be spent in tutorials and on the assignments.
-
Basic information about the course will be given:
- Important notes about grading: the course evaluation will be based on assignments (including design, implementation and discussion)
- Strict deadlines
- Communication: via MyCourses, Slack and other means
-
In this lecture we will discuss what a big data platform is and study the key motivations for learning about big data platforms.
The lecture will be held on 13.01.2020.
-
We study and discuss key architectural principles for designing big data platforms. The lecture will be on 20.01.2020.
- your scenario/story of big data
- data movement in big data platforms
- basic big data pipelines
- Lambda architecture
- Kappa architecture
- big data at large-scale
- key building blocks and technologies
- reactive systems for big data platforms
- partitioning
- data concerns
- component API, interaction, orchestration and coordination
- components distribution
- scalability and elasticity
- your scenario/story of big data
-
Cloud technologies are important for developing and operating big data platforms. We will discuss the roles of cloud infrastructures for big data platforms.
- How would cloud technologies affect big data platform designs
- service models and virtualization
- examples: Kubernetes, VM, containers, ...
- Cloud technologies empowering big data platforms
- manage infrastructural resources for big data platforms
- fault-tolerance, performance and elasticity
- microservices and devops
-
We examine service models and integration for big data platforms. The lecture will be on 27.01.2020.
- Bring data into platforms
- data transfer/uploading models
- examples of technology stacks (Google, AWS, Azure)
- Messaging protocols for big data
- MQTT
- AMQP
- Optimizing service requests and functionalities
- Contention, back-pressure, elasticity
- Sharding
- Discovery and consensus in big data platforms
- Key techniques
- Examples: ZooKeeper, Consul, etcd
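To make the sharding topic above concrete, here is a minimal sketch of hash-based sharding. It is an illustration only, not a technique prescribed by the course: the function names `shard_for` and `partition` and the sample keys are hypothetical, and the sketch assumes a fixed number of shards (it does not handle resharding).

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a record key to a shard index in [0, num_shards)."""
    if num_shards <= 0:
        raise ValueError("num_shards must be positive")
    # Use a stable hash: Python's built-in hash() for str is salted per
    # process, so it would assign keys differently on every run.
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def partition(keys, num_shards):
    """Group keys by the shard they are assigned to."""
    shards = {i: [] for i in range(num_shards)}
    for key in keys:
        shards[shard_for(key, num_shards)].append(key)
    return shards


if __name__ == "__main__":
    # Hypothetical sensor identifiers used only for illustration.
    print(partition(["sensor-1", "sensor-2", "sensor-3", "sensor-4"], 2))
```

With plain modulo sharding, changing `num_shards` remaps most keys; schemes such as consistent hashing are commonly used to limit that movement.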
-
Big data storage, databases and services in big data platforms. The lecture will be on 03.02.2020.
- Consistency, Availability and Partition Tolerance
- Basic models, CAP/BASE
- Data models and data management
- Data models (file, relational, key-value, document-oriented, column-family, graph)
- Examples with cloud storage, Cassandra, MongoDB, etc.
-
Big data ingestion techniques. The lecture will be on 10.02.2020.
- Big data ingestion
- Models
- Data formats/semantics
- Patterns for data ingestion
- Ingestion processes: architectures and tools
- Common
- Batch models
- Function-as-a-service models
- Microbatching
- Examples
- E.g., Logstash, message brokers, Apache NiFi
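As a small illustration of the microbatching topic listed above, the following sketch groups an incoming record stream into fixed-size batches before handing them to a sink. The name `micro_batches` is hypothetical, and the sketch assumes size-based batching only; real ingestion tools typically also flush on a time limit.

```python
from itertools import islice


def micro_batches(records, batch_size):
    """Yield lists of up to batch_size records from any iterable."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    iterator = iter(records)
    while True:
        # islice pulls at most batch_size items without loading the
        # whole stream into memory.
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


if __name__ == "__main__":
    for batch in micro_batches(range(5), 2):
        print(batch)
```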
-
We will discuss Hadoop and its key components in the big data ecosystem. The lecture will be on 03.03.2020.
- Distributed big data in clusters
- Hadoop File systems
- YARN
- Hadoop-native big database/data warehouse systems
- HBase
- Apache Hive
- Use Hadoop for complex data management and analytics
-
MapReduce and Spark programming models for big data processing. The lecture will be on 10.03.2020.
- MapReduce programming model
- Apache Spark
- Real-world examples
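The MapReduce programming model listed above can be sketched in a few lines of plain Python. This is a single-process illustration of the map, shuffle and reduce phases using word count, not Hadoop or Spark code; all function names are hypothetical.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single result."""
    return key, sum(values)


def word_count(lines):
    pairs = (pair for line in lines for pair in map_phase(line))
    groups = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in groups.items())


if __name__ == "__main__":
    print(word_count(["big data", "Big platforms"]))
```

In a real cluster the map and reduce functions run in parallel on different nodes and the shuffle moves data over the network; the sketch keeps only the logical structure.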
-
Stream processing for big data and its relation to big data platforms. The lecture will be on 24.03.2020.
- Stream processing and big data platforms
- Key concepts of stream processing
- Event models, processing functions, windows, consistency
- Parallelism in stream processing
- Apache Flink
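To illustrate the window concept listed above, here is a minimal sketch of tumbling (fixed-size, non-overlapping) windows over event timestamps. It does not use the Apache Flink API; the function name and the `(timestamp, key)` event shape are assumptions made for the example, and late or out-of-order handling (watermarks) is deliberately left out.

```python
from collections import Counter


def tumbling_window_counts(events, window_size):
    """Count events per key in tumbling windows.

    events: iterable of (timestamp_seconds, key) pairs.
    Returns {window_start: Counter of keys seen in that window}.
    """
    if window_size <= 0:
        raise ValueError("window_size must be positive")
    windows = {}
    for timestamp, key in events:
        # Each event belongs to exactly one window, identified by its start.
        start = (timestamp // window_size) * window_size
        windows.setdefault(start, Counter())[key] += 1
    return windows


if __name__ == "__main__":
    sample = [(1, "a"), (4, "a"), (11, "b"), (12, "a")]
    for start, counts in sorted(tumbling_window_counts(sample, 10).items()):
        print(start, dict(counts))
```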
-
Workflow technologies and frameworks for big data. The lecture will be on 31.03.2020.
- The role of workflows for big data processing and platforms management
- Workflow models
- Common concepts, workflows of batch tasks, workflows of function-as-a-service
- Apache Airflow
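A workflow of batch tasks, as listed above, is commonly modelled as a directed acyclic graph of task dependencies. The sketch below derives a valid execution order with Kahn's algorithm. It is not Apache Airflow code; the function name and the task names in the usage example are hypothetical.

```python
from collections import deque


def execution_order(dependencies):
    """Return one valid run order for a workflow.

    dependencies: {task: set of upstream tasks that must finish first}.
    Raises ValueError if the dependencies contain a cycle.
    """
    tasks = set(dependencies)
    for upstream in dependencies.values():
        tasks |= set(upstream)
    remaining = {task: set(dependencies.get(task, ())) for task in tasks}
    # Start with tasks that have no upstream dependencies.
    ready = deque(sorted(t for t, ups in remaining.items() if not ups))
    order = []
    while ready:
        task = ready.popleft()
        order.append(task)
        # Finishing a task may unblock its downstream tasks.
        for other in sorted(remaining):
            if task in remaining[other]:
                remaining[other].remove(task)
                if not remaining[other]:
                    ready.append(other)
    if len(order) != len(tasks):
        raise ValueError("cycle detected; a workflow must be acyclic")
    return order


if __name__ == "__main__":
    workflow = {"ingest": set(), "clean": {"ingest"}, "report": {"clean"}}
    print(execution_order(workflow))
```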
-