CS-E4640 - Big Data Platforms D, Lecture, 11.1.2023-13.4.2023
This course space end date is set to 13.04.2023.
Topic outline
-
In this lecture we discuss what a big data platform is and the key motivations for studying big data platforms.
-
We study and discuss key architectural principles for designing big data platforms.
- your scenario/story of big data
- data movement in big data platforms
- basic big data pipelines
- Lambda architecture
- Kappa architecture
- big data at large scale
- key building blocks and technologies
- reactive systems for big data platforms
- partitioning
- data concerns
- component API, interaction, orchestration and coordination
- components distribution
- scalability and elasticity
-
We examine service models and integration for big data platforms.
- Bring data into platforms
- data transfer/uploading models
- examples of technology stacks (Google, AWS, Azure)
- Messaging protocols for big data
- MQTT
- AMQP
- Optimizing service requests and functionalities
- Contention, back-pressure, elasticity
- Sharding
- Discovery and consensus in big data platforms
- Key techniques
- Examples: ZooKeeper, Consul, etcd
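As a small illustration of the sharding topic listed above, the following is a minimal sketch of hash-based sharding, assuming string keys and a fixed number of shards. The function names `shard_for` and `partition` are hypothetical and not taken from the course material. Real platforms often use consistent hashing or range partitioning instead, so that changing the shard count moves fewer keys.

```python
import hashlib
from typing import Dict, Iterable, List


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard index with a stable hash.

    hashlib is used instead of the built-in hash(), which is salted
    per process and would give different shards on different nodes.
    """
    if num_shards <= 0:
        raise ValueError("num_shards must be positive")
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def partition(keys: Iterable[str], num_shards: int) -> Dict[int, List[str]]:
    """Group keys by the shard they are assigned to (illustrative helper)."""
    shards: Dict[int, List[str]] = {i: [] for i in range(num_shards)}
    for key in keys:
        shards[shard_for(key, num_shards)].append(key)
    return shards


if __name__ == "__main__":
    print(partition(["sensor-1", "sensor-2", "sensor-3", "sensor-4"], 2))
```

With plain modulo hashing, changing `num_shards` reassigns most keys, which is one reason the discovery and consensus services in this section matter when the set of nodes changes.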
-
Big data storage, databases, and services in big data platforms.
- Consistency, Availability and Partition Tolerance
- Basic models, CAP/BASE
- Data models and data management
- Data models (File, relational data, Key-value model, document-oriented model, column family, graph)
- Examples with cloud storage, Cassandra, MongoDB, etc.
-
Big data ingestion techniques.
- Big data ingestion
- Models
- Data formats/semantics
- Patterns for data ingestion
- Ingestion processes: architectures and tools
- Common
- Batch models
- Function-as-a-service models
- Microbatching
- Examples
- E.g., Logstash, using message brokers, Apache NiFi
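To make the microbatching item above concrete, here is a minimal sketch that groups an incoming record stream into fixed-size batches before handing them to a sink. The name `micro_batches` is hypothetical, and flushing is by record count only. Production ingestion tools usually also flush on a time limit, which this sketch leaves out.

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def micro_batches(records: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield lists of at most batch_size records, preserving input order."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    batch: List[T] = []
    for record in records:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    # Flush the final, possibly smaller, batch.
    if batch:
        yield batch


if __name__ == "__main__":
    for batch in micro_batches(range(5), 2):
        print(batch)
```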
-
We discuss Hadoop and its key components for the big data ecosystem.
- Distributed big data in clusters
- Hadoop File systems
- YARN
- Hadoop-native big database/data warehouse systems
- HBase
- Apache Hive
- Use Hadoop for complex data management and analytics
-
MapReduce and Spark programming models for big data processing.
- MapReduce programming model
- Apache Spark
- Real-world examples
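The MapReduce programming model listed above can be sketched in a single process as a word count with explicit map, shuffle, and reduce steps. This is an illustration of the model only, not of the Hadoop or Spark APIs. All function names here are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(document: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: combine all values of one key into a single count."""
    return key, sum(values)


def word_count(documents: Iterable[str]) -> Dict[str, int]:
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data", "big platforms"]))
```

In a cluster, the map calls run in parallel on different input splits and the shuffle moves data across the network, which is where most of the cost usually lies.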
-
Stream processing for big data and its relation to big data platforms.
- Stream processing and big data platforms
- Key concepts of stream processing
- Event models, processing functions, windows, consistency
- Parallelism in stream processing
- Apache Flink
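As a rough illustration of the window concept listed above, the sketch below assigns events to tumbling (fixed-size, non-overlapping) windows by event time and counts events per key in each window. It assumes integer timestamps in seconds and processes a finite list. It ignores out-of-order handling and watermarks, and it is not the Apache Flink API. The name `tumbling_window_counts` is hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def tumbling_window_counts(
    events: Iterable[Tuple[int, str]], window_size: int
) -> Dict[int, Dict[str, int]]:
    """Count events per key in each tumbling window.

    events: (timestamp_seconds, key) pairs.
    Returns a mapping window_start -> {key: count}.
    """
    if window_size <= 0:
        raise ValueError("window_size must be positive")
    windows: Dict[int, Dict[str, int]] = defaultdict(lambda: defaultdict(int))
    for timestamp, key in events:
        # Each event belongs to exactly one window.
        window_start = (timestamp // window_size) * window_size
        windows[window_start][key] += 1
    return {start: dict(counts) for start, counts in sorted(windows.items())}


if __name__ == "__main__":
    sample = [(1, "a"), (4, "a"), (12, "b"), (19, "a")]
    print(tumbling_window_counts(sample, 10))
```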
-
Workflow technologies and frameworks for big data.
- The role of workflows for big data processing and platforms management
- Workflow models
- Common concepts, workflows of batch tasks, workflows of function-as-a-service
- Apache Airflow
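A workflow of batch tasks, as listed above, is commonly modelled as a directed acyclic graph in which a task runs only after its upstream tasks have finished. The sketch below uses Python's standard `graphlib.TopologicalSorter` (Python 3.9+) to derive one valid execution order. It is not the Apache Airflow API, and `run_workflow` and the task names are hypothetical.

```python
from graphlib import TopologicalSorter
from typing import Callable, Dict, List, Set


def run_workflow(
    tasks: Dict[str, Callable[[], None]],
    dependencies: Dict[str, Set[str]],
) -> List[str]:
    """Run tasks sequentially in dependency order and return that order.

    dependencies maps each task name to the set of tasks it depends on.
    TopologicalSorter raises graphlib.CycleError if the graph has a cycle.
    """
    order = list(TopologicalSorter(dependencies).static_order())
    for name in order:
        tasks[name]()
    return order


if __name__ == "__main__":
    log: List[str] = []
    tasks = {
        "extract": lambda: log.append("extract"),
        "transform": lambda: log.append("transform"),
        "load": lambda: log.append("load"),
    }
    dependencies = {"extract": set(), "transform": {"extract"}, "load": {"transform"}}
    print(run_workflow(tasks, dependencies))
```

A real workflow engine adds scheduling, retries, and parallel execution of independent tasks on top of this ordering.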
-