Topic outline

    • In this lecture we discuss what a big data platform is and the key motivations for studying big data platforms.


    • We study and discuss key architectural principles for designing big data platforms. 

      • your scenario/story of big data
      • data movement in big data platforms
      • basic big data pipelines
        • Lambda architecture
        • Kappa architecture
      • big data at large scale
        • key building blocks and technologies
        • reactive systems for big data platforms
        • partitioning
        • data concerns
        • component API, interaction, orchestration and coordination
        • component distribution
        • scalability and elasticity
    • We examine service models and integration for big data platforms.

      • Bring data into platforms
        • data transfer/uploading models
        • examples of technology stacks (Google, AWS, Azure)
      • Messaging protocols for big data
        • MQTT
        • AMQP
      • Optimizing service requests and functionalities
        • Contention, back-pressure, elasticity
        • Sharding
      • Discovery and consensus in big data platforms
        • Key techniques
        • Examples: ZooKeeper, Consul, etcd
    • Big data storage, databases, and services in big data platforms.

      • Consistency, Availability and Partition Tolerance
        • Basic models, CAP/BASE
      • Data models and data management
        • Data models (File, relational data, Key-value model, document-oriented model, column family, graph)
        • Examples with cloud storage, Cassandra, MongoDB, etc.
    • Big data ingestion techniques.

      • Big data ingestion
        • Models
        • Data formats/semantics
        • Patterns for data ingestion
      • Ingestion processes: architectures and tools
        • Common
        • Batch models
        • Function-as-a-service models
        • Microbatching
      • Examples
        • E.g., Logstash, message brokers, Apache NiFi
    • We will discuss Hadoop and its key components in the big data ecosystem.

      Distributed big data in clusters

      • Hadoop file systems
      • YARN
      • Hadoop-native big database/data warehouse systems
        • HBase
        • Apache Hive
      • Use Hadoop for complex data management and analytics
    • MapReduce and Spark programming models for big data processing.

      • MapReduce programming model
      • Apache Spark
      • Real-world examples

    • Stream processing for big data and its relation to big data platforms.

      • Stream processing and big data platforms
      • Key concepts of stream processing
        • Event models, processing functions, windows, consistency
        • Parallelism in stream processing
      • Apache Flink
    • Workflow technologies and frameworks for big data.

      • The role of workflows in big data processing and platform management
      • Workflow models
        • Common concepts, workflows of batch tasks, workflows of function-as-a-service
      • Apache Airflow