Mastering Apache Spark
Methodology
The program is designed to provide an overall conceptual framework and common design patterns. Key concepts in each area are explained and working code is provided. Participants will be able to run the examples and are expected to understand the code. While key concepts are explained, a detailed code walk-through is usually not feasible in the interest of time. Code is written in Java and Scala, so prior knowledge of these languages will be helpful for understanding the code-level implementation of key concepts.
Training Highlights
- Complete coverage of Spark Programming fundamentals
- Each participant creates a fully functional multi-node Spark cluster
- End-to-end real-time analytics with Spark
- Integration with Hadoop, Hive, HBase, Cassandra, and more
- Participants build a complete project during the course
Intended Audience
- Hadoop developers
- ETL developers
- Java developers
- BI professionals
Training Goals
To provide a thorough understanding of in-memory distributed computing concepts and the Spark API, enabling participants to develop Spark programs of moderate complexity.
Curriculum
Introduction to Big Data Analytics
- What is Big Data? – The 3V Paradigm
- Limitations of Conventional Technologies
- Essentials of Distributed Computing
- Introduction to Hadoop & Its Ecosystem
Spark Essentials
- Spark Background & Overview
- Spark Architecture & RDD Basics
- Common Transformations & Actions (sketch below)
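To make the transformation/action distinction concrete, here is a minimal Scala sketch in the Spark 1.6 RDD style used throughout the course; the app name and data are illustrative, not part of the course material:

    import org.apache.spark.{SparkConf, SparkContext}

    object RddBasics {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("RddBasics"))

        // Transformations (filter, map) are lazy; nothing executes yet
        val numbers = sc.parallelize(1 to 100)
        val evens   = numbers.filter(_ % 2 == 0)
        val squares = evens.map(n => n * n)

        // Actions (count, take) trigger execution on the cluster
        println("Even count: " + evens.count())
        println("First squares: " + squares.take(5).mkString(", "))

        sc.stop()
      }
    }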
Setting Up a Spark Cluster
- Exercise: Installing & Configuring a Spark 1.6.2 Cluster
- Exercise: Simple Word Count Program Using Eclipse (sketch below)
- Exercise: Analyzing Stock Market Data
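A minimal sketch of the word-count program built in the Eclipse exercise, assuming placeholder input and output paths:

    import org.apache.spark.{SparkConf, SparkContext}

    object WordCount {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("WordCount"))

        // Read a text file, split lines into words, and count each word
        val counts = sc.textFile("input/sample.txt")       // path is a placeholder
          .flatMap(line => line.split("\\s+"))
          .map(word => (word, 1))
          .reduceByKey(_ + _)

        counts.saveAsTextFile("output/wordcounts")          // path is a placeholder
        sc.stop()
      }
    }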
Working With Pair RDDs
- Concepts of Key/Value Pairs
- Per-Key Aggregations Using Map, Shuffle & Reduce
- Transformations & Actions on Pair RDDs
- Exercise: Finding Company-wise Total Dividend Paid (sketch below)
- Exercise: Determining the Top 5 Dividend-Paying Companies
- Two-Pair-RDD Transformations: Joins in Spark
- Exercise: Correlating Price Movement with Dividend Payment
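A sketch of the pair-RDD patterns behind the dividend exercises, assuming an existing SparkContext sc and simple CSV layouts (symbol,dividend and symbol,price) chosen for illustration:

    // dividends.csv lines like "INFY,12.5"; prices.csv lines like "INFY,1050.0"
    val dividends = sc.textFile("dividends.csv")
      .map(_.split(","))
      .map(f => (f(0), f(1).toDouble))            // (symbol, dividend)

    // Per-key aggregation: total dividend paid by each company
    val totalBySymbol = dividends.reduceByKey(_ + _)

    // Top 5 dividend-paying companies
    val top5 = totalBySymbol.sortBy(_._2, ascending = false).take(5)

    // Two-pair-RDD transformation: join dividends with prices on the symbol key
    val prices = sc.textFile("prices.csv")
      .map(_.split(","))
      .map(f => (f(0), f(1).toDouble))            // (symbol, closing price)

    val joined = totalBySymbol.join(prices)       // (symbol, (totalDividend, price))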
Basic Input / Output
- Various Sources of RDDs
- Exercise: Loading & Saving Data from Flat Files (sketch below)
- Introduction to HDFS
- Exercise: Setting Up an HDFS Cluster
- Loading & Saving Files in HDFS
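A sketch of loading and saving flat files locally and on HDFS, assuming an existing SparkContext sc; the namenode host/port and paths are placeholders:

    // Load from the local file system and from HDFS
    val localLines = sc.textFile("file:///data/stocks.csv")
    val hdfsLines  = sc.textFile("hdfs://namenode:9000/data/stocks.csv")

    // Read small whole files as (fileName, content) pairs
    val reports = sc.wholeTextFiles("hdfs://namenode:9000/data/reports/")

    // Save results back to HDFS; the output directory must not already exist
    hdfsLines.filter(_.nonEmpty).saveAsTextFile("hdfs://namenode:9000/out/clean-stocks")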
Deploying On a Cluster
- Spark Submit & Job Configuration (sketch below)
- Job Execution Lifecycle
- Introduction to YARN
- Exercise: Deploying Spark Applications on a YARN Cluster
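For the YARN deployment exercise, a typical spark-submit invocation might look like the following; the class name, jar name, and resource settings are placeholders:

    spark-submit \
      --class com.example.WordCount \
      --master yarn \
      --deploy-mode cluster \
      --num-executors 4 \
      --executor-memory 2g \
      --executor-cores 2 \
      wordcount.jar hdfs:///data/input hdfs:///data/output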
SparkSQL
- SparkSQL Basics & Architecture
- Exercise: Creating DataFrames from JSON, CSV & MySQL Tables (sketch below)
- Creating Temporary Tables & Querying Them with HQL
- Exercise: Analyzing and Summarizing Weather Data Using HQL
- Storing & Caching Tables
- Exercise: Setting Up a Metastore Service and Querying Tables Using External JDBC Applications
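A minimal SparkSQL sketch in the Spark 1.6 style (SQLContext and registerTempTable), assuming an existing SparkContext sc; the JSON file and field names are illustrative:

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)

    // Create a DataFrame from a JSON file (one JSON object per line)
    val weather = sqlContext.read.json("hdfs:///data/weather.json")
    weather.printSchema()

    // Register a temporary table and query it with SQL/HQL
    weather.registerTempTable("weather")
    val summary = sqlContext.sql(
      "SELECT station, MAX(temperature) AS max_temp FROM weather GROUP BY station")
    summary.show()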
Hive Integration
- Hive Formats & SerDes
- Exercise: Working with Hive Tables & Formats – ORC & XML
- Exercise: Creating & Using User-Defined Functions (UDFs) (sketch below)
- Using the SparkSQL Shell to Query Hive Tables
- Using the Beeline Shell
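A sketch of querying Hive tables and registering a UDF from Spark, assuming an existing SparkContext sc, a Spark build with Hive support, and an illustrative stocks table:

    import org.apache.spark.sql.hive.HiveContext

    val hiveContext = new HiveContext(sc)

    // Register a simple user-defined function usable from HQL
    hiveContext.udf.register("to_upper", (s: String) => s.toUpperCase)

    // Query an existing Hive table (e.g. stored as ORC) using HQL
    val result = hiveContext.sql(
      "SELECT to_upper(symbol) AS symbol, price FROM stocks WHERE price > 100")
    result.show()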
Spark Streaming
- Spark Streaming Architecture & DStreams
- Transformations on DStreams
- Exercise: Counting Words from Streaming Data (sketch below)
- Stateful & Stateless Word Count
- Recovering from Faults in Spark Streaming
- Introduction to Flume
- Using Spark Streaming as a Flume Sink
- Exercise: Real-Time Analytics Design Patterns
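A minimal stateless streaming word-count sketch over a socket DStream, assuming an existing SparkContext sc; the host, port, and batch interval are illustrative (the stateful variant additionally requires checkpointing and updateStateByKey):

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // Batch interval of 5 seconds
    val ssc = new StreamingContext(sc, Seconds(5))

    // Read lines from a socket source
    val lines = ssc.socketTextStream("localhost", 9999)

    // Stateless word count, computed independently for each batch
    val counts = lines.flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    counts.print()

    ssc.start()
    ssc.awaitTermination()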
Advanced Topics
- Using Combiners to Avoid Multiple Actions & Data Columns
- Accumulators & Broadcast Variables
- Exercise: Counting Corrupted Records Using Accumulators (sketch below)
- Exercise: Sharing a Lookup Table Across Partitions
- Working on a Per-Partition Basis
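A sketch of the accumulator and broadcast-variable exercises, assuming an existing SparkContext sc and the same illustrative CSV layout as above:

    // Accumulator: count corrupted records as a side effect of normal processing
    val corrupted = sc.accumulator(0L, "corrupted records")

    // Broadcast variable: ship a small lookup table to every executor once
    val sectorLookup = sc.broadcast(Map("INFY" -> "IT", "HDFC" -> "Banking"))

    val parsed = sc.textFile("prices.csv").flatMap { line =>
      val fields = line.split(",")
      if (fields.length < 2) { corrupted += 1L; None }
      else Some((fields(0), sectorLookup.value.getOrElse(fields(0), "Unknown")))
    }

    parsed.count()                              // an action triggers the accumulation
    println("Corrupted records: " + corrupted.value)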
Performance Tuning
- Understanding Spark's Execution Plan (DAG) (sketch below)
- Using Logs and the Web UI to Identify Problems
- Key Performance Considerations
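For the execution-plan and caching discussion, a small sketch showing how to inspect an RDD's lineage and cache a dataset that several actions reuse (file layout as in the earlier illustrative exercises):

    val prices = sc.textFile("prices.csv")
      .map(_.split(","))
      .map(f => (f(0), f(1).toDouble))

    // Cache an RDD reused by several actions to avoid recomputing its lineage
    prices.cache()

    // Print the lineage (DAG of stages) Spark will execute for this RDD
    println(prices.toDebugString)

    // Both actions below now read from the cache instead of re-reading the file
    println(prices.count())
    println(prices.values.max())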