Mastering Apache Spark
Methodology
The program is designed to provide an overall conceptual framework and common design patterns. Key concepts in each area are explained and working code is provided. Participants will be able to run the examples and are expected to understand the code. While key concepts are explained, a detailed code walk-through is usually not feasible in the interest of time. Code is written in Java and Scala, so prior knowledge of these languages will be helpful for understanding the code-level implementation of key concepts.
Training Highlights
- Complete coverage of Spark Programming fundamentals
- Each participant creates a fully functional multi-node Spark cluster
- End-to-end real-time analytics with Spark
- Integration with Hadoop, Hive, HBase, Cassandra, and more
- Participants build a complete project during the course
Intended Audience
- Hadoop developers
- ETL developers
- Java developers
- BI professionals
Training Goals
To provide a thorough understanding of in-memory distributed computing concepts and the Spark API, enabling participants to develop Spark programs of moderate complexity.
Curriculum
Introduction to Big Data Analytics
- What is Big Data? – The 3V Paradigm
- Limitations of Conventional Technologies
- Essentials of Distributed Computing
- Introduction to Hadoop & Its Ecosystem
Spark Essentials
- Spark Background & Overview
- Spark Architecture & RDD Basics
- Common Transformations & Actions (sketch below)
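To make the transformation/action distinction concrete, here is a minimal Scala sketch in the Spark 1.6 RDD style used throughout the course; the app name and data are illustrative, not part of the course material:

    import org.apache.spark.{SparkConf, SparkContext}

    object RddBasics {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("RddBasics"))

        // Transformations (filter, map) are lazy; nothing executes yet
        val numbers = sc.parallelize(1 to 100)
        val evens   = numbers.filter(_ % 2 == 0)
        val squares = evens.map(n => n * n)

        // Actions (count, take) trigger execution on the cluster
        println("Even count: " + evens.count())
        println("First squares: " + squares.take(5).mkString(", "))

        sc.stop()
      }
    }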
Setting Up a Spark Cluster
- Exercise: Installing & Configuring a Spark 1.6.2 Cluster
- Exercise: Simple Word Count Program Using Eclipse (sketch below)
- Exercise: Analyzing Stock Market Data
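A minimal sketch of the word-count program built in the Eclipse exercise, assuming placeholder input and output paths:

    import org.apache.spark.{SparkConf, SparkContext}

    object WordCount {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("WordCount"))

        // Read a text file, split lines into words, and count each word
        val counts = sc.textFile("input/sample.txt")       // path is a placeholder
          .flatMap(line => line.split("\\s+"))
          .map(word => (word, 1))
          .reduceByKey(_ + _)

        counts.saveAsTextFile("output/wordcounts")          // path is a placeholder
        sc.stop()
      }
    }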
Working With Pair RDDs
- Concepts of Key/Value Pairs
- Per-Key Aggregations Using Map, Shuffle & Reduce
- Transformations & Actions on Pair RDDs
- Exercise: Finding Company-wise Total Dividend Paid (sketch below)
- Exercise: Determining the Top 5 Dividend-Paying Companies
- Two-Pair-RDD Transformations: Joins in Spark
- Exercise: Correlating Price Movement with Dividend Payment
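A sketch of the pair-RDD patterns behind the dividend exercises, assuming an existing SparkContext sc and simple CSV layouts (symbol,dividend and symbol,price) chosen for illustration:

    // dividends.csv lines like "INFY,12.5"; prices.csv lines like "INFY,1050.0"
    val dividends = sc.textFile("dividends.csv")
      .map(_.split(","))
      .map(f => (f(0), f(1).toDouble))            // (symbol, dividend)

    // Per-key aggregation: total dividend paid by each company
    val totalBySymbol = dividends.reduceByKey(_ + _)

    // Top 5 dividend-paying companies
    val top5 = totalBySymbol.sortBy(_._2, ascending = false).take(5)

    // Two-pair-RDD transformation: join dividends with prices on the symbol key
    val prices = sc.textFile("prices.csv")
      .map(_.split(","))
      .map(f => (f(0), f(1).toDouble))            // (symbol, closing price)

    val joined = totalBySymbol.join(prices)       // (symbol, (totalDividend, price))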
Basic Input / Output
- Various Sources of RDDs
- Exercise: Loading & Saving Data from Flat Files (sketch below)
- Introduction to HDFS
- Exercise: Setting Up an HDFS Cluster
- Loading & Saving Files in HDFS
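A sketch of loading and saving flat files locally and on HDFS, assuming an existing SparkContext sc; the namenode host/port and paths are placeholders:

    // Load from the local file system and from HDFS
    val localLines = sc.textFile("file:///data/stocks.csv")
    val hdfsLines  = sc.textFile("hdfs://namenode:9000/data/stocks.csv")

    // Read small whole files as (fileName, content) pairs
    val reports = sc.wholeTextFiles("hdfs://namenode:9000/data/reports/")

    // Save results back to HDFS; the output directory must not already exist
    hdfsLines.filter(_.nonEmpty).saveAsTextFile("hdfs://namenode:9000/out/clean-stocks")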
Deploying On a Cluster
- Spark Submit & Job Configuration (sketch below)
- Job Execution Lifecycle
- Introduction to YARN
- Exercise: Deploying Spark Applications on a YARN Cluster
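For the YARN deployment exercise, a typical spark-submit invocation might look like the following; the class name, jar name, and resource settings are placeholders:

    spark-submit \
      --class com.example.WordCount \
      --master yarn \
      --deploy-mode cluster \
      --num-executors 4 \
      --executor-memory 2g \
      --executor-cores 2 \
      wordcount.jar hdfs:///data/input hdfs:///data/output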
SparkSQL
- SparkSQL Basics & Architecture
- Exercise: Creating DataFrames from JSON, CSV & MySQL Tables (sketch below)
- Creating Temporary Tables & Querying Them with HQL
- Exercise: Analyzing and Summarizing Weather Data Using HQL
- Storing & Caching Tables
- Exercise: Setting Up a Metastore Service and Querying Tables Using External JDBC Applications
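A minimal SparkSQL sketch in the Spark 1.6 style (SQLContext and registerTempTable), assuming an existing SparkContext sc; the JSON file and field names are illustrative:

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)

    // Create a DataFrame from a JSON file (one JSON object per line)
    val weather = sqlContext.read.json("hdfs:///data/weather.json")
    weather.printSchema()

    // Register a temporary table and query it with SQL/HQL
    weather.registerTempTable("weather")
    val summary = sqlContext.sql(
      "SELECT station, MAX(temperature) AS max_temp FROM weather GROUP BY station")
    summary.show()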
Hive Integration
- Hive Formats & SerDes
- Exercise: Working with Hive Tables & Formats – ORC & XML
- Exercise: Creating & Using User-Defined Functions (UDFs) (sketch below)
- Using the SparkSQL Shell to Query Hive Tables
- Using the Beeline Shell
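A sketch of querying Hive tables and registering a UDF from Spark, assuming an existing SparkContext sc, a Spark build with Hive support, and an illustrative stocks table:

    import org.apache.spark.sql.hive.HiveContext

    val hiveContext = new HiveContext(sc)

    // Register a simple user-defined function usable from HQL
    hiveContext.udf.register("to_upper", (s: String) => s.toUpperCase)

    // Query an existing Hive table (e.g. stored as ORC) using HQL
    val result = hiveContext.sql(
      "SELECT to_upper(symbol) AS symbol, price FROM stocks WHERE price > 100")
    result.show()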
Spark Streaming
- Spark Streaming Architecture & DStreams
- Transformations on DStreams
- Exercise: Counting Words from Streaming Data (sketch below)
- Stateful & Stateless Word Count
- Recovering from Faults in Spark Streaming
- Introduction to Flume
- Using Spark Streaming as a Flume Sink
- Exercise: Real-Time Analytics Design Patterns
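A minimal stateless streaming word-count sketch over a socket DStream, assuming an existing SparkContext sc; the host, port, and batch interval are illustrative (the stateful variant additionally requires checkpointing and updateStateByKey):

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // Batch interval of 5 seconds
    val ssc = new StreamingContext(sc, Seconds(5))

    // Read lines from a socket source
    val lines = ssc.socketTextStream("localhost", 9999)

    // Stateless word count, computed independently for each batch
    val counts = lines.flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    counts.print()

    ssc.start()
    ssc.awaitTermination()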
Advanced Topics
- Using Combiners to Avoid Multiple Actions & Data Columns
- Accumulators & Broadcast Variables
- Exercise: Counting Corrupted Records Using Accumulators (sketch below)
- Exercise: Sharing a Lookup Table Across Partitions
- Working on a Per-Partition Basis
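A sketch of the accumulator and broadcast-variable exercises, assuming an existing SparkContext sc and the same illustrative CSV layout as above:

    // Accumulator: count corrupted records as a side effect of normal processing
    val corrupted = sc.accumulator(0L, "corrupted records")

    // Broadcast variable: ship a small lookup table to every executor once
    val sectorLookup = sc.broadcast(Map("INFY" -> "IT", "HDFC" -> "Banking"))

    val parsed = sc.textFile("prices.csv").flatMap { line =>
      val fields = line.split(",")
      if (fields.length < 2) { corrupted += 1L; None }
      else Some((fields(0), sectorLookup.value.getOrElse(fields(0), "Unknown")))
    }

    parsed.count()                              // an action triggers the accumulation
    println("Corrupted records: " + corrupted.value)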
Performance Tuning
- Understanding Spark's Execution Plan (DAG) (sketch below)
- Using Logs and the Web UI to Identify Problems
- Key Performance Considerations
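For the execution-plan and caching discussion, a small sketch showing how to inspect an RDD's lineage and cache a dataset that several actions reuse (file layout as in the earlier illustrative exercises):

    val prices = sc.textFile("prices.csv")
      .map(_.split(","))
      .map(f => (f(0), f(1).toDouble))

    // Cache an RDD reused by several actions to avoid recomputing its lineage
    prices.cache()

    // Print the lineage (DAG of stages) Spark will execute for this RDD
    println(prices.toDebugString)

    // Both actions below now read from the cache instead of re-reading the file
    println(prices.count())
    println(prices.values.max())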