Introduction to Apache Spark and the ecosystem
- Introduction to big data and its importance
- An overview of the Apache Spark ecosystem and how it compares to other big
data technologies
- Installing and configuring Apache Spark and preparing the development
environment
- Basics of RDD (Resilient Distributed Dataset) and its operations
- Practical exercise: Creating the first Spark application using RDD
- Discussion of the advantages and disadvantages of RDD
- Introduction to Datasets and DataFrames for more efficient work with
data
Advanced data processing with Apache Spark
- Detailed view of DataFrames and operations with them
- SQL queries in Spark and working with Spark SQL
- Hands-on: Data transformation and aggregation using Spark SQL and
*taFrames
- Introduction to stream data processing with Apache Spark Streaming
- Hands-on exercise: A simple streaming application
Machine learning and advanced data analysis in Spark
- Overview of MLlib (Machine Learning Library) in Spark
- Building and evaluating machine learning models
- Hands-on exercise: Classification, regression and clustering with MLlib
- Spark integration with other data stores (eg HDFS, Amazon S3)
Optimizing and tuning the performance of Spark
applications
- Monitoring and debugging Spark applications
- Working with Spark UI for application performance analysis
- Performance optimization using partitioning and persistence
- Practical tips and tricks for efficient processing of big data
Scaling and deploying Spark applications
- Spark cluster architecture and its configuration
- Scaling Spark applications vertically and horizontally
- Deploying Spark applications in a production environment
- Best practices for working with Apache Spark
- Final discussion, answers to questions and feedback from
participants