Course Outline
PySpark & Machine Learning
Module 1: Big Data & Spark Foundations
- An overview of the Big Data ecosystem and Spark's role in contemporary data platforms
- Understanding Spark architecture: drivers, executors, cluster managers, lazy evaluation, DAGs, and execution planning
- Distinguishing between the RDD and DataFrame APIs and knowing when to use each
- Creating and configuring a SparkSession, along with understanding the fundamentals of application configuration
Module 2: PySpark DataFrames
- Reading and writing data to and from enterprise sources and formats (CSV, JSON, Parquet, Delta)
- Manipulating PySpark DataFrames: performing transformations and actions, using column expressions, filtering, joins, and aggregations
- Executing advanced operations such as window functions, managing timestamps, and working with nested data structures
- Implementing data quality checks and writing reusable, maintainable PySpark code
Module 3: Processing Large Datasets Efficiently
- Understanding performance fundamentals: partitioning strategies, shuffle behavior, caching, and persistence
- Applying optimization techniques such as broadcast joins and execution plan analysis
- Efficiently processing large datasets and adhering to best practices for scalable data workflows
- Understanding schema evolution and modern storage formats prevalent in enterprise environments
Module 4: Feature Engineering at Scale
- Conducting feature engineering with Spark MLlib: addressing missing values, encoding categorical variables, and scaling features
- Designing reusable preprocessing steps and preparing datasets for Machine Learning pipelines
- Introduction to feature selection techniques and handling imbalanced datasets
Module 5: Machine Learning with Spark MLlib
- Understanding the MLlib architecture and the Estimator/Transformer pattern
- Training regression and classification models at scale (Linear Regression, Logistic Regression, Decision Trees, Random Forest)
- Comparing models and interpreting results within distributed Machine Learning workflows
Module 6: End-to-End ML Pipelines
- Constructing end-to-end Machine Learning pipelines that integrate preprocessing, feature engineering, and modeling
- Applying train/validation/test split strategies
- Conducting cross-validation and hyperparameter tuning using grid search and random search
- Structuring reproducible Machine Learning experiments
Module 7: Model Evaluation & Practical ML Decision Making
- Applying suitable evaluation metrics for regression and classification problems
- Identifying overfitting and underfitting and making practical decisions regarding model selection
- Interpreting feature importance and gaining a deeper understanding of model behavior
Module 8: Production & Enterprise Practices
- Persisting and loading models in Spark
- Implementing batch inference workflows on large datasets
- Understanding the Machine Learning lifecycle within enterprise environments
- An introduction to versioning, experiment tracking concepts, and basic testing strategies
Practical Outcome
- Ability to work independently with PySpark
- Ability to process large datasets efficiently
- Ability to perform feature engineering at scale
- Ability to build scalable Machine Learning pipelines
Requirements
Participants should possess the following background:
Basic proficiency in Python programming, including familiarity with functions, data structures, and libraries
A fundamental grasp of data analysis concepts, such as datasets, transformations, and aggregations
Basic knowledge of SQL and relational data principles
An introductory understanding of Machine Learning concepts, including training datasets, features, and evaluation metrics
While not strictly required, familiarity with command line interfaces and basic software development practices is recommended
Experience with Pandas, NumPy, or comparable data processing libraries is beneficial but not mandatory.
Testimonials (1)
I liked that it was practical. I loved applying the theoretical knowledge through practical examples.