Spark Performance Tuning: A Field Guide
Diagnosing shuffles, skew and memory pressure in production
By Houssam Kodad
One-time purchase
€24.95
VAT included where applicable
- Instant download after purchase
- Readable on any device
- Free updates to this edition
- Secure checkout
About this book
What's inside
A Spark job that works on a sample can fall over on the full dataset, and the error message rarely tells you why. This focused field guide teaches you to read the Spark UI like an instrument panel and fix the real causes of slow jobs: shuffles, skew, spills and bad partitioning. It's short on theory and long on the diagnostic moves that turn a six-hour job into a forty-minute one.
What you'll learn
Skills you'll walk away with
- Read the Spark UI to locate the real bottleneck
- Diagnose and fix data skew in joins and aggregations
- Control partitioning to avoid tiny and giant tasks
- Recognise and reduce expensive shuffles
- Tune memory to stop spills and out-of-memory failures
- Use broadcast joins and adaptive query execution well
- Right-size executors, cores and cluster resources
Table of contents
9 chapters
01 How Spark Actually Runs Your Job
- Jobs, stages and tasks
- Lazy evaluation and the DAG
- Wide vs narrow transformations
02 Reading the Spark UI
- The stages and tasks views
- Spotting straggler tasks
- Input, shuffle and spill metrics
03 The Shuffle, Demystified
- Why shuffles are expensive
- Shuffle partitions and their size
- Reducing shuffle volume
04 Data Skew and How to Beat It
- Detecting skew from the UI
- Salting skewed keys
- Skew-aware joins
05 Partitioning for Speed
- Too many vs too few partitions
- Repartition and coalesce
- Partition pruning on read
06 Memory, Spills and OOMs
- Execution vs storage memory
- Diagnosing spills to disk
- Avoiding out-of-memory crashes
07 Joins That Do Not Melt the Cluster
- Broadcast hash joins
- Sort-merge joins
- Adaptive query execution
08 Right-Sizing the Cluster
- Executors, cores and memory
- Dynamic allocation
- Cost vs runtime trade-offs
09 A Tuning Playbook
- A repeatable diagnostic process
- Before-and-after case studies
- Quick wins checklist
This is the full chapter list: exactly what you'll receive in the PDF.
More in Data Engineering
Keep exploring this track
Building Reliable Data Pipelines with dbt and Airflow
Orchestration, testing and incremental models for production warehouses
Streaming Data Engineering with Kafka and Flink
Real-time pipelines, exactly-once processing and stateful streams
Data Modeling for Analytics
Dimensional design, slowly changing dimensions and the one-big-table debate
Data Quality and Observability
Contracts, tests and lineage for pipelines you can trust