Spark Performance Tuning: A Field Guide

Diagnosing shuffles, skew and memory pressure in production

By Houssam Kodad

PDF · 176 pages · Advanced · English

One-time purchase

€24.95

VAT included where applicable

Instant download after purchase
Readable on any device
Free updates to this edition
Secure checkout

About this book

What's inside

A Spark job that works on a sample can fall over on the full dataset, and the error message rarely tells you why. This focused field guide teaches you to read the Spark UI like an instrument panel and fix the real causes of slow jobs: shuffles, skew, spills and bad partitioning. It's short on theory and long on the diagnostic moves that turn a six-hour job into a forty-minute one.

What you'll learn

Skills you'll walk away with

- Read the Spark UI to locate the real bottleneck
- Diagnose and fix data skew in joins and aggregations
- Control partitioning to avoid tiny and giant tasks
- Recognise and reduce expensive shuffles
- Tune memory to stop spills and out-of-memory failures
- Use broadcast joins and adaptive query execution well
- Right-size executors, cores and cluster resources

Table of contents

9 chapters

01 How Spark Actually Runs Your Job
- Jobs, stages and tasks
- Lazy evaluation and the DAG
- Wide vs narrow transformations

02 Reading the Spark UI
- The stages and tasks views
- Spotting straggler tasks
- Input, shuffle and spill metrics

03 The Shuffle, Demystified
- Why shuffles are expensive
- Shuffle partitions and their size
- Reducing shuffle volume

04 Data Skew and How to Beat It
- Detecting skew from the UI
- Salting skewed keys
- Skew-aware joins

05 Partitioning for Speed
- Too many vs too few partitions
- Repartition and coalesce
- Partition pruning on read

06 Memory, Spills and OOMs
- Execution vs storage memory
- Diagnosing spills to disk
- Avoiding out-of-memory crashes

07 Joins That Do Not Melt the Cluster
- Broadcast hash joins
- Sort-merge joins
- Adaptive query execution

08 Right-Sizing the Cluster
- Executors, cores and memory
- Dynamic allocation
- Cost vs runtime trade-offs

09 A Tuning Playbook
- A repeatable diagnostic process
- Before-and-after case studies
- Quick wins checklist