DRM-free · Yours to keep forever
Data Engineering

Spark Performance Tuning: A Field Guide

Diagnosing shuffles, skew and memory pressure in production

By Houssam Kodad

PDF · 176 pages · Advanced · English

One-time purchase

€24.95

VAT included where applicable

  • Instant download after purchase
  • Readable on any device
  • Free updates to this edition
  • Secure checkout

About this book

What's inside

A Spark job that works on a sample can fall over on the full dataset, and the error message rarely tells you why. This focused field guide teaches you to read the Spark UI like an instrument panel and fix the real causes of slow jobs: shuffles, skew, spills and bad partitioning. It's short on theory and long on the diagnostic moves that turn a six-hour job into a forty-minute one.

What you'll learn

Skills you'll walk away with

  • Read the Spark UI to locate the real bottleneck
  • Diagnose and fix data skew in joins and aggregations
  • Control partitioning to avoid tiny and giant tasks
  • Recognise and reduce expensive shuffles
  • Tune memory to stop spills and out-of-memory failures
  • Use broadcast joins and adaptive query execution well
  • Right-size executors, cores and cluster resources

Table of contents

9 chapters
  1. How Spark Actually Runs Your Job
     • Jobs, stages and tasks
     • Lazy evaluation and the DAG
     • Wide vs narrow transformations
  2. Reading the Spark UI
     • The stages and tasks views
     • Spotting straggler tasks
     • Input, shuffle and spill metrics
  3. The Shuffle, Demystified
     • Why shuffles are expensive
     • Shuffle partitions and their size
     • Reducing shuffle volume
  4. Data Skew and How to Beat It
     • Detecting skew from the UI
     • Salting skewed keys
     • Skew-aware joins
  5. Partitioning for Speed
     • Too many vs too few partitions
     • Repartition and coalesce
     • Partition pruning on read
  6. Memory, Spills and OOMs
     • Execution vs storage memory
     • Diagnosing spills to disk
     • Avoiding out-of-memory crashes
  7. Joins That Do Not Melt the Cluster
     • Broadcast hash joins
     • Sort-merge joins
     • Adaptive query execution
  8. Right-Sizing the Cluster
     • Executors, cores and memory
     • Dynamic allocation
     • Cost vs runtime trade-offs
  9. A Tuning Playbook
     • A repeatable diagnostic process
     • Before-and-after case studies
     • Quick wins checklist

This is the full chapter list — exactly what you'll receive in the PDF.