DRM-free · Yours to keep forever
Cloud & Infrastructure

Data Lakes on AWS

Designing a cost-effective lakehouse with S3, Glue and Athena

By Houssam Kodad

PDF · 256 pages · Intermediate · English

One-time purchase

€29.95

VAT included where applicable

  • Instant download after purchase
  • Readable on any device
  • Free updates to this edition
  • Secure checkout

About this book

What's inside

AWS gives you a hundred ways to build a data platform and very little guidance on which to pick. This book lays out an opinionated, cost-effective lakehouse on S3, Glue and Athena, with Lake Formation for governance and open table formats for reliability. You'll learn the partitioning, file-format and security decisions that determine whether your lake stays fast and cheap or quietly becomes a swamp.

What you'll learn

Skills you'll walk away with

  • Lay out S3 for performance, cost and lifecycle
  • Catalog data with Glue and crawlers you can trust
  • Query at scale with Athena and partition projection
  • Adopt open table formats like Apache Iceberg
  • Govern access with Lake Formation and IAM
  • Build ETL with Glue jobs and orchestration
  • Keep query and storage costs under control

Table of contents

9 chapters
  1. A Lakehouse on AWS
    • Lake vs warehouse vs lakehouse
    • The S3-Glue-Athena core
    • Where Redshift fits
  2. Designing S3 Storage
    • Bucket and prefix layout
    • Partitioning strategies
    • Storage classes and lifecycle
  3. File Formats and Compression
    • Parquet and columnar layout
    • File sizing and small-file pain
    • Compaction strategies
  4. The Glue Data Catalog
    • Databases, tables and schemas
    • Crawlers vs explicit schemas
    • Schema evolution
  5. Querying with Athena
    • SQL over the lake
    • Partition projection
    • Cost and performance tuning
  6. Open Table Formats
    • Why Iceberg matters
    • Upserts, deletes and time travel
    • Migration considerations
  7. ETL with Glue
    • Glue jobs and bookmarks
    • Spark on Glue
    • Orchestration with Step Functions
  8. Governance with Lake Formation
    • Fine-grained access control
    • Row and column security
    • Cross-account sharing
  9. Cost and Operations
    • Athena and storage cost levers
    • Monitoring and logging
    • A reliability checklist

This is the full chapter list — exactly what you'll receive in the PDF.