
Data Pipelines with Apache Airflow.

Data Pipelines with Apache Airflow teaches you how to build and maintain effective data pipelines. You'll explore the most common usage patterns, including aggregating multiple data sources, connecting to and from data lakes, and cloud deployment. Part reference and part tutorial, this practica...


Bibliographic Details
Classification: Electronic book
Main author: Ruiter, Julian de
Other authors: Harenslak, Bas
Format: Electronic eBook
Language: English
Published: [Place of publication not identified] : Simon & Schuster : Manning, 2021.
Subjects:
Online access: Full text (requires prior registration with an institutional email)
Table of Contents:
  • Intro
  • inside front cover
  • Data Pipelines with Apache Airflow
  • Copyright
  • brief contents
  • contents
  • front matter
  • preface
  • acknowledgments
  • Bas Harenslak
  • Julian de Ruiter
  • about this book
  • Who should read this book
  • How this book is organized: A road map
  • About the code
  • LiveBook discussion forum
  • about the authors
  • about the cover illustration
  • Part 1. Getting started
  • 1 Meet Apache Airflow
  • 1.1 Introducing data pipelines
  • 1.1.1 Data pipelines as graphs
  • 1.1.2 Executing a pipeline graph
  • 1.1.3 Pipeline graphs vs. sequential scripts
  • 1.1.4 Running pipelines using workflow managers
  • 1.2 Introducing Airflow
  • 1.2.1 Defining pipelines flexibly in (Python) code
  • 1.2.2 Scheduling and executing pipelines
  • 1.2.3 Monitoring and handling failures
  • 1.2.4 Incremental loading and backfilling
  • 1.3 When to use Airflow
  • 1.3.1 Reasons to choose Airflow
  • 1.3.2 Reasons not to choose Airflow
  • 1.4 The rest of this book
  • Summary
  • 2 Anatomy of an Airflow DAG
  • 2.1 Collecting data from numerous sources
  • 2.1.1 Exploring the data
  • 2.2 Writing your first Airflow DAG
  • 2.2.1 Tasks vs. operators
  • 2.2.2 Running arbitrary Python code
  • 2.3 Running a DAG in Airflow
  • 2.3.1 Running Airflow in a Python environment
  • 2.3.2 Running Airflow in Docker containers
  • 2.3.3 Inspecting the Airflow UI
  • 2.4 Running at regular intervals
  • 2.5 Handling failing tasks
  • Summary
  • 3 Scheduling in Airflow
  • 3.1 An example: Processing user events
  • 3.2 Running at regular intervals
  • 3.2.1 Defining scheduling intervals
  • 3.2.2 Cron-based intervals
  • 3.2.3 Frequency-based intervals
  • 3.3 Processing data incrementally
  • 3.3.1 Fetching events incrementally
  • 3.3.2 Dynamic time references using execution dates
  • 3.3.3 Partitioning your data
  • 3.4 Understanding Airflow's execution dates
  • 3.4.1 Executing work in fixed-length intervals
  • 3.5 Using backfilling to fill in past gaps
  • 3.5.1 Executing work back in time
  • 3.6 Best practices for designing tasks
  • 3.6.1 Atomicity
  • 3.6.2 Idempotency
  • Summary
  • 4 Templating tasks using the Airflow context
  • 4.1 Inspecting data for processing with Airflow
  • 4.1.1 Determining how to load incremental data
  • 4.2 Task context and Jinja templating
  • 4.2.1 Templating operator arguments
  • 4.2.2 What is available for templating?
  • 4.2.3 Templating the PythonOperator
  • 4.2.4 Providing variables to the PythonOperator
  • 4.2.5 Inspecting templated arguments
  • 4.3 Hooking up other systems
  • Summary
  • 5 Defining dependencies between tasks
  • 5.1 Basic dependencies
  • 5.1.1 Linear dependencies
  • 5.1.2 Fan-in/-out dependencies
  • 5.2 Branching
  • 5.2.1 Branching within tasks
  • 5.2.2 Branching within the DAG
  • 5.3 Conditional tasks
  • 5.3.1 Conditions within tasks
  • 5.3.2 Making tasks conditional
  • 5.3.3 Using built-in operators
  • 5.4 More about trigger rules