
Data Pipelines with Apache Airflow.

Data Pipelines with Apache Airflow teaches you how to build and maintain effective data pipelines. You'll explore the most common usage patterns, including aggregating multiple data sources, connecting to and from data lakes, and cloud deployment. Part reference and part tutorial, this practica...


Bibliographic Details
Classification: Electronic book
Main author: Ruiter, Julian de
Other authors: Harenslak, Bas
Format: Electronic eBook
Language: English
Published: [Place of publication not identified] : Simon & Schuster : Manning, 2021.
Subjects:
Online access: Full text (requires prior registration with an institutional email)
Table of Contents:
  • Intro
  • inside front cover
  • Data Pipelines with Apache Airflow
  • Copyright
  • brief contents
  • contents
  • front matter
  • preface
  • acknowledgments
  • Bas Harenslak
  • Julian de Ruiter
  • about this book
  • Who should read this book
  • How this book is organized: A road map
  • About the code
  • LiveBook discussion forum
  • about the authors
  • about the cover illustration
  • Part 1. Getting started
  • 1 Meet Apache Airflow
  • 1.1 Introducing data pipelines
  • 1.1.1 Data pipelines as graphs
  • 1.1.2 Executing a pipeline graph
  • 1.1.3 Pipeline graphs vs. sequential scripts
  • 1.1.4 Running pipelines using workflow managers
  • 1.2 Introducing Airflow
  • 1.2.1 Defining pipelines flexibly in (Python) code
  • 1.2.2 Scheduling and executing pipelines
  • 1.2.3 Monitoring and handling failures
  • 1.2.4 Incremental loading and backfilling
  • 1.3 When to use Airflow
  • 1.3.1 Reasons to choose Airflow
  • 1.3.2 Reasons not to choose Airflow
  • 1.4 The rest of this book
  • Summary
  • 2 Anatomy of an Airflow DAG
  • 2.1 Collecting data from numerous sources
  • 2.1.1 Exploring the data
  • 2.2 Writing your first Airflow DAG
  • 2.2.1 Tasks vs. operators
  • 2.2.2 Running arbitrary Python code
  • 2.3 Running a DAG in Airflow
  • 2.3.1 Running Airflow in a Python environment
  • 2.3.2 Running Airflow in Docker containers
  • 2.3.3 Inspecting the Airflow UI
  • 2.4 Running at regular intervals
  • 2.5 Handling failing tasks
  • Summary
  • 3 Scheduling in Airflow
  • 3.1 An example: Processing user events
  • 3.2 Running at regular intervals
  • 3.2.1 Defining scheduling intervals
  • 3.2.2 Cron-based intervals
  • 3.2.3 Frequency-based intervals
  • 3.3 Processing data incrementally
  • 3.3.1 Fetching events incrementally
  • 3.3.2 Dynamic time references using execution dates
  • 3.3.3 Partitioning your data
  • 3.4 Understanding Airflow's execution dates
  • 3.4.1 Executing work in fixed-length intervals
  • 3.5 Using backfilling to fill in past gaps
  • 3.5.1 Executing work back in time
  • 3.6 Best practices for designing tasks
  • 3.6.1 Atomicity
  • 3.6.2 Idempotency
  • Summary
  • 4 Templating tasks using the Airflow context
  • 4.1 Inspecting data for processing with Airflow
  • 4.1.1 Determining how to load incremental data
  • 4.2 Task context and Jinja templating
  • 4.2.1 Templating operator arguments
  • 4.2.2 What is available for templating?
  • 4.2.3 Templating the PythonOperator
  • 4.2.4 Providing variables to the PythonOperator
  • 4.2.5 Inspecting templated arguments
  • 4.3 Hooking up other systems
  • Summary
  • 5 Defining dependencies between tasks
  • 5.1 Basic dependencies
  • 5.1.1 Linear dependencies
  • 5.1.2 Fan-in/-out dependencies
  • 5.2 Branching
  • 5.2.1 Branching within tasks
  • 5.2.2 Branching within the DAG
  • 5.3 Conditional tasks
  • 5.3.1 Conditions within tasks
  • 5.3.2 Making tasks conditional
  • 5.3.3 Using built-in operators
  • 5.4 More about trigger rules