Modern Data Architectures with Python: A practical guide to building and deploying data pipelines, data warehouses, and data lakes with Python
Build scalable and reliable data ecosystems using Data Mesh, Databricks Spark, and Kafka.

Key Features:
- Develop modern data skills used in emerging technologies
- Learn pragmatic design methodologies such as Data Mesh and data lakehouses
- Gain a deeper understanding of data governance
- Purchase of the pri...
| Classification: | Electronic book |
| --- | --- |
| Main author: | |
| Format: | Electronic eBook |
| Language: | English |
| Published: | Birmingham, UK: Packt Publishing Ltd., 2023 |
| Edition: | 1st edition |
| Subjects: | |
| Online access: | Full text (requires prior registration with an institutional email) |
Table of Contents:
- Cover
- Title Page
- Copyright and Credits
- Dedications
- Contributors
- Table of Contents
- Preface
- Part 1: Fundamental Data Knowledge
- Chapter 1: Modern Data Processing Architecture
- Technical requirements
- Databases, data warehouses, and data lakes
- OLTP
- OLAP
- Data lakes
- Event stores
- File formats
- Data platform architecture at a high level
- Comparing the Lambda and Kappa architectures
- Lambda architecture
- Kappa architecture
- Lakehouse and Delta architectures
- Lakehouses
- The seven central tenets
- The medallion data pattern and the Delta architecture
- Data mesh theory and practice
- Defining terms
- The four principles of data mesh
- Summary
- Practical lab
- Solution
- Chapter 2: Understanding Data Analytics
- Technical requirements
- Setting up your environment
- Python
- venv
- Graphviz
- Workflow initialization
- Cleaning and preparing your data
- Duplicate values
- Working with nulls
- Using RegEx
- Outlier identification
- Casting columns
- Fixing column names
- Complex data types
- Data documentation
- diagrams
- Data lineage graphs
- Data modeling patterns
- Relational
- Dimensional modeling
- Key terms
- OBT
- Practical lab
- Loading the problem data
- Solution
- Summary
- Part 2: Data Engineering Toolset
- Chapter 3: Apache Spark Deep Dive
- Technical requirements
- Setting up your environment
- Python, AWS, and Databricks
- Databricks CLI
- Cloud data storage
- Object storage
- Relational
- NoSQL
- Spark architecture
- Introduction to Apache Spark
- Key components
- Working with partitions
- Shuffling partitions
- Caching
- Broadcasting
- Job creation pipeline
- Delta Lake
- Transaction log
- Grouping tables with databases
- Table
- Adding speed with Z-ordering
- Bloom filters
- Practical lab
- Problem 1
- Problem 2
- Problem 3
- Solution
- Summary
- Chapter 4: Batch and Stream Data Processing Using PySpark
- Technical requirements
- Setting up your environment
- Python, AWS, and Databricks
- Databricks CLI
- Batch processing
- Partitioning
- Data skew
- Reading data
- Spark schemas
- Making decisions
- Removing unwanted columns
- Working with data in groups
- The UDF
- Stream processing
- Reading from disk
- Debugging
- Writing to disk
- Batch stream hybrid
- Delta streaming
- Batch processing in a stream
- Practical lab
- Setup
- Creating fake data
- Problem 1
- Problem 2
- Problem 3
- Solution
- Solution 1
- Solution 2
- Solution 3
- Summary
- Chapter 5: Streaming Data with Kafka
- Technical requirements
- Setting up your environment
- Python, AWS, and Databricks
- Databricks CLI
- Confluent Kafka
- Signing up
- Kafka architecture
- Topics
- Partitions
- Brokers
- Producers
- Consumers
- Schema Registry
- Kafka Connect
- Spark and Kafka
- Practical lab
- Solution
- Summary