Beginning Apache Spark 3: with DataFrame, Spark SQL, structured streaming, and Spark machine learning library
Take a journey toward discovering, learning, and using Apache Spark 3.0. In this book, you will gain expertise in the powerful and efficient distributed data processing engine inside Apache Spark: its user-friendly, comprehensive, and flexible programming model for processing data in batch and streaming…
| Field | Value |
|---|---|
| Classification | Electronic book |
| Format | Electronic eBook |
| Language | English |
| Published | New York: Apress, 2021 |
| Edition | Second edition |
| Online access | Full text (requires prior registration with an institutional email) |
Table of Contents:
- Intro
- Table of Contents
- About the Author
- About the Technical Reviewers
- Acknowledgments
- Introduction
- Chapter 1: Introduction to Apache Spark
- Overview
- History
- Spark Core Concepts and Architecture
- Spark Cluster and Resource Management System
- Spark Applications
- Spark Drivers and Executors
- Spark Unified Stack
- Spark Core
- Spark SQL
- Spark Structured Streaming
- Spark MLlib
- Spark GraphX
- SparkR
- Apache Spark 3.0
- Adaptive Query Execution Framework
- Dynamic Partition Pruning (DPP)
- Accelerator-aware Scheduler
- Apache Spark Applications
- Spark Example Applications
- Apache Spark Ecosystem
- Delta Lake
- Koalas
- MLflow
- Summary
- Chapter 2: Working with Apache Spark
- Downloading and Installation
- Downloading Spark
- Installing Spark
- Spark Scala Shell
- Spark Python Shell
- Having Fun with the Spark Scala Shell
- Useful Spark Scala Shell Commands and Tips
- Basic Interactions with Scala and Spark
- Basic Interactions with Scala
- Spark UI and Basic Interactions with Spark
- Spark UI
- Basic Interactions with Spark
- Introduction to Collaborative Notebooks
- Create a Cluster
- Create a Folder
- Create a Notebook
- Setting up Spark Source Code
- Summary
- Chapter 3: Spark SQL: Foundation
- Understanding RDD
- Introduction to the DataFrame API
- Creating a DataFrame
- Creating a DataFrame from RDD
- Creating a DataFrame from a Range of Numbers
- Creating a DataFrame from Data Sources
- Creating a DataFrame by Reading Text Files
- Creating a DataFrame by Reading CSV Files
- Creating a DataFrame by Reading JSON Files
- Creating a DataFrame by Reading Parquet Files
- Creating a DataFrame by Reading ORC Files
- Creating a DataFrame from JDBC
- Working with Structured Operations
- Working with Columns
- Working with Structured Transformations
- select(columns)
- selectExpr(expressions)
- filter(condition), where(condition)
- distinct, dropDuplicates
- sort(columns), orderBy(columns)
- limit(n)
- union(otherDataFrame)
- withColumn(colName, column)
- withColumnRenamed(existingColName, newColName)
- drop(columnName1, columnName2)
- sample(fraction), sample(fraction, seed), sample(withReplacement, fraction, seed)
- randomSplit(weights)
- Working with Missing or Bad Data
- Working with Structured Actions
- describe(columnNames)
- Introduction to Datasets
- Creating Datasets
- Working with Datasets
- Using SQL in Spark SQL
- Running SQL in Spark
- Writing Data Out to Storage Systems
- The Trio: DataFrame, Dataset, and SQL
- DataFrame Persistence
- Summary
- Chapter 4: Spark SQL: Advanced
- Aggregations
- Aggregation Functions
- Common Aggregation Functions
- count(col)
- countDistinct(col)
- min(col), max(col)
- sum(col)
- sumDistinct(col)
- avg(col)
- skewness(col), kurtosis(col)
- variance(col), stddev(col)
- Aggregation with Grouping
- Multiple Aggregations per Group