Beginning Apache Spark 3: with DataFrame, Spark SQL, structured streaming, and Spark machine learning library

Take a journey toward discovering, learning, and using Apache Spark 3.0. In this book, you will gain expertise on the powerful and efficient distributed data processing engine inside of Apache Spark; its user-friendly, comprehensive, and flexible programming model for processing data in batch and st...

Bibliographic Details
Classification: Electronic book
Main author: Luu, Hien (Author)
Format: Electronic eBook
Language: English
Published: New York: Apress, 2021.
Edition: Second edition.
Online access: Full text (requires prior registration with an institutional email address)
Table of Contents:
  • Intro
  • Table of Contents
  • About the Author
  • About the Technical Reviewers
  • Acknowledgments
  • Introduction
  • Chapter 1: Introduction to Apache Spark
  • Overview
  • History
  • Spark Core Concepts and Architecture
  • Spark Cluster and Resource Management System
  • Spark Applications
  • Spark Drivers and Executors
  • Spark Unified Stack
  • Spark Core
  • Spark SQL
  • Spark Structured Streaming
  • Spark MLlib
  • Spark GraphX
  • SparkR
  • Apache Spark 3.0
  • Adaptive Query Execution Framework
  • Dynamic Partition Pruning (DPP)
  • Accelerator-aware Scheduler
  • Apache Spark Applications
  • Spark Example Applications
  • Apache Spark Ecosystem
  • Delta Lake
  • Koalas
  • MLflow
  • Summary
  • Chapter 2: Working with Apache Spark
  • Downloading and Installation
  • Downloading Spark
  • Installing Spark
  • Spark Scala Shell
  • Spark Python Shell
  • Having Fun with the Spark Scala Shell
  • Useful Spark Scala Shell Commands and Tips
  • Basic Interactions with Scala and Spark
  • Basic Interactions with Scala
  • Spark UI and Basic Interactions with Spark
  • Spark UI
  • Basic Interactions with Spark
  • Introduction to Collaborative Notebooks
  • Create a Cluster
  • Create a Folder
  • Create a Notebook
  • Setting up Spark Source Code
  • Summary
  • Chapter 3: Spark SQL: Foundation
  • Understanding RDD
  • Introduction to the DataFrame API
  • Creating a DataFrame
  • Creating a DataFrame from RDD
  • Creating a DataFrame from a Range of Numbers
  • Creating a DataFrame from Data Sources
  • Creating a DataFrame by Reading Text Files
  • Creating a DataFrame by Reading CSV Files
  • Creating a DataFrame by Reading JSON Files
  • Creating a DataFrame by Reading Parquet Files
  • Creating a DataFrame by Reading ORC Files
  • Creating a DataFrame from JDBC
  • Working with Structured Operations
  • Working with Columns
  • Working with Structured Transformations
  • select(columns)
  • selectExpr(expressions)
  • filter(condition), where(condition)
  • distinct, dropDuplicates
  • sort(columns), orderBy(columns)
  • limit(n)
  • union(otherDataFrame)
  • withColumn(colName, column)
  • withColumnRenamed(existingColName, newColName)
  • drop(columnName1, columnName2)
  • sample(fraction), sample(fraction, seed), sample(fraction, seed, withReplacement)
  • randomSplit(weights)
  • Working with Missing or Bad Data
  • Working with Structured Actions
  • describe(columnNames)
  • Introduction to Datasets
  • Creating Datasets
  • Working with Datasets
  • Using SQL in Spark SQL
  • Running SQL in Spark
  • Writing Data Out to Storage Systems
  • The Trio: DataFrame, Dataset, and SQL
  • DataFrame Persistence
  • Summary
  • Chapter 4: Spark SQL: Advanced
  • Aggregations
  • Aggregation Functions
  • Common Aggregation Functions
  • count(col)
  • countDistinct(col)
  • min(col), max(col)
  • sum(col)
  • sumDistinct(col)
  • avg(col)
  • skewness(col), kurtosis(col)
  • variance(col), stddev(col)
  • Aggregation with Grouping
  • Multiple Aggregations per Group