Mastering Apache Spark 2.x - Second Edition.

Advanced analytics on your Big Data with the latest Apache Spark 2.x. About This Book: * An advanced guide with a combination of instructions and practical examples to extend the most up-to-date Spark functionalities. * Extend your data processing capabilities to process huge chunks of data in minimum time us...

Bibliographic Details
Classification: Electronic Book
Main author: Kienzler, Romeo
Format: Electronic eBook
Language: English
Published: Birmingham : Packt Publishing, 2017.
Edition: 2nd ed.
Subjects:
Online access: Full text
Table of Contents:
  • Cover; Copyright; Credits; About the Author; About the Reviewer; www.PacktPub.com; Customer Feedback; Table of Contents; Preface.
  • Chapter 1: A First Taste and What's New in Apache Spark V2; Spark machine learning; Spark Streaming; Spark SQL; Spark graph processing; Extended ecosystem; What's new in Apache Spark V2?; Cluster design; Cluster management; Local; Standalone; Apache YARN; Apache Mesos; Cloud-based deployments; Performance; The cluster structure; Hadoop Distributed File System; Data locality; Memory; Coding; Cloud; Summary.
  • Chapter 2: Apache Spark SQL; The SparkSession - your gateway to structured data processing; Importing and saving data; Processing the text files; Processing JSON files; Processing the Parquet files; Understanding the DataSource API; Implicit schema discovery; Predicate push-down on smart data sources; DataFrames; Using SQL; Defining schemas manually; Using SQL subqueries; Applying SQL table joins; Using Datasets; The Dataset API in action; User-defined functions; RDDs versus DataFrames versus Datasets; Summary.
  • Chapter 3: The Catalyst Optimizer; Understanding the workings of the Catalyst Optimizer; Managing temporary views with the catalog API; The SQL abstract syntax tree; How to go from Unresolved Logical Execution Plan to Resolved Logical Execution Plan; Internal class and object representations of LEPs; How to optimize the Resolved Logical Execution Plan; Physical Execution Plan generation and selection; Code generation; Practical examples; Using the explain method to obtain the PEP; How smart data sources work internally; Summary.
  • Chapter 4: Project Tungsten; Memory management beyond the Java Virtual Machine Garbage Collector; Understanding the UnsafeRow object; The null bit set region; The fixed length values region; The variable length values region; Understanding the BytesToBytesMap; A practical example on memory usage and performance; Cache-friendly layout of data in memory; Cache eviction strategies and pre-fetching; Code generation; Understanding columnar storage; Understanding whole stage code generation; A practical example on whole stage code generation performance; Operator fusing versus the volcano iterator model; Summary.
  • Chapter 5: Apache Spark Streaming; Overview; Errors and recovery; Checkpointing; Streaming sources; TCP stream; File streams; Flume; Kafka; Summary.
  • Chapter 6: Structured Streaming; The concept of continuous applications; True unification - same code, same engine; Windowing; How streaming engines use windowing; How Apache Spark improves windowing; Increased performance with good old friends; How transparent fault tolerance and exactly-once delivery guarantee is achieved; Replayable sources can replay streams from a given offset; Idempotent sinks prevent data duplication; State versioning guarantees consistent results after reruns; Example - connection to a MQTT message broker.