
Apache Spark Quick Start Guide : Quickly Learn the Art of Writing Efficient Big Data Applications with Apache Spark.

Apache Spark is a flexible in-memory framework that allows processing of both batch and real-time data. Its unified engine has made it quite popular for big data use cases. This book will help you to quickly get started with Apache Spark 2.0 and write efficient big data applications for a variety of...


Bibliographic Details
Classification: Electronic Book
Main Author: Mehrotra, Shrey
Other Authors: Grade, Akash
Format: Electronic eBook
Language: English
Published: Birmingham : Packt Publishing Ltd, 2019.
Subjects:
Online Access: Full text
Table of Contents:
  • Cover; Title Page; Copyright and Credits; About Packt; Contributors; Table of Contents; Preface; Chapter 1: Introduction to Apache Spark; What is Spark?; Spark architecture overview; Spark language APIs; Scala; Java; Python; R; SQL; Spark components; Spark Core; Spark SQL; Spark Streaming; Spark machine learning; Spark graph processing; Cluster manager; Standalone scheduler; YARN; Mesos; Kubernetes; Making the most of Hadoop and Spark; Summary; Chapter 2: Apache Spark Installation; AWS elastic compute cloud (EC2); Creating a free account on AWS; Connecting to your Linux instance
  • Configuring Spark; Prerequisites; Installing Java; Installing Scala; Installing Python; Installing Spark; Using Spark components; Different modes of execution; Spark sandbox; Summary; Chapter 3: Spark RDD; What is an RDD?; Resilient metadata; Programming using RDDs; Transformations and actions; Transformation; Narrow transformations; map(); flatMap(); filter(); union(); mapPartitions(); Wide transformations; distinct(); sortBy(); intersection(); subtract(); cartesian(); Action; collect(); count(); take(); top(); takeOrdered(); first(); countByValue(); reduce(); saveAsTextFile(); foreach()
  • Types of RDDs; Pair RDDs; groupByKey(); reduceByKey(); sortByKey(); join(); Caching and checkpointing; Caching; Checkpointing; Understanding partitions; repartition() versus coalesce(); partitionBy(); Drawbacks of using RDDs; Summary; Chapter 4: Spark DataFrame and Dataset; DataFrames; Creating DataFrames; Data sources; DataFrame operations and associated functions; Running SQL on DataFrames; Temporary views on DataFrames; Global temporary views on DataFrames; Datasets; Encoders; Internal row; Creating custom encoders; Summary; Chapter 5: Spark Architecture and Application Execution Flow
  • A sample application; DAG constructor; Stage; Tasks; Task scheduler; FIFO; FAIR; Application execution modes; Local mode; Client mode; Cluster mode; Application monitoring; Spark UI; Application logs; External monitoring solution; Summary; Chapter 6: Spark SQL; Spark SQL; Spark metastore; Using the Hive metastore in Spark SQL; Hive configuration with Spark; SQL language manual; Database; Table and view; Load data; Creating UDFs; SQL database using JDBC; Summary; Chapter 7: Spark Streaming, Machine Learning, and Graph Analysis; Spark Streaming; Use cases; Data sources; Stream processing
  • Microbatch; DStreams; Streaming architecture; Streaming example; Machine learning; MLlib; ML; Graph processing; GraphX; mapVertices; mapEdges; subgraph; GraphFrames; degrees; subgraphs; Graph algorithms; PageRank; Summary; Chapter 8: Spark Optimizations; Cluster-level optimizations; Memory; Disk; CPU cores; Project Tungsten; Application optimizations; Language choice; Structured versus unstructured APIs; File format choice; RDD optimizations; Choosing the right transformations; Serializing and compressing; Broadcast variables; DataFrame and dataset optimizations; Catalyst optimizer; Storage