Big Data Analytics.

A handy reference guide for data analysts and data scientists to help obtain value from big data analytics using Spark on Hadoop clusters. About This Book: This book is based on the latest 2.0 version of Apache Spark and the 2.7 version of Hadoop, integrated with the most commonly used tools. Learn all Spark...

Bibliographic Details
Classification: Electronic Book
Main author: Ankam, Venkat (Author)
Format: Electronic eBook
Language: English
Published: Packt Publishing, 2016.
Subjects:
Online access: Full text
Table of Contents:
  • Cover; Copyright; Credits; About the Author; Acknowledgement; About the Reviewers; www.PacktPub.com; Preface; Chapter 1: Big Data Analytics at a 10,000-Foot View; Big Data analytics and the role of Hadoop and Spark; A typical Big Data analytics project life cycle; Identifying the problem and outcomes; Identifying the necessary data; Data collection; Preprocessing data and ETL; Performing analytics; Visualizing data; The role of Hadoop and Spark; Big Data science and the role of Hadoop and Spark; A fundamental shift from data analytics to data science; Data scientists versus software engineers.
  • Data scientists versus data analysts; Data scientists versus business analysts; A typical data science project life cycle; Hypothesis and modeling; Measuring the effectiveness; Making improvements; Communicating the results; The role of Hadoop and Spark; Tools and techniques; Real-life use cases; Summary; Chapter 2: Getting Started with Apache Hadoop and Apache Spark; Introducing Apache Hadoop; Hadoop Distributed File System; Features of HDFS; MapReduce; MapReduce features; MapReduce v1 versus MapReduce v2; MapReduce v1 challenges; YARN; Storage options on Hadoop; File formats.
  • Compression formats; Introducing Apache Spark; Spark history; What is Apache Spark?; What Apache Spark is not; MapReduce issues; Spark's stack; Why Hadoop plus Spark?; Hadoop features; Spark features; Frequently asked questions about Spark; Installing Hadoop plus Spark clusters; Summary; Chapter 3: Deep Dive into Apache Spark; Starting Spark daemons; Working with CDH; Working with HDP, MapR, and Spark pre-built packages; Learning Spark core concepts; Ways to work with Spark; Spark Shell; Spark applications; Resilient Distributed Dataset; Method 1: parallelizing a collection; Method 2: reading from a file; Spark context; Transformations and actions; Parallelism in RDDs; Lazy evaluation; Lineage Graph; Serialization; Leveraging Hadoop file formats in Spark; Data locality; Shared variables; Pair RDDs; Lifecycle of Spark program; Pipelining; Spark execution summary; Spark applications; Spark Shell versus Spark applications; Creating a Spark context; SparkConf; SparkSubmit; Spark Conf precedence order; Important application configurations; Persistence and caching; Storage levels; What level to choose?; Spark resource managers: Standalone, YARN, and Mesos.
  • Local versus cluster mode; Cluster resource managers; Standalone; YARN; Mesos; Which resource manager to use?; Summary; Chapter 4: Big Data Analytics with Spark SQL, DataFrames, and Datasets; History of Spark SQL; Architecture of Spark SQL; Introducing SQL, Datasources, DataFrame, and Dataset APIs; Evolution of DataFrames and Datasets; What's wrong with RDDs?; RDD Transformations versus Dataset and DataFrames Transformations; Why Datasets and DataFrames?; Optimization; Speed; Automatic Schema Discovery; Multiple sources, multiple languages; Interoperability between RDDs and others.