
Mastering Spark for Data Science

"Master the techniques and sophisticated analytics used to construct Spark-based solutions that scale to deliver production-grade data science products."

Bibliographic Details
Classification: Electronic book
Main author: Morgan, Andrew
Other authors: Amend, Antoine; George, David; Hallett, Matthew
Format: Electronic eBook
Language: English
Published: Birmingham, UK : Packt Publishing Ltd., 2017.
Subjects:
Online access: Full text
Table of Contents:
  • Cover; Copyright; Credits; Foreword; About the Authors; About the Reviewer; www.PacktPub.com; Customer Feedback; Table of Contents; Preface; Chapter 1: The Big Data Science Ecosystem; Introducing the Big Data ecosystem; Data management; Data management responsibilities; The right tool for the job; Overall architecture; Data Ingestion; Data Lake; Reliable storage; Scalable data processing capability; Data science platform; Data Access; Data technologies; The role of Apache Spark; Companion tools; Apache HDFS; Advantages; Disadvantages; Installation; Amazon S3; Advantages; Disadvantages.
  • Installation; Apache Kafka; Advantages; Disadvantages; Installation; Apache Parquet; Advantages; Disadvantages; Installation; Apache Avro; Advantages; Disadvantages; Installation; Apache NiFi; Advantages; Disadvantages; Installation; Apache YARN; Advantages; Disadvantages; Installation; Apache Lucene; Advantages; Disadvantages; Installation; Kibana; Advantages; Disadvantages; Installation; Elasticsearch; Advantages; Disadvantages; Installation; Accumulo; Advantages; Disadvantages; Installation; Summary; Chapter 2: Data Acquisition; Data pipelines; Universal ingestion framework.
  • Introducing the GDELT news stream; Discovering GDELT in real-time; Our first GDELT feed; Improving with publish and subscribe; Content registry; Choices and more choices; Going with the flow; Metadata model; Kibana dashboard; Quality assurance; Example 1: Basic quality checking, no contending users; Example 2: Advanced quality checking, no contending users; Example 3: Basic quality checking, 50% utility due to contending users; Summary; Chapter 3: Input Formats and Schema; A structured life is a good life; GDELT dimensional modeling.
  • GDELT model; First look at the data; Core global knowledge graph model; Hidden complexity; Denormalized models; Challenges with flattened data; Issue 1: Loss of contextual information; Issue 2: Re-establishing dimensions; Issue 3: Including reference data; Loading your data; Schema agility; Reality check; GKG ELT; Position matters; Avro; Spark-Avro method; Pedagogical method; When to perform Avro transformation; Parquet; Summary; Chapter 4: Exploratory Data Analysis; The problem, principles and planning; Understanding the EDA problem; Design principles; General plan of exploration; Preparation.
  • Introducing mask based data profiling; Introducing character class masks; Building a mask based profiler; Setting up Apache Zeppelin; Constructing a reusable notebook; Exploring GDELT; GDELT GKG datasets; The files; Special collections; Reference data; Exploring the GKG v2.1; The Translingual files; A configurable GCAM time series EDA; Plot.ly charting on Apache Zeppelin; Exploring translation sourced GCAM sentiment with plot.ly; Concluding remarks; A configurable GCAM Spatio-Temporal EDA; Introducing GeoGCAM; Does our spatial pivot work?; Summary; Chapter 5: Spark for Geographic Analysis.