Mastering Spark for Data Science
"Master the techniques and sophisticated analytics used to construct Spark-based solutions that scale to deliver production-grade data science products."
Classification: | eBook
---|---
Main author: |
Other authors: |
Format: | Electronic eBook
Language: | English
Published: | Birmingham, UK : Packt Publishing Ltd., 2017.
Subjects: |
Online access: | Full text
Table of Contents:
- Cover; Copyright; Credits; Foreword; About the Authors; About the Reviewer; www.PacktPub.com; Customer Feedback; Table of Contents; Preface; Chapter 1: The Big Data Science Ecosystem; Introducing the Big Data ecosystem; Data management; Data management responsibilities; The right tool for the job; Overall architecture; Data Ingestion; Data Lake; Reliable storage; Scalable data processing capability; Data science platform; Data Access; Data technologies; The role of Apache Spark; Companion tools; Apache HDFS; Advantages; Disadvantages; Installation; Amazon S3; Advantages; Disadvantages.
- Installation; Apache Kafka; Advantages; Disadvantages; Installation; Apache Parquet; Advantages; Disadvantages; Installation; Apache Avro; Advantages; Disadvantages; Installation; Apache NiFi; Advantages; Disadvantages; Installation; Apache YARN; Advantages; Disadvantages; Installation; Apache Lucene; Advantages; Disadvantages; Installation; Kibana; Advantages; Disadvantages; Installation; Elasticsearch; Advantages; Disadvantages; Installation; Accumulo; Advantages; Disadvantages; Installation; Summary; Chapter 2: Data Acquisition; Data pipelines; Universal ingestion framework.
- Introducing the GDELT news stream; Discovering GDELT in real-time; Our first GDELT feed; Improving with publish and subscribe; Content registry; Choices and more choices; Going with the flow; Metadata model; Kibana dashboard; Quality assurance; Example 1 - Basic quality checking, no contending users; Example 2 - Advanced quality checking, no contending users; Example 3 - Basic quality checking, 50% utility due to contending users; Summary; Chapter 3: Input Formats and Schema; A structured life is a good life; GDELT dimensional modeling.
- GDELT model; First look at the data; Core global knowledge graph model; Hidden complexity; Denormalized models; Challenges with flattened data; Issue 1: Loss of contextual information; Issue 2: Re-establishing dimensions; Issue 3: Including reference data; Loading your data; Schema agility; Reality check; GKG ELT; Position matters; Avro; Spark-Avro method; Pedagogical method; When to perform Avro transformation; Parquet; Summary; Chapter 4: Exploratory Data Analysis; The problem, principles and planning; Understanding the EDA problem; Design principles; General plan of exploration; Preparation.
- Introducing mask based data profiling; Introducing character class masks; Building a mask based profiler; Setting up Apache Zeppelin; Constructing a reusable notebook; Exploring GDELT; GDELT GKG datasets; The files; Special collections; Reference data; Exploring the GKG v2.1; The Translingual files; A configurable GCAM time series EDA; Plot.ly charting on Apache Zeppelin; Exploring translation sourced GCAM sentiment with plot.ly; Concluding remarks; A configurable GCAM Spatio-Temporal EDA; Introducing GeoGCAM; Does our spatial pivot work?; Summary; Chapter 5: Spark for Geographic Analysis.