Applied data science using PySpark : learn the end-to-end predictive model-building cycle
Discover the capabilities of PySpark and its application in the realm of data science. This comprehensive guide with hand-picked examples of daily use cases will walk you through the end-to-end predictive model-building cycle with the latest techniques and tricks of the trade. Applied Data Science U...
Classification: | Electronic Book |
---|---|
Main Author: | |
Other Authors: | , |
Format: | Electronic eBook |
Language: | English |
Published: | Berkeley, CA : Apress, 2021. |
Subjects: | |
Online Access: | Full text (requires prior registration with an institutional email address) |
Table of Contents:
- Intro
- Table of Contents
- About the Authors
- About the Technical Reviewer
- Acknowledgments
- Foreword 1
- Foreword 2
- Foreword 3
- Introduction
- Chapter 1: Setting Up the PySpark Environment
- Local Installation using Anaconda
- Step 1: Install Anaconda
- Step 2: Conda Environment Creation
- Step 3: Download and Unpack Apache Spark
- Step 4: Install Java 8 or Later
- Step 5: Mac & Linux Users
- Step 6: Windows Users
- Step 7: Run PySpark
- Step 8: Jupyter Notebook Extension
- Docker-based Installation
- Why Do We Need to Use Docker?
- What Is Docker?
- Create a Simple Docker Image
- Download PySpark Docker
- Step-by-Step Approach to Understanding the Docker PySpark run Command
- Databricks Community Edition
- Create Databricks Account
- Create a New Cluster
- Create Notebooks
- How Do You Import Data Files into the Databricks Environment?
- Basic Operations
- Upload Data
- Access Data
- Calculate Pi
- Summary
- Chapter 2: PySpark Basics
- PySpark Background
- PySpark Resilient Distributed Datasets (RDDs) and DataFrames
- Data Manipulations
- Reading Data from a File
- Reading Data from Hive Table
- Reading Metadata
- Counting Records
- Subset Columns and View a Glimpse of the Data
- Missing Values
- One-Way Frequencies
- Sorting and Filtering One-Way Frequencies
- Casting Variables
- Descriptive Statistics
- Unique/Distinct Values and Counts
- Filtering
- Creating New Columns
- Deleting and Renaming Columns
- Summary
- Chapter 3: Utility Functions and Visualizations
- Additional Data Manipulations
- String Functions
- Registering DataFrames
- Window Functions
- Other Useful Functions
- Collect List
- Sampling
- Caching and Persisting
- Saving Data
- Pandas Support
- Joins
- Dropping Duplicates
- Data Visualizations
- Introduction to Machine Learning
- Summary
- Chapter 4: Variable Selection
- Exploratory Data Analysis
- Cardinality
- Missing Values
- Missing at Random (MAR)
- Missing Completely at Random (MCAR)
- Missing Not at Random (MNAR)
- Code 1: Cardinality Check
- Code 2: Missing Values Check
- Step 1: Identify Variable Types
- Step 2: Apply StringIndexer to Character Columns
- Step 3: Assemble Features
- Built-in Variable Selection Process: Without Target
- Principal Component Analysis
- Mechanics
- Singular Value Decomposition
- Built-in Variable Selection Process: With Target
- ChiSq Selector
- Model-based Feature Selection
- Custom-built Variable Selection Process
- Information Value Using Weight of Evidence
- Monotonic Binning Using Spearman Correlation
- How Do You Calculate the Spearman Correlation by Hand?
- How Is Spearman Correlation Used to Create Monotonic Bins for Continuous Variables?
- Custom Transformers
- Main Concepts in Pipelines
- Voting-based Selection
- Summary
- Chapter 5: Supervised Learning Algorithms
- Basics
- Regression
- Classification
- Loss Functions
- Optimizers