Machine Learning with Spark - Second Edition.
Classification: | Electronic Book |
---|---|
Main Author: | |
Other Authors: | , |
Format: | Electronic eBook |
Language: | English |
Published: | Birmingham : Packt Publishing, 2016. |
Edition: | 2nd ed. |
Subjects: | |
Online Access: | Full text |
Table of Contents:
- Cover
- Credits
- About the Authors
- About the Reviewer
- www.PacktPub.com
- Customer Feedback
- Table of Contents
- Preface
- Chapter 1: Getting Up and Running with Spark
- Installing and setting up Spark locally
- Spark clusters
- The Spark programming model
- SparkContext and SparkConf
- SparkSession
- The Spark shell
- Resilient Distributed Datasets
- Creating RDDs
- Spark operations
- Caching RDDs
- Broadcast variables and accumulators
- SchemaRDD
- Spark data frame
- The first step to a Spark program in Scala
- The first step to a Spark program in Java
- The first step to a Spark program in Python
- The first step to a Spark program in R
- SparkR DataFrames
- Getting Spark running on Amazon EC2
- Launching an EC2 Spark cluster
- Configuring and running Spark on Amazon Elastic MapReduce
- UI in Spark
- Machine learning algorithms supported by Spark
- Benefits of using Spark ML as compared to existing libraries
- Spark Cluster on Google Compute Engine
- DataProc
- Hadoop and Spark Versions
- Creating a Cluster
- Submitting a Job
- Summary
- Chapter 2: Math for Machine Learning
- Linear algebra
- Setting up the Scala environment in IntelliJ
- Setting up the Scala environment on the Command Line
- Fields
- Real numbers
- Complex numbers
- Vectors
- Vector spaces
- Vector types
- Vectors in Breeze
- Vectors in Spark
- Vector operations
- Hyperplanes
- Vectors in machine learning
- Matrix
- Types of matrices
- Matrix in Spark
- Distributed matrix in Spark
- Matrix operations
- Determinant
- Eigenvalues and eigenvectors
- Singular value decomposition
- Matrices in machine learning
- Functions
- Function types
- Functional composition
- Hypothesis
- Gradient descent
- Prior, likelihood, and posterior
- Calculus
- Differential calculus
- Integral calculus
- Lagrange multipliers
- Plotting
- Summary
- Chapter 3: Designing a Machine Learning System
- What is Machine Learning?
- Introducing MovieStream
- Business use cases for a machine learning system
- Personalization
- Targeted marketing and customer segmentation
- Predictive modeling and analytics
- Types of machine learning models
- The components of a data-driven machine learning system
- Data ingestion and storage
- Data cleansing and transformation
- Model training and testing loop
- Model deployment and integration
- Model monitoring and feedback
- Batch versus real time
- Data Pipeline in Apache Spark
- An architecture for a machine learning system
- Spark MLlib
- Performance improvements in Spark ML over Spark MLlib
- Comparing algorithms supported by MLlib
- Classification
- Clustering
- Regression
- MLlib supported methods and developer APIs
- Spark Integration
- MLlib vision
- MLlib versions compared
- Spark 1.6 to 2.0
- Summary
- Chapter 4: Obtaining, Processing, and Preparing Data with Spark
- Accessing publicly available datasets
- The MovieLens 100k dataset
- Exploring and visualizing your data
- Exploring the user dataset
- Count by occupation
- Movie dataset
- Exploring the rating dataset
- Rating count bar chart
- Distribution of number ratings
- Processing and transforming your data
- Filling in bad or missing data
- Extracting useful features from your data
- Numerical features
- Categorical features
- Derived features
- Transforming timestamps into categorical features
- Extract time of day
- Text features
- Simple text feature extraction
- Sparse Vectors from Titles
- Normalizing features
- Using ML for feature normalization
- Using packages for feature extraction
- TF-IDF
- IDF
- Word2Vector
- Skip-gram model
- Standard scaler
- Summary
- Chapter 5: Building a Recommendation Engine with Spark
- Types of recommendation models
- Content-based filtering
- Collaborative filtering
- Matrix factorization
- Explicit matrix factorization
- Implicit Matrix Factorization
- Basic model for Matrix Factorization
- Alternating least squares
- Extracting the right features from your data
- Extracting features from the MovieLens 100k dataset
- Training the recommendation model
- Training a model on the MovieLens 100k dataset
- Training a model using Implicit feedback data
- Using the recommendation model
- ALS Model recommendations
- User recommendations
- Generating movie recommendations from the MovieLens 100k dataset
- Inspecting the recommendations
- Item recommendations
- Generating similar movies for the MovieLens 100k dataset
- Inspecting the similar items
- Evaluating the performance of recommendation models
- ALS Model Evaluation
- Mean Squared Error
- Mean Average Precision at K
- Using MLlib's built-in evaluation functions
- RMSE and MSE
- MAP
- FP-Growth algorithm
- FP-Growth Basic Sample
- FP-Growth Applied to MovieLens Data
- Summary
- Chapter 6: Building a Classification Model with Spark
- Types of classification models
- Linear models
- Logistic regression
- Multinomial logistic regression
- Visualizing the StumbleUpon dataset
- Extracting features from the Kaggle/StumbleUpon evergreen classification dataset
- StumbleUponExecutor
- Linear support vector machines
- The naive Bayes model
- Decision trees
- Ensembles of trees
- Random Forests
- Gradient-Boosted Trees
- Multilayer perceptron classifier
- Extracting the right features from your data
- Training classification models
- Training a classification model on the Kaggle/StumbleUpon evergreen classification dataset
- Using classification models
- Generating predictions for the Kaggle/StumbleUpon evergreen classification dataset
- Evaluating the performance of classification models
- Accuracy and prediction error
- Precision and recall
- ROC curve and AUC
- Improving model performance and tuning parameters
- Feature standardization
- Additional features
- Using the correct form of data
- Tuning model parameters
- Linear models
- Iterations
- Step size
- Regularization
- Decision trees
- Tuning tree depth and impurity
- The naive Bayes model
- Cross-validation
- Summary
- Chapter 7: Building a Regression Model with Spark
- Types of regression models
- Least squares regression
- Decision trees for regression
- Evaluating the performance of regression models
- Mean Squared Error and Root Mean Squared Error
- Mean Absolute Error
- Root Mean Squared Log Error
- The R-squared coefficient
- Extracting the right features from your data
- Extracting features from the bike sharing dataset
- Training and using regression models
- BikeSharingExecutor
- Training a regression model on the bike sharing dataset
- Linear regression
- Generalized linear regression
- Decision tree regression
- Ensembles of trees
- Random forest regression
- Gradient boosted tree regression
- Improving model performance and tuning parameters
- Transforming the target variable
- Impact of training on log-transformed targets
- Tuning model parameters
- Creating training and testing sets to evaluate parameters
- Splitting data for Decision tree
- The impact of parameter settings for linear models
- Iterations
- Step size
- L2 regularization
- L1 regularization
- Intercept
- The impact of parameter settings for the decision tree
- Tree depth
- Maximum bins
- The impact of parameter settings for the Gradient Boosted Trees
- Iterations
- MaxBins
- Summary
- Chapter 8: Building a Clustering Model with Spark
- Types of clustering models
- k-means clustering
- Initialization methods
- Mixture models
- Hierarchical clustering
- Extracting the right features from your data
- Extracting features from the MovieLens dataset
- K-means - training a clustering model
- Training a clustering model on the MovieLens dataset
- K-means - interpreting cluster predictions on the MovieLens dataset
- Interpreting the movie clusters
- K-means - evaluating the performance of clustering models
- Internal evaluation metrics
- External evaluation metrics
- Computing performance metrics on the MovieLens dataset
- Effect of iterations on WSSSE
- Bisecting KMeans
- Bisecting K-means - training a clustering model
- WSSSE and iterations
- Gaussian Mixture Model
- Clustering using GMM
- Plotting the user and item data with GMM clustering
- GMM - effect of iterations on cluster boundaries
- Summary
- Chapter 9: Dimensionality Reduction with Spark
- Types of dimensionality reduction
- Principal components analysis
- Singular value decomposition
- Relationship with matrix factorization
- Clustering as dimensionality reduction
- Extracting the right features from your data
- Extracting features from the LFW dataset
- Exploring the face data
- Visualizing the face data
- Extracting facial images as vectors
- Loading images
- Converting to grayscale and resizing the images
- Extracting feature vectors
- Normalization
- Training a dimensionality reduction model
- Running PCA on the LFW dataset
- Visualizing the Eigenfaces
- Interpreting the Eigenfaces
- Using a dimensionality reduction model
- Projecting data using PCA on the LFW dataset
- The relationship between PCA and SVD
- Evaluating dimensionality reduction models