Learning PySpark: build data-intensive applications locally and deploy at scale using the combined powers of Python and Spark 2.0
Classification: | Electronic Book
---|---
Main Authors: | Tomasz Drabas, Denny Lee
Format: | Electronic eBook
Language: | English
Published: | Birmingham, UK: Packt Publishing, 2017
Online Access: | Full text
Table of Contents:
- Cover
- Copyright
- Credits
- Foreword
- About the Authors
- About the Reviewer
- www.PacktPub.com
- Customer Feedback
- Table of Contents
- Preface
- Chapter 1: Understanding Spark
- What is Apache Spark?
- Spark Jobs and APIs
- Execution process
- Resilient Distributed Dataset
- DataFrames
- Datasets
- Catalyst Optimizer
- Project Tungsten
- Spark 2.0 architecture
- Unifying Datasets and DataFrames
- Introducing SparkSession
- Tungsten phase 2
- Structured streaming
- Continuous applications
- Summary
- Chapter 2: Resilient Distributed Datasets
- Internal workings of an RDD
- Creating RDDs
- Schema
- Reading from files
- Lambda expressions
- Global versus local scope
- Transformations
- The .map(...) transformation
- The .filter(...) transformation
- The .flatMap(...) transformation
- The .distinct(...) transformation
- The .sample(...) transformation
- The .leftOuterJoin(...) transformation
- The .repartition(...) transformation
- Actions
- The .take(...) method
- The .collect(...) method
- The .reduce(...) method
- The .count(...) method
- The .saveAsTextFile(...) method
- The .foreach(...) method
- Summary
- Chapter 3: DataFrames
- Python to RDD communications
- Catalyst Optimizer refresh
- Speeding up PySpark with DataFrames
- Creating DataFrames
- Generating our own JSON data
- Creating a DataFrame
- Creating a temporary table
- Simple DataFrame queries
- DataFrame API query
- SQL query
- Interoperating with RDDs
- Inferring the schema using reflection
- Programmatically specifying the schema
- Querying with the DataFrame API
- Number of rows
- Running filter statements
- Querying with SQL
- Number of rows
- Running filter statements using the where clauses
- DataFrame scenario: on-time flight performance
- Preparing the source datasets
- Joining flight performance and airports
- Visualizing our flight-performance data
- Spark Dataset API
- Summary
- Chapter 4: Prepare Data for Modeling
- Checking for duplicates, missing observations, and outliers
- Duplicates
- Missing observations
- Outliers
- Getting familiar with your data
- Descriptive statistics
- Correlations
- Visualization
- Histograms
- Interactions between features
- Summary
- Chapter 5: Introducing MLlib
- Overview of the package
- Loading and transforming the data
- Getting to know your data
- Descriptive statistics
- Correlations
- Statistical testing
- Creating the final dataset
- Creating an RDD of LabeledPoints
- Splitting into training and testing
- Predicting infant survival
- Logistic regression in MLlib
- Selecting only the most predictable features
- Random forest in MLlib
- Summary
- Chapter 6: Introducing the ML Package
- Overview of the package
- Transformer
- Estimators
- Classification
- Regression
- Clustering
- Pipeline
- Predicting the chances of infant survival with ML
- Loading the data
- Creating transformers
- Creating an estimator
- Creating a pipeline
- Fitting the model
- Evaluating the performance of the model
- Saving the model
- Parameter hyper-tuning
- Grid search
- Train-validation splitting
- Other features of PySpark ML in action
- Feature extraction
- NLP-related feature extractors
- Discretizing continuous variables
- Standardizing continuous variables
- Classification
- Clustering
- Finding clusters in the births dataset
- Topic mining
- Regression
- Summary
- Chapter 7: GraphFrames
- Introducing GraphFrames
- Installing GraphFrames
- Creating a library
- Preparing your flights dataset
- Building the graph
- Executing simple queries
- Determining the number of airports and trips
- Determining the longest delay in this dataset
- Determining the number of delayed versus on-time/early flights
- What flights departing Seattle are most likely to have significant delays?
- What states tend to have significant delays departing from Seattle?
- Understanding vertex degrees
- Determining the top transfer airports
- Understanding motifs
- Determining airport ranking using PageRank
- Determining the most popular non-stop flights
- Using Breadth-First Search
- Visualizing flights using D3
- Summary
- Chapter 8: TensorFrames
- What is Deep Learning?
- The need for neural networks and Deep Learning
- What is feature engineering?
- Bridging the data and algorithm
- What is TensorFlow?
- Installing Pip
- Installing TensorFlow
- Matrix multiplication using constants
- Matrix multiplication using placeholders
- Running the model
- Running another model
- Discussion
- Introducing TensorFrames
- TensorFrames: quick start
- Configuration and setup
- Launching a Spark cluster
- Creating a TensorFrames library
- Installing TensorFlow on your cluster
- Using TensorFlow to add a constant to an existing column
- Executing the Tensor graph
- Blockwise reducing operations example
- Building a DataFrame of vectors
- Analysing the DataFrame
- Computing elementwise sum and min of all vectors
- Summary
- Chapter 9: Polyglot Persistence with Blaze
- Installing Blaze
- Polyglot persistence
- Abstracting data
- Working with NumPy arrays
- Working with pandas' DataFrame
- Working with files
- Working with databases
- Interacting with relational databases
- Interacting with the MongoDB database
- Data operations
- Accessing columns
- Symbolic transformations
- Operations on columns
- Reducing data
- Joins
- Summary
- Chapter 10: Structured Streaming
- What is Spark Streaming?
- Why do we need Spark Streaming?
- What is the Spark Streaming application data flow?
- Simple streaming application using DStreams
- A quick primer on global aggregations
- Introducing Structured Streaming
- Summary
- Chapter 11: Packaging Spark Applications
- The spark-submit command
- Command line parameters
- Deploying the app programmatically
- Configuring your SparkSession
- Creating SparkSession
- Modularizing code
- Structure of the module
- Calculating the distance between two points
- Converting distance units
- Building an egg
- User defined functions in Spark
- Submitting a job
- Monitoring execution
- Databricks Jobs
- Summary
- Index