Frank Kane's Taming big data with Apache Spark and Python : real-world examples to help you analyze large datasets with Apache Spark /
Frank Kane's hands-on Spark training course, based on his bestselling Taming Big Data with Apache Spark and Python video, now available in a book. Understand and analyze large data sets using Spark on a single system or on a cluster. About This Book* Understand how Spark can be distributed acro...
Classification: | Electronic Book |
---|---|
Main author: | |
Format: | Electronic eBook |
Language: | English |
Published: | Birmingham, UK : Packt Publishing, 2017. |
Subjects: | |
Online access: | Full text (requires prior registration with an institutional email address) |
Table of Contents:
- Cover
- Copyright
- Credits
- About the Author
- www.PacktPub.com
- Customer Feedback
- Table of Contents
- Preface
- Chapter 1: Getting Started with Spark
- Getting set up - installing Python, a JDK, and Spark and its dependencies
- Installing Enthought Canopy
- Installing the Java Development Kit
- Installing Spark
- Running Spark code
- Installing the MovieLens movie rating dataset
- Run your first Spark program - the ratings histogram example
- Examining the ratings counter script
- Running the ratings counter script
- Summary
- Chapter 2: Spark Basics and Spark Examples
- What is Spark?
- Spark is scalable
- Spark is fast
- Spark is hot
- Spark is not that hard
- Components of Spark
- Using Python with Spark
- The Resilient Distributed Dataset (RDD)
- What is the RDD?
- The SparkContext object
- Creating RDDs
- Transforming RDDs
- Map example
- RDD actions
- Ratings histogram walk-through
- Understanding the code
- Setting up the SparkContext object
- Loading the data
- Extract (MAP) the data we care about
- Perform an action - count by value
- Sort and display the results
- Looking at the ratings-counter script in Canopy
- Key/value RDDs and the average friends by age example
- Key/value concepts
- RDDs can hold key/value pairs
- Creating a key/value RDD
- What Spark can do with key/value data?
- Mapping the values of a key/value RDD
- The friends by age example
- Parsing (mapping) the input data
- Counting up the sum of friends and number of entries per age
- Compute averages
- Collect and display the results
- Running the average friends by age example
- Examining the script
- Running the code
- Filtering RDDs and the minimum temperature by location example
- What is filter()
- The source data for the minimum temperature by location example
- Parse (map) the input data
- Filter out all but the TMIN entries
- Create (station ID, temperature) key/value pairs
- Find minimum temperature by station ID
- Collect and print results
- Running the minimum temperature example and modifying it for maximums
- Examining the min-temperatures script
- Running the script
- Running the maximum temperature by location example
- Counting word occurrences using flatmap()
- Map versus flatmap
- Map ()
- Flatmap ()
- Code sample - count the words in a book
- Improving the word-count script with regular expressions
- Text normalization
- Examining the use of regular expressions in the word-count script
- Running the code
- Sorting the word count results
- Step 1 - Implement countByValue() the hard way to create a new RDD
- Step 2 - Sort the new RDD
- Examining the script
- Running the code
- Find the total amount spent by customer
- Introducing the problem
- Strategy for solving the problem
- Useful snippets of code
- Check your results and sort them by the total amount spent
- Check your sorted implementation and results against mine
- Summary
- Chapter 3: Advanced Examples of Spark Programs
- Finding the most popular movie
- Examining the popular-movies script
- Getting results
- Using broadcast variables to display movie names instead of ID numbers
- Introducing broadcast variables
- Examining the popular-movies-nicer.py script
- Getting results
- Finding the most popular superhero in a social graph
- Superhero social networks
- Input data format
- Strategy
- Running the script - discover who the most popular superhero is
- Mapping input data to (hero ID, number of co-occurrences) per line
- Adding up co-occurrence by hero ID
- Flipping the (map) RDD to (number, hero ID)
- Using max() and looking up the name of the winner
- Getting results
- Superhero degrees of separation - introducing the breadth-first search algorithm
- Degrees of separation
- How the breadth-first search algorithm works?
- The initial condition of our social graph
- First pass through the graph
- Second pass through the graph
- Third pass through the graph
- Final pass through the graph
- Accumulators and implementing BFS in Spark
- Convert the input file into structured data
- Writing code to convert Marvel-Graph.txt to BFS nodes
- Iteratively process the RDD
- Using a mapper and a reducer
- How do we know when we're done?
- Superhero degrees of separation - review the code and run it
- Setting up an accumulator and using the convert to BFS function
- Calling flatMap()
- Calling an action
- Calling reduceByKey
- Getting results
- Item-based collaborative filtering in Spark, cache(), and persist()
- How does item-based collaborative filtering work?
- Making item-based collaborative filtering a Spark problem
- It's getting real
- Caching RDDs
- Running the similar-movies script using Spark's cluster manager
- Examining the script
- Getting results
- Improving the quality of the similar movies example
- Summary
- Chapter 4: Running Spark on a Cluster
- Introducing Elastic MapReduce
- Why use Elastic MapReduce?
- Warning - Spark on EMR is not cheap
- Setting up our Amazon Web Services / Elastic MapReduce account and PuTTY
- Partitioning
- Using .partitionBy()
- Choosing a partition size
- Creating similar movies from one million ratings - part 1
- Changes to the script
- Creating similar movies from one million ratings - part 2
- Our strategy
- Specifying memory per executor
- Specifying a cluster manager
- Running on a cluster
- Setting up to run the movie-similarities-1m.py script on a cluster
- Preparing the script
- Creating a cluster
- Connecting to the master node using SSH
- Running the code
- Creating similar movies from one million ratings - part 3
- Assessing the results
- Terminating the cluster
- Troubleshooting Spark on a cluster
- More troubleshooting and managing dependencies
- Troubleshooting
- Managing dependencies
- Summary
- Chapter 5: SparkSQL, DataFrames, and DataSets
- Introducing SparkSQL
- Using SparkSQL in Python
- More things you can do with DataFrames
- Differences between DataFrames and DataSets
- Shell access in SparkSQL
- User-defined functions (UDFs)
- Executing SQL commands and SQL-style functions on a DataFrame
- Using SQL-style functions instead of queries
- Using DataFrames instead of RDDs
- Summary
- Chapter 6: Other Spark Technologies and Libraries
- Introducing MLlib
- MLlib capabilities
- Special MLlib data types
- For more information on machine learning
- Making movie recommendations
- Using MLlib to produce movie recommendations
- Examining the movie-recommendations-als.py script
- Analyzing the ALS recommendations results
- Why did we get bad results?
- Using DataFrames with MLlib
- Examining the spark-linear-regression.py script
- Getting results
- Spark Streaming and GraphX
- What is Spark Streaming?
- GraphX
- Summary
- Chapter 7: Where to Go From Here?
- Learning More About Spark and Data Science
- Index.