
Learning PySpark : build data-intensive applications locally and deploy at scale using the combined powers of Python and Spark 2.0 /

Bibliographic Details
Classification: Electronic Book
Main Authors: Drabas, Tomasz (Author), Lee, Denny (Author)
Other Authors: Karau, Holden (writer of foreword)
Format: Electronic eBook
Language: English
Published: Birmingham, UK : Packt Publishing, 2017.
Subjects:
Online Access: Full text

MARC

LEADER 00000cam a2200000Ii 4500
001 EBSCO_ocn976408019
003 OCoLC
005 20231017213018.0
006 m o d
007 cr unu||||||||
008 170317s2017 enkab o 001 0 eng d
040 |a UMI  |b eng  |e rda  |e pn  |c UMI  |d TEFOD  |d OCLCF  |d IDEBK  |d STF  |d TOH  |d OCLCQ  |d N$T  |d COO  |d UOK  |d CEF  |d KSU  |d DEBBG  |d UAB  |d YDX  |d MOF  |d AU@  |d OCLCO  |d OCLCQ 
019 |a 1081417339 
020 |a 9781786466259  |q (electronic bk.) 
020 |a 1786466252  |q (electronic bk.) 
020 |z 9781786463708 
020 |z 1786463709 
029 1 |a GBVCP  |b 897169743 
035 |a (OCoLC)976408019  |z (OCoLC)1081417339 
037 |a CL0500000840  |b Safari Books Online 
037 |a 978A042E-251E-4460-88A6-41FFF582EF91  |b OverDrive, Inc.  |n http://www.overdrive.com 
050 4 |a QA76.76.A65 
072 7 |a COM  |x 021030  |2 bisacsh 
082 0 4 |a 005.7  |2 23 
049 |a UAMI 
100 1 |a Drabas, Tomasz,  |e author. 
245 1 0 |a Learning PySpark :  |b build data-intensive applications locally and deploy at scale using the combined powers of Python and Spark 2.0 /  |c Tomasz Drabas, Denny Lee ; foreword by Holden Karau. 
264 1 |a Birmingham, UK :  |b Packt Publishing,  |c 2017. 
300 |a 1 online resource (1 volume) :  |b illustrations, maps 
336 |a text  |b txt  |2 rdacontent 
337 |a computer  |b c  |2 rdamedia 
338 |a online resource  |b cr  |2 rdacarrier 
588 |a Description based on online resource; title from title page (viewed March 17, 2017). 
500 |a Includes index. 
505 0 |a Cover -- Copyright -- Credits -- Foreword -- About the Authors -- About the Reviewer -- www.PacktPub.com -- Customer Feedback -- Table of Contents -- Preface -- Chapter 1: Understanding Spark -- What is Apache Spark? -- Spark Jobs and APIs -- Execution process -- Resilient Distributed Dataset -- DataFrames -- Datasets -- Catalyst Optimizer -- Project Tungsten -- Spark 2.0 architecture -- Unifying Datasets and DataFrames -- Introducing SparkSession -- Tungsten phase 2 -- Structured streaming -- Continuous applications -- Summary -- Chapter 2: Resilient Distributed Datasets -- Internal workings of an RDD -- Creating RDDs -- Schema -- Reading from files -- Lambda expressions -- Global versus local scope -- Transformations -- The .map(...) transformation -- The .filter(...) transformation -- The .flatMap(...) transformation -- The .distinct(...) transformation -- The .sample(...) transformation -- The .leftOuterJoin(...) transformation -- The .repartition(...) transformation -- Actions -- The .take(...) method -- The .collect(...) method -- The .reduce(...) method -- The .count(...) method -- The .saveAsTextFile(...) method -- The .foreach(...) method -- Summary -- Chapter 3: DataFrames -- Python to RDD communications -- Catalyst Optimizer refresh -- Speeding up PySpark with DataFrames -- Creating DataFrames -- Generating our own JSON data -- Creating a DataFrame -- Creating a temporary table -- Simple DataFrame queries -- DataFrame API query -- SQL query -- Interoperating with RDDs -- Inferring the schema using reflection -- Programmatically specifying the schema -- Querying with the DataFrame API -- Number of rows -- Running filter statements -- Querying with SQL -- Number of rows -- Running filter statements using the where Clauses -- DataFrame scenario -- on-time flight performance -- Preparing the source datasets. 
505 8 |a Joining flight performance and airports -- Visualizing our flight-performance data -- Spark Dataset API -- Summary -- Chapter 4: Prepare Data for Modeling -- Checking for duplicates, missing observations, and outliers -- Duplicates -- Missing observations -- Outliers -- Getting familiar with your data -- Descriptive statistics -- Correlations -- Visualization -- Histograms -- Interactions between features -- Summary -- Chapter 5: Introducing MLlib -- Overview of the package -- Loading and transforming the data -- Getting to know your data -- Descriptive statistics -- Correlations -- Statistical testing -- Creating the final dataset -- Creating an RDD of LabeledPoints -- Splitting into training and testing -- Predicting infant survival -- Logistic regression in MLlib -- Selecting only the most predictable features -- Random forest in MLlib -- Summary -- Chapter 6: Introducing the ML Package -- Overview of the package -- Transformer -- Estimators -- Classification -- Regression -- Clustering -- Pipeline -- Predicting the chances of infant survival with ML -- Loading the data -- Creating transformers -- Creating an estimator -- Creating a pipeline -- Fitting the model -- Evaluating the performance of the model -- Saving the model -- Parameter hyper-tuning -- Grid search -- Train-validation splitting -- Other features of PySpark ML in action -- Feature extraction -- NLP -- related feature extractors -- Discretizing continuous variables -- Standardizing continuous variables -- Classification -- Clustering -- Finding clusters in the births dataset -- Topic mining -- Regression -- Summary -- Chapter 7: GraphFrames -- Introducing GraphFrames -- Installing GraphFrames -- Creating a library -- Preparing your flights dataset -- Building the graph -- Executing simple queries -- Determining the number of airports and trips. 
505 8 |a Determining the longest delay in this dataset -- Determining the number of delayed versus on-time/early flights -- What flights departing Seattle are most likely to have significant delays? -- What states tend to have significant delays departing from Seattle? -- Understanding vertex degrees -- Determining the top transfer airports -- Understanding motifs -- Determining airport ranking using PageRank -- Determining the most popular non-stop flights -- Using Breadth-First Search -- Visualizing flights using D3 -- Summary -- Chapter 8: TensorFrames -- What is Deep Learning? -- The need for neural networks and Deep Learning -- What is feature engineering? -- Bridging the data and algorithm -- What is TensorFlow? -- Installing Pip -- Installing TensorFlow -- Matrix multiplication using constants -- Matrix multiplication using placeholders -- Running the model -- Running another model -- Discussion -- Introducing TensorFrames -- TensorFrames -- quick start -- Configuration and setup -- Launching a Spark cluster -- Creating a TensorFrames library -- Installing TensorFlow on your cluster -- Using TensorFlow to add a constant to an existing column -- Executing the Tensor graph -- Blockwise reducing operations example -- Building a DataFrame of vectors -- Analysing the DataFrame -- Computing elementwise sum and min of all vectors -- Summary -- Chapter 9: Polyglot Persistence with Blaze -- Installing Blaze -- Polyglot persistence -- Abstracting data -- Working with NumPy arrays -- Working with pandas' DataFrame -- Working with files -- Working with databases -- Interacting with relational databases -- Interacting with the MongoDB database -- Data operations -- Accessing columns -- Symbolic transformations -- Operations on columns -- Reducing data -- Joins -- Summary -- Chapter 10: Structured Streaming -- What is Spark Streaming?. 
505 8 |a Why do we need Spark Streaming? -- What is the Spark Streaming application data flow? -- Simple streaming application using DStreams -- A quick primer on global aggregations -- Introducing Structured Streaming -- Summary -- Chapter 11: Packaging Spark Applications -- The spark-submit command -- Command line parameters -- Deploying the app programmatically -- Configuring your SparkSession -- Creating SparkSession -- Modularizing code -- Structure of the module -- Calculating the distance between two points -- Converting distance units -- Building an egg -- User defined functions in Spark -- Submitting a job -- Monitoring execution -- Databricks Jobs -- Summary -- Index. 
520 8 |a Annotation  |b Build data-intensive applications locally and deploy at scale using the combined powers of Python and Spark 2.0 About This Book - Learn why and how you can efficiently use Python to process data and build machine learning models in Apache Spark 2.0 - Develop and deploy efficient, scalable real-time Spark solutions - Take your understanding of using Spark with Python to the next level with this jump start guide Who This Book Is For If you are a Python developer who wants to learn about the Apache Spark 2.0 ecosystem, this book is for you. A firm understanding of Python is expected to get the best out of the book. Familiarity with Spark would be useful, but is not mandatory. What You Will Learn - Learn about Apache Spark and the Spark 2.0 architecture - Build and interact with Spark DataFrames using Spark SQL - Learn how to solve graph and deep learning problems using GraphFrames and TensorFrames respectively - Read, transform, and understand data and use it to train machine learning models - Build machine learning models with MLlib and ML - Learn how to submit your applications programmatically using spark-submit - Deploy locally built applications to a cluster In Detail Apache Spark is an open source framework for efficient cluster computing with a strong interface for data parallelism and fault tolerance. This book will show you how to leverage the power of Python and put it to use in the Spark ecosystem. You will start by getting a firm understanding of the Spark 2.0 architecture and how to set up a Python environment for Spark. You will get familiar with the modules available in PySpark. You will learn how to abstract data with RDDs and DataFrames and understand the streaming capabilities of PySpark. Also, you will get a thorough overview of machine learning capabilities of PySpark using ML and MLlib, graph processing using GraphFrames, and polyglot persistence using Blaze. 
Finally, you will learn how to deploy your applications to the cloud using the spark-submit command. By the end of this book, you will have established a firm understanding of the Spark Python API and how it can be used to build data-intensive applications. Style and approach This book takes a very comprehensive, step-by-step approach so you understand how the Spark ecosystem can be used with Python to develop efficient, scalable solutions. Every chapter is standalone and written in a very easy-to-understand manner, with a focus on both the hows and the whys of each concept. 
590 |a O'Reilly  |b O'Reilly Online Learning: Academic/Public Library Edition 
590 |a eBooks on EBSCOhost  |b EBSCO eBook Subscription Academic Collection - Worldwide 
650 0 |a Application software  |x Development. 
650 0 |a Python (Computer program language) 
650 0 |a SPARK (Computer program language) 
650 6 |a Logiciels d'application  |x Développement. 
650 6 |a Python (Langage de programmation) 
650 7 |a COMPUTERS  |x Databases  |x Data Mining.  |2 bisacsh 
650 7 |a Application software  |x Development.  |2 fast  |0 (OCoLC)fst00811707 
650 7 |a Python (Computer program language)  |2 fast  |0 (OCoLC)fst01084736 
650 7 |a SPARK (Computer program language)  |2 fast  |0 (OCoLC)fst01922197 
700 1 |a Lee, Denny,  |e author. 
700 1 |a Karau, Holden,  |e writer of foreword. 
776 0 8 |i Print version:  |a Drabas, Tomasz.  |t Learning PySpark.  |d Birmingham : Packt Publishing, ©2017 
856 4 0 |u https://learning.oreilly.com/library/view/~/9781786463708/?ar  |z Full text 
856 4 0 |u https://ebsco.uam.elogim.com/login.aspx?direct=true&scope=site&db=nlebk&AN=1477650  |z Full text 
938 |a ProQuest MyiLibrary Digital eBook Collection  |b IDEB  |n cis35945158 
938 |a EBSCOhost  |b EBSC  |n 1477650 
938 |a YBP Library Services  |b YANK  |n 13522893 
994 |a 92  |b IZTAP