Spark in Action: Covers Apache Spark 3 with Examples in Java, Python, and Scala.
Spark in Action, Second Edition, teaches you to create end-to-end analytics applications. In this entirely new book, you'll learn from interesting Java-based examples, including a complete data pipeline for processing NASA satellite data. And you'll discover Java, Python, and Scala code s...
Classification: | Electronic Book |
---|---|
Main author: | |
Format: | Electronic eBook |
Language: | English |
Published: | New York : Manning Publications Co. LLC, 2020. |
Series: | ITpro collection |
Subjects: | |
Online access: | Full text (requires prior registration with an institutional email address) |

Table of Contents:
- Intro
- Copyright
- brief contents
- contents
- front matter
- foreword
- The analytics operating system
- preface
- acknowledgments
- about this book
- Who should read this book
- What will you learn in this book?
- How this book is organized
- About the code
- liveBook discussion forum
- about the author
- about the cover illustration
- Part 1. The theory crippled by awesome examples
- 1. So, what is Spark, anyway?
- 1.1 The big picture: What Spark is and what it does
- 1.1.1 What is Spark?
- 1.1.2 The four pillars of mana
- 1.2 How can you use Spark?
- 1.2.1 Spark in a data processing/engineering scenario
- 1.2.2 Spark in a data science scenario
- 1.3 What can you do with Spark?
- 1.3.1 Spark predicts restaurant quality at NC eateries
- 1.3.2 Spark allows fast data transfer for Lumeris
- 1.3.3 Spark analyzes equipment logs for CERN
- 1.3.4 Other use cases
- 1.4 Why you will love the dataframe
- 1.4.1 The dataframe from a Java perspective
- 1.4.2 The dataframe from an RDBMS perspective
- 1.4.3 A graphical representation of the dataframe
- 1.5 Your first example
- 1.5.1 Recommended software
- 1.5.2 Downloading the code
- 1.5.3 Running your first application
- Command line
- Eclipse
- 1.5.4 Your first code
- Summary
- 2. Architecture and flow
- 2.1 Building your mental model
- 2.2 Using Java code to build your mental model
- 2.3 Walking through your application
- 2.3.1 Connecting to a master
- 2.3.2 Loading, or ingesting, the CSV file
- 2.3.3 Transforming your data
- 2.3.4 Saving the work done in your dataframe to a database
- Summary
- 3. The majestic role of the dataframe
- 3.1 The essential role of the dataframe in Spark
- 3.1.1 Organization of a dataframe
- 3.1.2 Immutability is not a swear word
- 3.2 Using dataframes through examples
- 3.2.1 A dataframe after a simple CSV ingestion
- 6.2.2 Setting up the environment
- 6.3 Building your application to run on the cluster
- 6.3.1 Building your application's uber JAR
- 6.3.2 Building your application by using Git and Maven
- 6.4 Running your application on the cluster
- 6.4.1 Submitting the uber JAR
- 6.4.2 Running the application
- 6.4.3 The Spark user interface
- Summary
- Part 2. Ingestion
- 7. Ingestion from files
- 7.1 Common behaviors of parsers
- 7.2 Complex ingestion from CSV
- 7.2.1 Desired output
- 7.2.2 Code
- 7.3 Ingesting a CSV with a known schema
- 7.3.1 Desired output
- 7.3.2 Code
- 7.4 Ingesting a JSON file
- 7.4.1 Desired output
- 7.4.2 Code
- 7.5 Ingesting a multiline JSON file
- 7.5.1 Desired output
- 7.5.2 Code
- 7.6 Ingesting an XML file
- 7.6.1 Desired output
- 7.6.2 Code
- 7.7 Ingesting a text file
- 7.7.1 Desired output
- 7.7.2 Code
- 7.8 File formats for big data
- 7.8.1 The problem with traditional file formats
- 7.8.2 Avro is a schema-based serialization format
- 7.8.3 ORC is a columnar storage format
- 7.8.4 Parquet is also a columnar storage format
- 7.8.5 Comparing Avro, ORC, and Parquet
- 7.9 Ingesting Avro, ORC, and Parquet files
- 7.9.1 Ingesting Avro
- 7.9.2 Ingesting ORC
- 7.9.3 Ingesting Parquet
- 7.9.4 Reference table for ingesting Avro, ORC, or Parquet
- Summary
- 8. Ingestion from databases
- 8.1 Ingestion from relational databases
- 8.1.1 Database connection checklist
- 8.1.2 Understanding the data used in the examples
- 8.1.3 Desired output
- 8.1.4 Code
- 8.1.5 Alternative code
- 8.2 The role of the dialect
- 8.2.1 What is a dialect, anyway?
- 8.2.2 JDBC dialects provided with Spark
- 8.2.3 Building your own dialect
- 8.3 Advanced queries and ingestion
- 8.3.1 Filtering by using a WHERE clause
- 8.3.2 Joining data in the database
- 8.3.3 Performing ingestion and partitioning
- 8.3.4 Summary of advanced features
- 8.4 Ingestion from Elasticsearch
- 8.4.1 Data flow
- 8.4.2 The New York restaurants dataset digested by Spark
- 8.4.3 Code to ingest the restaurant dataset from Elasticsearch
- Summary
- 9. Advanced ingestion: finding data sources and building your own
- 9.1 What is a data source?
- 9.2 Benefits of a direct connection to a data source
- 9.2.1 Temporary files
- 9.2.2 Data quality scripts
- 9.2.3 Data on demand
- 9.3 Finding data sources at Spark Packages
- 9.4 Building your own data source
- 9.4.1 Scope of the example project
- 9.4.2 Your data source API and options
- 9.5 Behind the scenes: Building the data source itself
- 9.6 Using the register file and the advertiser class
- 9.7 Understanding the relationship between the data and schema
- 9.7.1 The data source builds the relation
- 9.7.2 Inside the relation
- 9.8 Building the schema from a JavaBean
- 9.9 Building the dataframe is magic with the utilities
- 9.10 The other classes
- Summary
- 10. Ingestion through structured streaming
- 10.1 What's streaming?
- 10.2 Creating your first stream
- 10.2.1 Generating a file stream
- 10.2.2 Consuming the records
- 10.2.3 Getting records, not lines
- 10.3 Ingesting data from network streams
- 10.4 Dealing with multiple streams
- 10.5 Differentiating discretized and structured streaming
- Summary
- Part 3. Transforming your data
- 11. Working with SQL
- 11.1 Working with Spark SQL
- 11.2 The difference between local and global views
- 11.3 Mixing the dataframe API and Spark SQL
- 11.4 Don't DELETE it!
- 11.5 Going further with SQL
- Summary
- 12. Transforming your data
- 12.1 What is data transformation?
- 12.2 Process and example of record-level transformation
- 12.2.1 Data discovery to understand the complexity
- 12.2.2 Data mapping to draw the process
- 12.2.3 Writing the transformation code
- 12.2.4 Reviewing your data transformation to ensure a quality process
- What about sorting?
- Wrapping up your first Spark transformation
- 12.3 Joining datasets
- 12.3.1 A closer look at the datasets to join
- 12.3.2 Building the list of higher education institutions per county
- Initialization of Spark
- Loading and preparing the data
- 12.3.3 Performing the joins
- Joining the FIPS county identifier with the higher ed dataset using a join
- Joining the census data to get the county name
- 12.4 Performing more transformations
- Summary
- 13. Transforming entire documents
- 13.1 Transforming entire documents and their structure
- 13.1.1 Flattening your JSON document
- 13.1.2 Building nested documents for transfer and storage
- 13.2 The magic behind static functions
- 13.3 Performing more transformations
- Summary
- 14. Extending transformations with user-defined functions
- 14.1 Extending Apache Spark
- 14.2 Registering and calling a UDF
- 14.2.1 Registering the UDF with Spark
- 14.2.2 Using the UDF with the dataframe API
- 14.2.3 Manipulating UDFs with SQL
- 14.2.4 Implementing the UDF
- 14.2.5 Writing the service itself
- 14.3 Using UDFs to ensure a high level of data quality
- 14.4 Considering UDFs' constraints
- Summary
- 15. Aggregating your data
- 15.1 Aggregating data with Spark
- 15.1.1 A quick reminder on aggregations
- 15.1.2 Performing basic aggregations with Spark
- Performing an aggregation using the dataframe API
- Performing an aggregation using Spark SQL
- 15.2 Performing aggregations with live data
- 15.2.1 Preparing your dataset
- 15.2.2 Aggregating data to better understand the schools
- What is the average enrollment for each school?
- What is the evolution of the number of students?
- What is the higher enrollment per school and year?