
Programming Pig.

This guide is an ideal learning tool and reference for Apache Pig, the open source engine for executing parallel data flows on Hadoop. With Pig, you can batch-process data without having to create a full-fledged application--making it easy for you to experiment with new datasets. Programming Pig int...


Bibliographic Details
Classification: Electronic Book
Main author: Gates, Alan
Format: Electronic eBook
Language: English
Published: Sebastopol : O'Reilly Media, 2011.
Subjects:
Online access: Full text (requires prior registration with an institutional email address)
Table of Contents:
  • Table of Contents; Preface; Data Addiction; Who Should Read This Book; Conventions Used in This Book; Code Examples in This Book; Using Code Examples; Safari® Books Online; How to Contact Us; Acknowledgments; Chapter 1. Introduction; What Is Pig?; Pig on Hadoop; MapReduce's hello world; Pig Latin, a Parallel Dataflow Language; Comparing query and dataflow languages; How Pig differs from MapReduce; What Is Pig Useful For?; Pig Philosophy; Pig's History; Chapter 2. Installing and Running Pig; Downloading and Installing Pig; Downloading the Pig Package from Apache; Downloading Pig from Cloudera.
  • Downloading Pig Artifacts from Maven; Downloading the Source; Running Pig; Running Pig Locally on Your Machine; Running Pig on Your Hadoop Cluster; Running Pig in the Cloud; Command-Line and Configuration Options; Return Codes; Chapter 3. Grunt; Entering Pig Latin Scripts in Grunt; HDFS Commands in Grunt; Controlling Pig from Grunt; Chapter 4. Pig's Data Model; Types; Scalar Types; Complex Types; Map; Tuple; Bag; Nulls; Schemas; Casts; Chapter 5. Introduction to Pig Latin; Preliminary Matters; Case Sensitivity; Comments; Input and Output; Load; Store; Dump; Relational Operations; foreach.
  • Expressions in foreach; UDFs in foreach; Naming fields in foreach; Filter; Group; Order by; Distinct; Join; Limit; Sample; Parallel; User Defined Functions; Registering UDFs; Registering Python UDFs; define and UDFs; Calling Static Java Functions; Chapter 6. Advanced Pig Latin; Advanced Relational Operations; Advanced Features of foreach; flatten; Nested foreach; Using Different Join Implementations; Joining small to large data; Joining skewed data; Joining sorted data; cogroup; union; cross; Integrating Pig with Legacy Code and MapReduce; stream; mapreduce; Nonlinear Data Flows.
  • Controlling Execution; set; Setting the Partitioner; Pig Latin Preprocessor; Parameter Substitution; Macros; Including Other Pig Latin Scripts; Chapter 7. Developing and Testing Pig Latin Scripts; Development Tools; Syntax Highlighting and Checking; describe; explain; illustrate; Pig Statistics; MapReduce Job Status; Debugging Tips; Testing Your Scripts with PigUnit; Chapter 8. Making Pig Fly; Writing Your Scripts to Perform Well; Filter Early and Often; Project Early and Often; Set Up Your Joins Properly; Use Multiquery When Possible; Choose the Right Data Type.
  • Select the Right Level of Parallelism; Writing Your UDF to Perform; Tune Pig and Hadoop for Your Job; Using Compression in Intermediate Results; Data Layout Optimization; Bad Record Handling; Chapter 9. Embedding Pig Latin in Python; Compile; Bind; Binding Multiple Sets of Variables; Run; Running Multiple Bindings; Utility Methods; Chapter 10. Writing Evaluation and Filter Functions; Writing an Evaluation Function in Java; Where Your UDF Will Run; Evaluation Function Basics; Interacting with Pig values; Input and Output Schemas; Error Handling and Progress Reporting.