Hadoop: The Definitive Guide

Get ready to unlock the power of your data. With the fourth edition of this comprehensive guide, you’ll learn how to build and maintain reliable, scalable, distributed systems with Apache Hadoop. This book is ideal for programmers looking to analyze datasets of any size, and for administrators who w...

Bibliographic Details
Classification: Electronic Book
Main author: White, Tom (Tom E.) (Author)
Format: Electronic eBook
Language: English
Published: Sebastopol, CA : O'Reilly Media, 2015.
Edition: 4th edition.
Subjects:
Online access: Full text (requires prior registration with an institutional email address)
Table of Contents:
  • Cover
  • Copyright
  • Table of Contents
  • Foreword
  • Preface
  • Administrative Notes
  • What's New in the Fourth Edition?
  • What's New in the Third Edition?
  • What's New in the Second Edition?
  • Conventions Used in This Book
  • Using Code Examples
  • Safari® Books Online
  • How to Contact Us
  • Acknowledgments
  • Part I. Hadoop Fundamentals
  • Chapter 1. Meet Hadoop
  • Data!
  • Data Storage and Analysis
  • Querying All Your Data
  • Beyond Batch
  • Comparison with Other Systems
  • Relational Database Management Systems
  • Grid Computing
  • Volunteer Computing
  • A Brief History of Apache Hadoop
  • What's in This Book?
  • Chapter 2. MapReduce
  • A Weather Dataset
  • Data Format
  • Analyzing the Data with Unix Tools
  • Analyzing the Data with Hadoop
  • Map and Reduce
  • Java MapReduce
  • Scaling Out
  • Data Flow
  • Combiner Functions
  • Running a Distributed MapReduce Job
  • Hadoop Streaming
  • Ruby
  • Python
  • Chapter 3. The Hadoop Distributed Filesystem
  • The Design of HDFS
  • HDFS Concepts
  • Blocks
  • Namenodes and Datanodes
  • Block Caching
  • HDFS Federation
  • HDFS High Availability
  • The Command-Line Interface
  • Basic Filesystem Operations
  • Hadoop Filesystems
  • Interfaces
  • The Java Interface
  • Reading Data from a Hadoop URL
  • Reading Data Using the FileSystem API
  • Writing Data
  • Directories
  • Querying the Filesystem
  • Deleting Data
  • Data Flow
  • Anatomy of a File Read
  • Anatomy of a File Write
  • Coherency Model
  • Parallel Copying with distcp
  • Keeping an HDFS Cluster Balanced
  • Chapter 4. YARN
  • Anatomy of a YARN Application Run
  • Resource Requests
  • Application Lifespan
  • Building YARN Applications
  • YARN Compared to MapReduce 1
  • Scheduling in YARN
  • Scheduler Options
  • Capacity Scheduler Configuration
  • Fair Scheduler Configuration
  • Delay Scheduling
  • Dominant Resource Fairness
  • Further Reading
  • Chapter 5. Hadoop I/O
  • Data Integrity
  • Data Integrity in HDFS
  • LocalFileSystem
  • ChecksumFileSystem
  • Compression
  • Codecs
  • Compression and Input Splits
  • Using Compression in MapReduce
  • Serialization
  • The Writable Interface
  • Writable Classes
  • Implementing a Custom Writable
  • Serialization Frameworks
  • File-Based Data Structures
  • SequenceFile
  • MapFile
  • Other File Formats and Column-Oriented Formats
  • Part II. MapReduce
  • Chapter 6. Developing a MapReduce Application
  • The Configuration API
  • Combining Resources
  • Variable Expansion
  • Setting Up the Development Environment
  • Managing Configuration
  • GenericOptionsParser, Tool, and ToolRunner
  • Writing a Unit Test with MRUnit
  • Mapper
  • Reducer
  • Running Locally on Test Data
  • Running a Job in a Local Job Runner
  • Testing the Driver
  • Running on a Cluster
  • Packaging a Job
  • Launching a Job
  • The MapReduce Web UI
  • Retrieving the Results
  • Debugging a Job
  • Hadoop Logs
  • Remote Debugging
  • Tuning a Job
  • Profiling Tasks
  • MapReduce Workflows
  • Decomposing a Problem into MapReduce Jobs
  • JobControl
  • Apache Oozie
  • Chapter 7. How MapReduce Works
  • Anatomy of a MapReduce Job Run
  • Job Submission
  • Job Initialization
  • Task Assignment
  • Task Execution
  • Progress and Status Updates
  • Job Completion
  • Failures
  • Task Failure
  • Application Master Failure
  • Node Manager Failure
  • Resource Manager Failure
  • Shuffle and Sort
  • The Map Side
  • The Reduce Side
  • Configuration Tuning
  • Task Execution
  • The Task Execution Environment
  • Speculative Execution
  • Output Committers
  • Chapter 8. MapReduce Types and Formats
  • MapReduce Types
  • The Default MapReduce Job
  • Input Formats
  • Input Splits and Records
  • Text Input
  • Binary Input
  • Multiple Inputs
  • Database Input (and Output)
  • Output Formats
  • Text Output
  • Binary Output
  • Multiple Outputs
  • Lazy Output
  • Database Output
  • Chapter 9. MapReduce Features
  • Counters
  • Built-in Counters
  • User-Defined Java Counters
  • User-Defined Streaming Counters
  • Sorting
  • Preparation
  • Partial Sort
  • Total Sort
  • Secondary Sort
  • Joins
  • Map-Side Joins
  • Reduce-Side Joins
  • Side Data Distribution
  • Using the Job Configuration
  • Distributed Cache
  • MapReduce Library Classes
  • Part III. Hadoop Operations
  • Chapter 10. Setting Up a Hadoop Cluster
  • Cluster Specification
  • Cluster Sizing
  • Network Topology
  • Cluster Setup and Installation
  • Installing Java
  • Creating Unix User Accounts
  • Installing Hadoop
  • Configuring SSH
  • Configuring Hadoop
  • Formatting the HDFS Filesystem
  • Starting and Stopping the Daemons
  • Creating User Directories
  • Hadoop Configuration
  • Configuration Management
  • Environment Settings
  • Important Hadoop Daemon Properties
  • Hadoop Daemon Addresses and Ports
  • Other Hadoop Properties
  • Security
  • Kerberos and Hadoop
  • Delegation Tokens
  • Other Security Enhancements
  • Benchmarking a Hadoop Cluster
  • Hadoop Benchmarks
  • User Jobs
  • Chapter 11. Administering Hadoop
  • HDFS
  • Persistent Data Structures
  • Safe Mode
  • Audit Logging
  • Tools
  • Monitoring
  • Logging
  • Metrics and JMX
  • Maintenance
  • Routine Administration Procedures
  • Commissioning and Decommissioning Nodes
  • Upgrades
  • Part IV. Related Projects
  • Chapter 12. Avro
  • Avro Data Types and Schemas
  • In-Memory Serialization and Deserialization
  • The Specific API
  • Avro Datafiles
  • Interoperability
  • Python API
  • Avro Tools
  • Schema Resolution
  • Sort Order
  • Avro MapReduce
  • Sorting Using Avro MapReduce
  • Avro in Other Languages
  • Chapter 13. Parquet
  • Data Model
  • Nested Encoding
  • Parquet File Format
  • Parquet Configuration
  • Writing and Reading Parquet Files
  • Avro, Protocol Buffers, and Thrift
  • Parquet MapReduce
  • Chapter 14. Flume
  • Installing Flume
  • An Example
  • Transactions and Reliability
  • Batching
  • The HDFS Sink
  • Partitioning and Interceptors
  • File Formats
  • Fan Out
  • Delivery Guarantees
  • Replicating and Multiplexing Selectors
  • Distribution: Agent Tiers
  • Delivery Guarantees
  • Sink Groups
  • Integrating Flume with Applications
  • Component Catalog
  • Further Reading
  • Chapter 15. Sqoop
  • Getting Sqoop
  • Sqoop Connectors
  • A Sample Import
  • Text and Binary File Formats
  • Generated Code
  • Additional Serialization Systems
  • Imports: A Deeper Look
  • Controlling the Import
  • Imports and Consistency
  • Incremental Imports
  • Direct-Mode Imports
  • Working with Imported Data
  • Imported Data and Hive
  • Importing Large Objects
  • Performing an Export
  • Exports: A Deeper Look
  • Exports and Transactionality
  • Exports and SequenceFiles
  • Further Reading
  • Chapter 16. Pig
  • Installing and Running Pig
  • Execution Types
  • Running Pig Programs
  • Grunt
  • Pig Latin Editors
  • An Example
  • Generating Examples
  • Comparison with Databases
  • Pig Latin
  • Structure
  • Statements
  • Expressions
  • Types
  • Schemas
  • Functions
  • Macros
  • User-Defined Functions
  • A Filter UDF
  • An Eval UDF
  • A Load UDF
  • Data Processing Operators
  • Loading and Storing Data
  • Filtering Data
  • Grouping and Joining Data
  • Sorting Data
  • Combining and Splitting Data
  • Pig in Practice
  • Parallelism
  • Anonymous Relations
  • Parameter Substitution
  • Further Reading
  • Chapter 17. Hive
  • Installing Hive
  • The Hive Shell
  • An Example
  • Running Hive
  • Configuring Hive
  • Hive Services
  • The Metastore
  • Comparison with Traditional Databases
  • Schema on Read Versus Schema on Write
  • Updates, Transactions, and Indexes
  • SQL-on-Hadoop Alternatives
  • HiveQL
  • Data Types
  • Operators and Functions
  • Tables
  • Managed Tables and External Tables
  • Partitions and Buckets
  • Storage Formats
  • Importing Data
  • Altering Tables
  • Dropping Tables
  • Querying Data
  • Sorting and Aggregating
  • MapReduce Scripts
  • Joins
  • Subqueries
  • Views
  • User-Defined Functions
  • Writing a UDF
  • Writing a UDAF
  • Further Reading
  • Chapter 18. Crunch
  • An Example
  • The Core Crunch API
  • Primitive Operations
  • Types
  • Sources and Targets
  • Functions
  • Materialization
  • Pipeline Execution
  • Running a Pipeline
  • Stopping a Pipeline
  • Inspecting a Crunch Plan
  • Iterative Algorithms
  • Checkpointing a Pipeline
  • Crunch Libraries
  • Further Reading
  • Chapter 19. Spark
  • Installing Spark
  • An Example
  • Spark Applications, Jobs, Stages, and Tasks
  • A Scala Standalone Application
  • A Java Example
  • A Python Example
  • Resilient Distributed Datasets
  • Creation
  • Transformations and Actions
  • Persistence
  • Serialization
  • Shared Variables
  • Broadcast Variables
  • Accumulators
  • Anatomy of a Spark Job Run
  • Job Submission
  • DAG Construction
  • Task Scheduling
  • Task Execution
  • Executors and Cluster Managers
  • Spark on YARN
  • Further Reading
  • Chapter 20. HBase
  • HBasics
  • Backdrop
  • Concepts
  • Whirlwind Tour of the Data Model
  • Implementation
  • Installation
  • Test Drive
  • Clients
  • Java
  • MapReduce
  • REST and Thrift
  • Building an Online Query Application
  • Schema Design
  • Loading Data
  • Online Queries
  • HBase Versus RDBMS
  • Successful Service
  • HBase
  • Praxis
  • HDFS
  • UI
  • Metrics
  • Counters
  • Further Reading
  • Chapter 21. ZooKeeper
  • Installing and Running ZooKeeper
  • An Example
  • Group Membership in ZooKeeper
  • Creating the Group
  • Joining a Group
  • Listing Members in a Group
  • Deleting a Group
  • The ZooKeeper Service
  • Data Model
  • Operations
  • Implementation