Apache Flume: Distributed Log Collection for Hadoop.

A starter guide that covers Apache Flume in detail. Apache Flume: Distributed Log Collection for Hadoop is intended for people who are responsible for moving datasets into Hadoop in a timely and reliable manner, such as software engineers, database administrators, and data warehouse administrators.

Bibliographic Details
Classification: Electronic Book
Main author: Hoffman, Steve
Format: Electronic eBook
Language: English
Published: Packt Publishing, 2013.
Series: Community experience distilled.
Subjects:
Online access: Full text (requires prior registration with an institutional email)
Table of Contents:
  • Cover; Copyright; Credits; About the Author; About the Reviewers; www.PacktPub.com; Table of Contents; Preface; Chapter 1: Overview and Architecture; Flume 0.9; Flume 1.X (Flume-NG); The problem with HDFS and streaming data/logs; Sources, channels, and sinks; Flume events; Interceptors, channel selectors, and sink processors; Tiered data collection (multiple flows and/or agents); Chapter 2: Flume Quick Start; Downloading Flume; Flume in Hadoop distributions; Flume configuration file overview; Starting up with Hello World
  • Summary; Chapter 3: Channels; Memory channel; File channel; Summary.
  • Chapter 4: Sinks and Sink Processors; HDFS sink; Path and filename; File rotation; Compression codecs; Event serializers; Text output; Text with headers; Apache Avro; File type; Sequence file; Data stream; Compressed stream; Timeouts and workers; Sink groups; Load balancing; Failover; Summary; Chapter 5: Sources and Channel Selectors; The problem with using tail; The exec source; The spooling directory source; Syslog sources; The syslog UDP source; The syslog TCP source; The multiport syslog TCP source; Channel selectors; Replicating; Multiplexing; Summary.
  • Chapter 6: Interceptors, ETL, and Routing; Interceptors; Timestamp; Host; Static; Regular expression filtering; Regular expression extractor; Custom interceptors; Tiering data flows; Avro Source/Sink; Command-line Avro; Log4J Appender; The Load Balancing Log4J Appender; Routing; Summary; Chapter 7: Monitoring Flume; Monitoring the agent process; Monit; Nagios; Monitoring performance metrics; Ganglia; The internal HTTP server; Custom monitoring hooks; Summary; Chapter 8: There Is No Spoon
  • The Realities of Real-time Distributed Data Collection; Transport time versus log time.
  • Time zones are evil; Capacity planning; Considerations for multiple data centers; Compliance and data expiry; Summary; Index.