Apache Flume : Distributed Log Collection for Hadoop.
A starter guide that covers Apache Flume in detail. Apache Flume: Distributed Log Collection for Hadoop is intended for people responsible for moving datasets into Hadoop in a timely and reliable manner, such as software engineers, database administrators, and data warehouse administrators.
Classification: | Electronic Book |
---|---|
Main author: | |
Format: | Electronic eBook |
Language: | English |
Published: | Packt Publishing, 2013. |
Series: | Community experience distilled. |
Subjects: | |
Online access: | Full text (requires prior registration with an institutional email address) |
Table of Contents:
- Cover; Copyright; Credits; About the Author; About the Reviewers; www.PacktPub.com; Table of Contents; Preface.
- Chapter 1: Overview and Architecture; Flume 0.9; Flume 1.X (Flume-NG); The problem with HDFS and streaming data/logs; Sources, channels, and sinks; Flume events; Interceptors, channel selectors, and sink processors; Tiered data collection (multiple flows and/or agents).
- Chapter 2: Flume Quick Start; Downloading Flume; Flume in Hadoop distributions; Flume configuration file overview; Starting up with Hello World; Summary.
- Chapter 3: Channels; Memory channel; File channel; Summary.
- Chapter 4: Sinks and Sink Processors; HDFS sink; Path and filename; File rotation; Compression codecs; Event serializers; Text output; Text with headers; Apache Avro; File type; Sequence file; Data stream; Compressed stream; Timeouts and workers; Sink groups; Load balancing; Failover; Summary.
- Chapter 5: Sources and Channel Selectors; The problem with using tail; The exec source; The spooling directory source; Syslog sources; The syslog UDP source; The syslog TCP source; The multiport syslog TCP source; Channel selectors; Replicating; Multiplexing; Summary.
- Chapter 6: Interceptors, ETL, and Routing; Interceptors; Timestamp; Host; Static; Regular expression filtering; Regular expression extractor; Custom interceptors; Tiering data flows; Avro Source/Sink; Command-line Avro; Log4J Appender; The Load Balancing Log4J Appender; Routing; Summary.
- Chapter 7: Monitoring Flume; Monitoring the agent process; Monit; Nagios; Monitoring performance metrics; Ganglia; The internal HTTP server; Custom monitoring hooks; Summary.
- Chapter 8: There Is No Spoon - The Realities of Real-time Distributed Data Collection; Transport time versus log time; Time zones are evil; Capacity planning; Considerations for multiple data centers; Compliance and data expiry; Summary.
- Index.