Hadoop backup and recovery solutions: learn the best strategies for data recovery from Hadoop backup clusters and troubleshoot problems
If you are a Hadoop administrator and you want to get a good grounding in how to back up large amounts of data and manage Hadoop clusters, then this book is for you.
Classification: Electronic book (eBook)
Format: Electronic eBook
Language: English
Published: Birmingham, UK: Packt Publishing, 2015
Series: Community experience distilled
Online access: Full text
Table of Contents:
- Cover
- Copyright
- Credits
- About the Authors
- About the Reviewers
- www.PacktPub.com
- Table of Contents
- Preface
- Chapter 1: Knowing Hadoop and Clustering Basics
- Understanding the need for Hadoop
- Apache Hive
- Apache Pig
- Apache HBase
- Apache HCatalog
- Understanding HDFS design
- Getting familiar with HDFS daemons
- Scenario 1 - writing data to the HDFS cluster
- Scenario 2 - reading data from the HDFS cluster
- Understanding the basics of a Hadoop cluster
- Summary
- Chapter 2: Understanding Hadoop Backup and Recovery Needs
- Understanding the backup and recovery philosophies
- Replication of data using DistCp
- Updating and overwriting using DistCp
- The backup philosophy
- Changes since the last backup
- The rate of new data arrival
- The size of the cluster
- Priority of the datasets
- Selecting the datasets or parts of datasets
- The timelines of data backups
- Reducing the window of possible data loss
- Backup consistency
- Avoiding invalid backups
- The recovery philosophy
- Knowing the necessity of backing up Hadoop
- Determining backup areas - what should I back up?
- Datasets
- Block size - a large file divided into blocks
- Replication factor
- A list of all the blocks of a file
- A list of DataNodes for each block - sorted by distance
- The ACK packet
- The checksums
- The number of under-replicated blocks
- The secondary NameNode
- Active and passive nodes in second generation Hadoop
- Hardware failure
- Software failure
- Applications
- Configurations
- Is taking backup enough?
- Understanding the disaster recovery principle
- Knowing a disaster
- The need for recovery
- Understanding recovery areas
- Summary
- Chapter 3: Determining Backup Strategies
- Knowing the areas to be protected
- Understanding the common failure types
- Hardware failure
- Host failure
- Using commodity hardware
- Hardware failures may lead to loss of data
- User application failure
- Software causing task failure
- Failure of slow-running tasks
- Hadoop's handling of failing tasks
- Task failure due to data
- Bad data handling - through code
- Hadoop's skip mode
- Learning a way to define the backup strategy
- Why do I need a strategy?
- What should be considered in a strategy?
- Filesystem check (fsck)
- Filesystem balancer
- Upgrading your Hadoop cluster
- Designing network layout and rack awareness
- Most important areas to consider while defining a backup strategy
- Understanding the need for backing up Hive metadata
- What is Hive?
- Hive replication
- Summary
- Chapter 4: Backing Up Hadoop
- Data backup in Hadoop
- Distributed copy