Distributed machine learning with Python: accelerating model training and serving with distributed systems
Chapter 2: Parameter Server and All-Reduce -- Technical requirements -- Parameter server architecture -- Communication bottleneck in the parameter server architecture -- Sharding the model among parameter servers -- Implementing the parameter server -- Defining model layers -- Defining the parameter...
Classification: Electronic book
Main author:
Format: Electronic eBook
Language: English
Published: Birmingham: Packt Publishing, Limited, 2022.
Subjects:
Online access: Full text (requires prior registration with an institutional email address)
Table of Contents:
- Intro
- Title page
- Copyright and Credits
- Dedication
- Contributors
- Table of Contents
- Preface
- Section 1 – Data Parallelism
- Chapter 1: Splitting Input Data
- Single-node training is too slow
- The mismatch between data loading bandwidth and model training bandwidth
- Single-node training time on popular datasets
- Accelerating the training process with data parallelism
- Data parallelism – the high-level bits
- Stochastic gradient descent
- Model synchronization
- Hyperparameter tuning
- Global batch size
- Learning rate adjustment
- Model synchronization schemes
- Collective communication
- Broadcast
- Gather
- All-Gather
- Summary
- Chapter 3: Building a Data Parallel Training and Serving Pipeline
- Technical requirements
- The data parallel training pipeline in a nutshell
- Input pre-processing
- Input data partition
- Data loading
- Training
- Model synchronization
- Model update
- Single-machine multi-GPUs and multi-machine multi-GPUs
- Single-machine multi-GPU
- Multi-machine multi-GPU
- Checkpointing and fault tolerance
- Model checkpointing
- Load model checkpoints
- Model evaluation and hyperparameter tuning
- Model serving in data parallelism
- Summary
- Chapter 4: Bottlenecks and Solutions
- Communication bottlenecks in data parallel training
- Analyzing the communication workloads
- Parameter server architecture
- The All-Reduce architecture
- The inefficiency of state-of-the-art communication schemes
- Leveraging idle links and host resources
- Tree All-Reduce
- Hybrid data transfer over PCIe and NVLink
- On-device memory bottlenecks
- Recomputation and quantization
- Recomputation
- Quantization
- Summary
- Section 2 – Model Parallelism
- Chapter 5: Splitting the Model
- Technical requirements
- Single-node training error – out of memory
- Fine-tuning BERT on a single GPU
- Trying to pack a giant model inside one state-of-the-art GPU
- ELMo, BERT, and GPT
- Basic concepts
- RNN
- ELMo
- BERT
- GPT
- Pre-training and fine-tuning
- State-of-the-art hardware
- P100, V100, and DGX-1
- NVLink
- A100 and DGX-2
- NVSwitch
- Summary
- Chapter 6: Pipeline Input and Layer Split
- Vanilla model parallelism is inefficient
- Forward propagation
- Backward propagation
- GPU idle time between forward and backward propagation
- Pipeline input