Distributed machine learning with Python : accelerating model training and serving with distributed systems

Chapter 2: Parameter Server and All-Reduce -- Technical requirements -- Parameter server architecture -- Communication bottleneck in the parameter server architecture -- Sharding the model among parameter servers -- Implementing the parameter server -- Defining model layers -- Defining the parameter...

Bibliographic Details
Call Number: Electronic book
Main Author: Wang, Guanhua
Format: Electronic eBook
Language: English
Published: Birmingham : Packt Publishing, Limited, 2022.
Subjects: Machine learning; Python (Computer program language)
Online Access: Full text (requires prior registration with an institutional email)

MARC

LEADER 00000cam a22000007a 4500
001 OR_on1312162521
003 OCoLC
005 20231017213018.0
006 m o d
007 cr cnu---unuuu
008 220423s2022 enka o 000 0 eng d
040 |a EBLCP  |b eng  |e pn  |c EBLCP  |d ORMDA  |d OCLCO  |d UKMGB  |d OCLCF  |d OCLCQ  |d N$T  |d UKAHL  |d OCLCQ  |d IEEEE 
015 |a GBC274179  |2 bnb 
016 7 |a 020566484  |2 Uk 
020 |a 1801817219 
020 |a 9781801817219  |q (electronic bk.) 
020 |z 9781801815697  |q (pbk.) 
029 1 |a AU@  |b 000071607833 
029 1 |a UKMGB  |b 020566484 
035 |a (OCoLC)1312162521 
037 |a 9781801815697  |b O'Reilly Media 
037 |a 10163213  |b IEEE 
050 4 |a Q325.5 
082 0 4 |a 006.3/1  |2 23/eng/20220503 
049 |a UAMI 
100 1 |a Wang, Guanhua. 
245 1 0 |a Distributed machine learning with Python :  |b accelerating model training and serving with distributed systems /  |c Guanhua Wang. 
260 |a Birmingham :  |b Packt Publishing, Limited,  |c 2022. 
300 |a 1 online resource (284 pages) :  |b color illustrations 
336 |a text  |b txt  |2 rdacontent 
337 |a computer  |b c  |2 rdamedia 
338 |a online resource  |b cr  |2 rdacarrier 
588 0 |a Print version record. 
505 0 |a Intro -- Title page -- Copyright and Credits -- Dedication -- Contributors -- Table of Contents -- Preface -- Section 1 -- Data Parallelism -- Chapter 1: Splitting Input Data -- Single-node training is too slow -- The mismatch between data loading bandwidth and model training bandwidth -- Single-node training time on popular datasets -- Accelerating the training process with data parallelism -- Data parallelism -- the high-level bits -- Stochastic gradient descent -- Model synchronization -- Hyperparameter tuning -- Global batch size -- Learning rate adjustment -- Model synchronization schemes 
520 |a Chapter 2: Parameter Server and All-Reduce -- Technical requirements -- Parameter server architecture -- Communication bottleneck in the parameter server architecture -- Sharding the model among parameter servers -- Implementing the parameter server -- Defining model layers -- Defining the parameter server -- Defining the worker -- Passing data between the parameter server and worker -- Issues with the parameter server -- The parameter server architecture introduces a high coding complexity for practitioners -- All-Reduce architecture -- Reduce -- All-Reduce -- Ring All-Reduce. 
505 8 |a Collective communication -- Broadcast -- Gather -- All-Gather -- Summary -- Chapter 3: Building a Data Parallel Training and Serving Pipeline -- Technical requirements -- The data parallel training pipeline in a nutshell -- Input pre-processing -- Input data partition -- Data loading -- Training -- Model synchronization -- Model update -- Single-machine multi-GPUs and multi-machine multi-GPUs -- Single-machine multi-GPU -- Multi-machine multi-GPU -- Checkpointing and fault tolerance -- Model checkpointing -- Load model checkpoints -- Model evaluation and hyperparameter tuning 
505 8 |a Model serving in data parallelism -- Summary -- Chapter 4: Bottlenecks and Solutions -- Communication bottlenecks in data parallel training -- Analyzing the communication workloads -- Parameter server architecture -- The All-Reduce architecture -- The inefficiency of state-of-the-art communication schemes -- Leveraging idle links and host resources -- Tree All-Reduce -- Hybrid data transfer over PCIe and NVLink -- On-device memory bottlenecks -- Recomputation and quantization -- Recomputation -- Quantization -- Summary -- Section 2 -- Model Parallelism -- Chapter 5: Splitting the Model 
505 8 |a Technical requirements -- Single-node training error -- out of memory -- Fine-tuning BERT on a single GPU -- Trying to pack a giant model inside one state-of-the-art GPU -- ELMo, BERT, and GPT -- Basic concepts -- RNN -- ELMo -- BERT -- GPT -- Pre-training and fine-tuning -- State-of-the-art hardware -- P100, V100, and DGX-1 -- NVLink -- A100 and DGX-2 -- NVSwitch -- Summary -- Chapter 6: Pipeline Input and Layer Split -- Vanilla model parallelism is inefficient -- Forward propagation -- Backward propagation -- GPU idle time between forward and backward propagation -- Pipeline input 
500 |a Pros and cons of pipeline parallelism. 
590 |a O'Reilly  |b O'Reilly Online Learning: Academic/Public Library Edition 
650 0 |a Machine learning. 
650 0 |a Python (Computer program language) 
650 6 |a Apprentissage automatique. 
650 6 |a Python (Langage de programmation) 
650 7 |a Machine learning.  |2 fast  |0 (OCoLC)fst01004795 
650 7 |a Python (Computer program language)  |2 fast  |0 (OCoLC)fst01084736 
776 0 8 |i Print version:  |a Wang, Guanhua.  |t Distributed Machine Learning with Python.  |d Birmingham : Packt Publishing, Limited, ©2022 
856 4 0 |u https://learning.oreilly.com/library/view/~/9781801815697/?ar  |z Full text (requires prior registration with an institutional email) 
938 |a Askews and Holts Library Services  |b ASKH  |n AH39813577 
938 |a ProQuest Ebook Central  |b EBLB  |n EBL6956758 
938 |a EBSCOhost  |b EBSC  |n 3242106 
994 |a 92  |b IZTAP