CUDA Application Design and Development.

As the computer industry retools to leverage massively parallel graphics processing units (GPUs), this book is designed to meet the needs of working software developers who need to understand GPU programming with CUDA and increase efficiency in their projects. CUDA Application Design and Development...

Bibliographic Details
Classification:	Electronic Book
Main Author:	Farber, Rob
Format:	Electronic eBook
Language:	English
Published:	Burlington : Elsevier Science, 2011.
Subjects:	Application design; Application software > Development; Application software; Computer architecture; Parallel programming (Computer science)
Online Access:	Full text

Table of Contents:

Front Cover
CUDA Application Design and Development
Copyright
Dedication
Table of Contents
Foreword
Preface
1 First Programs and How to Think in CUDA
Source Code and Wiki
Distinguishing CUDA from Conventional Programming with a Simple Example
Choosing a CUDA API
Some Basic CUDA Concepts
Understanding Our First Runtime Kernel
Three Rules of GPGPU Programming
Rule 1: Get the Data on the GPU and Keep It There
Rule 2: Give the GPGPU Enough Work to Do
Rule 3: Focus on Data Reuse within the GPGPU to Avoid Memory Bandwidth Limitations
Big-O Considerations and Data Transfers
CUDA and Amdahl's Law
Data and Task Parallelism
Hybrid Execution: Using Both CPU and GPU Resources
Regression Testing and Accuracy
Silent Errors
Introduction to Debugging
UNIX Debugging
NVIDIA's cuda-gdb Debugger
The CUDA Memory Checker
Use cuda-gdb with the UNIX ddd Interface
Windows Debugging with Parallel Nsight
Summary
2 CUDA for Machine Learning and Optimization
Modeling and Simulation
Fitting Parameterized Models
Nelder-Mead Method
Levenberg-Marquardt Method
Algorithmic Speedups
Machine Learning and Neural Networks
XOR: An Important Nonlinear Machine-Learning Problem
An Example Objective Function
A Complete Functor for Multiple GPU Devices and the Host Processors
Brief Discussion of a Complete Nelder-Mead Optimization Code
Performance Results on XOR
Performance Discussion
Summary
The C++ Nelder-Mead Template
3 The CUDA Tool Suite: Profiling a PCA/NLPCA Functor
PCA and NLPCA
Autoencoders
An Example Functor for PCA Analysis
An Example Functor for NLPCA Analysis
Obtaining Basic Profile Information
Gprof: A Common UNIX Profiler
The NVIDIA Visual Profiler: Computeprof
Parallel Nsight for Microsoft Visual Studio
The Nsight Timeline Analysis
The NVTX Tracing Library
Scaling Behavior of the CUDA API
Tuning and Analysis Utilities (TAU)
Summary
4 The CUDA Execution Model
GPU Architecture Overview
Thread Scheduling: Orchestrating Performance and Parallelism via the Execution Configuration
Relevant computeprof Values for a Warp
Warp Divergence
Guidelines for Warp Divergence
Relevant computeprof Values for Warp Divergence
Warp Scheduling and TLP
Relevant computeprof Values for Occupancy
ILP: Higher Performance at Lower Occupancy
ILP Hides Arithmetic Latency
ILP Hides Data Latency
ILP in the Future
Relevant computeprof Values for Instruction Rates
Little's Law
CUDA Tools to Identify Limiting Factors
The nvcc Compiler
Launch Bounds
The Disassembler
PTX Kernels
GPU Emulators
Summary
5 CUDA Memory
The CUDA Memory Hierarchy
GPU Memory
L2 Cache
Relevant computeprof Values for the L2 Cache
L1 Cache
Relevant computeprof Values for the L1 Cache
CUDA Memory Types
Registers
Local memory
Relevant computeprof Values for Local Memory Cache
Shared Memory
Relevant computeprof Values for Shared Memory
Constant Memory
Texture Memory
Relevant computeprof Values for Texture Memory
Global Memory
Common Coalescing Use Cases
Allocation of Global Memory
Limiting Factors in the Design of Global Memory
Relevant computeprof Values for Global Memory
Summary
6 Efficiently Using GPU Memory
Reduction
The Reduction Template
A Test Program for functionReduce.h
Results
Utilizing Irregular Data Structures
Sparse Matrices and the CUSP Library
Graph Algorithms
SoA, AoS, and Other Structures
Tiles and Stencils
Summary
7 Techniques to Increase Parallelism
CUDA Contexts Extend Parallelism
Streams and Contexts
Multiple GPUs
Explicit Synchronization
Implicit Synchronization
The Unified Virtual Address Space
A Simple Example
Profiling Results
Out-of-Order Execution with Multiple Streams
Tip for Concurrent Kernel Execution on the Same GPU
Atomic Operations for Implicitly Concurrent Kernels
Tying Data to Computation
Manually Partitioning Data
Mapped Memory
How Mapped Memory Works
Summary
8 CUDA for All GPU and CPU Applications
Pathways from CUDA to Multiple Hardware Backends
The PGI CUDA x86 Compiler
The PGI CUDA x86 Compiler
An x86 core as an SM
The NVIDIA NVCC Compiler
Ocelot
Swan
MCUDA
Accessing CUDA from Other Languages
SWIG
Copperhead
EXCEL
MATLAB
Libraries
CUBLAS
CUFFT
MAGMA
phiGEMM Library
CURAND
Summary
9 Mixing CUDA and Rendering
OpenGL
GLUT
Mapping GPU Memory with OpenGL
Using Primitive Restart for 3D Performance
Introduction to the Files in the Framework
The Demo and Perlin Example Kernels
The Demo Kernel
The Demo Kernel to Generate a Colored Sinusoidal Surface
Perlin Noise
Using the Perlin Noise Kernel to Generate Artificial Terrain
The simpleGLmain.cpp File
The simpleVBO.cpp File
The callbacksVBO.cpp File
Summary
10 CUDA in a Cloud and Cluster Environments
The Message Passing Interface (MPI)
The MPI Programming Model
The MPI Communicator
MPI Rank
Master-Slave
Point-to-Point Basics
How MPI Communicates
Bandwidth
Balance Ratios
Considerations for Large MPI Runs
Scalability of the Initial Data Load
Using MPI to Perform a Calculation
Check Scalability
Cloud Computing
A Code Example
Data Generation
Summary
11 CUDA for Real Problems
Working with High-Dimensional Data
PCA/NLPCA
Multidimensional Scaling
K-Means Clustering
Expectation-Maximization
Support Vector Machines
Bayesian Networks
Mutual information
Force-Directed Graphs
Monte Carlo Methods
Molecular Modeling
Quantum Chemistry
Interactive Workflows
A Plethora of Projects
Summary
12 Application Focus on Live Streaming Video
Topics in Machine Vision
3D Effects
Segmentation of Flesh-colored Regions
Edge Detection
FFmpeg
TCP Server
Live Stream Application
kernelWave(): An Animated Kernel
kernelFlat(): Render the Image on a Flat Surface
kernelSkin(): Keep Only Flesh-colored Regions
kernelSobel(): A Simple Sobel Edge Detection Filter
The launch_kernel() Method
The simpleVBO.cpp File
The callbacksVBO.cpp File
Building and Running the Code
The Future
Machine Learning
The Connectome
Summary
Listing for simpleVBO.cpp
Works Cited
Index
A
B
C
D
E
F
G
H
I
J
K
L
M
N
O
P
Q
R
S
T
U
V
W
X
