Techniques for Noise Robustness in Automatic Speech Recognition.
Automatic speech recognition (ASR) systems are finding increasing use in everyday life. Many of the commonplace environments where the systems are used are noisy, for example users calling up a voice search system from a busy cafeteria or a street. This can result in degraded speech recordings and a...
Classification: | Electronic Book |
---|---|
Format: | Electronic (eBook) |
Language: | English |
Published: | New York : Wiley, 2012. |
Online Access: | Full text |
Table of Contents:
- List of Contributors xv
- Acknowledgments xvii
- 1 Introduction 1 / Tuomas Virtanen, Rita Singh, Bhiksha Raj
- 1.1 Scope of the Book 1
- 1.2 Outline 2
- 1.3 Notation 4
- Part One FOUNDATIONS
- 2 The Basics of Automatic Speech Recognition 9 / Rita Singh, Bhiksha Raj, Tuomas Virtanen
- 2.1 Introduction 9
- 2.2 Speech Recognition Viewed as Bayes Classification 10
- 2.3 Hidden Markov Models 11
- 2.3.1 Computing Probabilities with HMMs 12
- 2.3.2 Determining the State Sequence 17
- 2.3.3 Learning HMM Parameters 19
- 2.3.4 Additional Issues Relating to Speech Recognition Systems 20
- 2.4 HMM-Based Speech Recognition 24
- 2.4.1 Representing the Signal 24
- 2.4.2 The HMM for a Word Sequence 25
- 2.4.3 Searching through all Word Sequences 26
- References 29
- 3 The Problem of Robustness in Automatic Speech Recognition 31 / Bhiksha Raj, Tuomas Virtanen, Rita Singh
- 3.1 Errors in Bayes Classification 31
- 3.1.1 Type 1 Condition: Mismatch Error 33
- 3.1.2 Type 2 Condition: Increased Bayes Error 34
- 3.2 Bayes Classification and ASR 35
- 3.2.1 All We Have is a Model: A Type 1 Condition 35
- 3.2.2 Intrinsic Interferences - Signal Components that are Unrelated to the Message: A Type 2 Condition 36
- 3.2.3 External Interferences - The Data are Noisy: Type 1 and Type 2 Conditions 36
- 3.3 External Influences on Speech Recordings 36
- 3.3.1 Signal Capture 37
- 3.3.2 Additive Corruptions 41
- 3.3.3 Reverberation 42
- 3.3.4 A Simplified Model of Signal Capture 43
- 3.4 The Effect of External Influences on Recognition 44
- 3.5 Improving Recognition under Adverse Conditions 46
- 3.5.1 Handling the Model Mismatch Error 46
- 3.5.2 Dealing with Intrinsic Variations in the Data 47
- 3.5.3 Dealing with Extrinsic Variations 47
- References 50
- Part Two SIGNAL ENHANCEMENT
- 4 Voice Activity Detection, Noise Estimation, and Adaptive Filters for Acoustic Signal Enhancement 53 / Rainer Martin, Dorothea Kolossa
- 4.1 Introduction 53
- 4.2 Signal Analysis and Synthesis 55
- 4.2.1 DFT-Based Analysis Synthesis with Perfect Reconstruction 55
- 4.2.2 Probability Distributions for Speech and Noise DFT Coefficients 57
- 4.3 Voice Activity Detection 58
- 4.3.1 VAD Design Principles 58
- 4.3.2 Evaluation of VAD Performance 62
- 4.3.3 Evaluation in the Context of ASR 62
- 4.4 Noise Power Spectrum Estimation 65
- 4.4.1 Smoothing Techniques 65
- 4.4.2 Histogram and GMM Noise Estimation Methods 67
- 4.4.3 Minimum Statistics Noise Power Estimation 67
- 4.4.4 MMSE Noise Power Estimation 68
- 4.4.5 Estimation of the A Priori Signal-to-Noise Ratio 69
- 4.5 Adaptive Filters for Signal Enhancement 71
- 4.5.1 Spectral Subtraction 71
- 4.5.2 Nonlinear Spectral Subtraction 73
- 4.5.3 Wiener Filtering 74
- 4.5.4 The ETSI Advanced Front End 75
- 4.5.5 Nonlinear MMSE Estimators 75
- 4.6 ASR Performance 80
- 4.7 Conclusions 81
- References 82
- 5 Extraction of Speech from Mixture Signals 87 / Paris Smaragdis
- 5.1 The Problem with Mixtures 87
- 5.2 Multichannel Mixtures 88
- 5.2.1 Basic Problem Formulation 88
- 5.2.2 Convolutive Mixtures 92
- 5.3 Single-Channel Mixtures 98
- 5.3.1 Problem Formulation 98
- 5.3.2 Learning Sound Models 100
- 5.3.3 Separation by Spectrogram Factorization 101
- 5.3.4 Dealing with Unknown Sounds 105
- 5.4 Variations and Extensions 107
- 5.5 Conclusions 107
- References 107
- 6 Microphone Arrays 109 / John McDonough, Kenichi Kumatani
- 6.1 Speaker Tracking 110
- 6.2 Conventional Microphone Arrays 113
- 6.3 Conventional Adaptive Beamforming Algorithms 120
- 6.3.1 Minimum Variance Distortionless Response Beamformer 120
- 6.3.2 Noise Field Models 122
- 6.3.3 Subband Analysis and Synthesis 123
- 6.3.4 Beamforming Performance Criteria 126
- 6.3.5 Generalized Sidelobe Canceller Implementation 129
- 6.3.6 Recursive Implementation of the GSC 130
- 6.3.7 Other Conventional GSC Beamformers 131
- 6.3.8 Beamforming based on Higher Order Statistics 132
- 6.3.9 Online Implementation 136
- 6.3.10 Speech-Recognition Experiments 140
- 6.4 Spherical Microphone Arrays 142
- 6.5 Spherical Adaptive Algorithms 148
- 6.6 Comparative Studies 149
- 6.7 Comparison of Linear and Spherical Arrays for DSR 152
- 6.8 Conclusions and Further Reading 154
- References 155
- Part Three FEATURE ENHANCEMENT
- 7 From Signals to Speech Features by Digital Signal Processing 161 / Matthias Wölfel
- 7.1 Introduction 161
- 7.1.1 About this Chapter 162
- 7.2 The Speech Signal 162
- 7.3 Spectral Processing 163
- 7.3.1 Windowing 163
- 7.3.2 Power Spectrum 165
- 7.3.3 Spectral Envelopes 166
- 7.3.4 LP Envelope 166
- 7.3.5 MVDR Envelope 169
- 7.3.6 Warping the Frequency Axis 171
- 7.3.7 Warped LP Envelope 175
- 7.3.8 Warped MVDR Envelope 176
- 7.3.9 Comparison of Spectral Estimates 177
- 7.3.10 The Spectrogram 179
- 7.4 Cepstral Processing 179
- 7.4.1 Definition and Calculation of Cepstral Coefficients 180
- 7.4.2 Characteristics of Cepstral Sequences 181
- 7.5 Influence of Distortions on Different Speech Features 182
- 7.5.1 Objective Functions 182
- 7.5.2 Robustness against Noise 185
- 7.5.3 Robustness against Echo and Reverberation 187
- 7.5.4 Robustness against Changes in Fundamental Frequency 189
- 7.6 Summary and Further Reading 191
- References 191
- 8 Features Based on Auditory Physiology and Perception 193 / Richard M. Stern, Nelson Morgan
- 8.1 Introduction 193
- 8.2 Some Attributes of Auditory Physiology and Perception 194
- 8.2.1 Peripheral Processing 194
- 8.2.2 Processing at more Central Levels 200
- 8.2.3 Psychoacoustical Correlates of Physiological Observations 202
- 8.2.4 The Impact of Auditory Processing on Conventional Feature Extraction 206
- 8.2.5 Summary 208
- 8.3 "Classic" Auditory Representations 208
- 8.4 Current Trends in Auditory Feature Analysis 213
- 8.5 Summary 221
- Acknowledgments 222
- References 222
- 9 Feature Compensation 229 / Jasha Droppo
- 9.1 Life in an Ideal World 229
- 9.1.1 Noise Robustness Tasks 229
- 9.1.2 Probabilistic Feature Enhancement 230
- 9.1.3 Gaussian Mixture Models 231
- 9.2 MMSE-SPLICE 232
- 9.2.1 Parameter Estimation 233
- 9.2.2 Results 236
- 9.3 Discriminative SPLICE 237
- 9.3.1 The MMI Objective Function 238
- 9.3.2 Training the Front-End Parameters 239
- 9.3.3 The Rprop Algorithm 240
- 9.3.4 Results 241
- 9.4 Model-Based Feature Enhancement 242
- 9.4.1 The Additive Noise-Mixing Equation 243
- 9.4.2 The Joint Probability Model 244
- 9.4.3 Vector Taylor Series Approximation 246
- 9.4.4 Estimating Clean Speech 247
- 9.4.5 Results 247
- 9.5 Switching Linear Dynamic System 248
- 9.6 Conclusion 249
- References 249
- 10 Reverberant Speech Recognition 251 / Reinhold Haeb-Umbach, Alexander Krueger
- 10.1 Introduction 251
- 10.2 The Effect of Reverberation 252
- 10.2.1 What is Reverberation? 252
- 10.2.2 The Relationship between Clean and Reverberant Speech Features 254
- 10.2.3 The Effect of Reverberation on ASR Performance 258
- 10.3 Approaches to Reverberant Speech Recognition 258
- 10.3.1 Signal-Based Techniques 259
- 10.3.2 Front-End Techniques 260
- 10.3.3 Back-End Techniques 262
- 10.3.4 Concluding Remarks 265
- 10.4 Feature Domain Model of the Acoustic Impulse Response 265
- 10.5 Bayesian Feature Enhancement 267
- 10.5.1 Basic Approach 268
- 10.5.2 Measurement Update 269
- 10.5.3 Time Update 270
- 10.5.4 Inference 271
- 10.6 Experimental Results 272
- 10.6.1 Databases 272
- 10.6.2 Overview of the Tested Methods 273
- 10.6.3 Recognition Results on Reverberant Speech 274
- 10.6.4 Recognition Results on Noisy Reverberant Speech 276
- 10.7 Conclusions 277
- Acknowledgment 278
- References 278
- Part Four MODEL ENHANCEMENT
- 11 Adaptation and Discriminative Training of Acoustic Models 285 / Yannick Estève, Paul Deléglise
- 11.1 Introduction 285
- 11.1.1 Acoustic Models 286
- 11.1.2 Maximum Likelihood Estimation 287
- 11.2 Acoustic Model Adaptation and Noise Robustness 288
- 11.2.1 Static (or Offline) Adaptation 289
- 11.2.2 Dynamic (or Online) Adaptation 289
- 11.3 Maximum A Posteriori Reestimation 290
- 11.4 Maximum Likelihood Linear Regression 293
- 11.4.1 Class Regression Tree 294
- 11.4.2 Constrained Maximum Likelihood Linear Regression 297
- 11.4.3 CMLLR Implementation 297
- 11.4.4 Speaker Adaptive Training 298
- 11.5 Discriminative Training 299
- 11.5.1 MMI Discriminative Training Criterion 301
- 11.5.2 MPE Discriminative Training Criterion 302
- 11.5.3 I-smoothing 303
- 11.5.4 MPE Implementation 304
- 11.6 Conclusion 307
- References 308
- 12 Factorial Models for Noise Robust Speech Recognition 311 / John R. Hershey, Steven J. Rennie, Jonathan Le Roux
- 12.1 Introduction 311
- 12.2 The Model-Based Approach 313
- 12.3 Signal Feature Domains 314
- 12.4 Interaction Models 317
- 12.4.1 Exact Interaction Model 318
- 12.4.2 Max Model 320
- 12.4.3 Log-Sum Model 321
- 12.4.4 Mel Interaction Model 321
- 12.5 Inference Methods 322
- 12.5.1 Max Model Inference 322
- 12.5.2 Parallel Model Combination 324
- 12.5.3 Vector Taylor Series Approaches 326
- 12.5.4 SNR-Dependent Approaches 331
- 12.6 Efficient Likelihood Evaluation in Factorial Models 332
- 12.6.1 Efficient Inference using the Max Model 332
- 12.6.2 Efficient Vector-Taylor Series Approaches 334
- 12.6.3 Band Quantization 335
- 12.7 Current Directions 337
- 12.7.1 Dynamic Noise Models for Robust ASR 338
- 12.7.2 Multi-Talker Speech Recognition using Graphical Models 339
- 12.7.3 Noise Robust ASR using Non-Negative Basis Representations 340
- References 341
- 13 Acoustic Model Training for Robust Speech Recognition 347 / Michael L. Seltzer
- 13.1 Introduction 347
- 13.2 Traditional Training Methods for Robust Speech Recognition 348
- 13.3 A Brief Overview of Speaker Adaptive Training 349
- 13.4 Feature-Space Noise Adaptive Training 351
- 13.4.1 Experiments using fNAT 352
- 13.5 Model-Space Noise Adaptive Training 353
- 13.6 Noise Adaptive Training using VTS Adaptation 355
- 13.6.1 Vector Taylor Series HMM Adaptation 355
- 13.6.2 Updating the Acoustic Model Parameters 357
- 13.6.3 Updating the Environmental Parameters 360
- 13.6.4 Implementation Details 360
- 13.6.5 Experiments using NAT 361
- 13.7 Discussion 364
- 13.7.1 Comparison of Training Algorithms 364
- 13.7.2 Comparison to Speaker Adaptive Training 364
- 13.7.3 Related Adaptive Training Methods 365
- 13.8 Conclusion 366
- References 366
- Part Five COMPENSATION FOR INFORMATION LOSS
- 14 Missing-Data Techniques: Recognition with Incomplete Spectrograms 371 / Jon Barker
- 14.1 Introduction 371
- 14.2 Classification with Incomplete Data 373
- 14.2.1 A Simple Missing Data Scenario 374
- 14.2.2 Missing Data Theory 376
- 14.2.3 Validity of the MAR Assumption 378
- 14.2.4 Marginalising Acoustic Models 379
- 14.3 Energetic Masking 381
- 14.3.1 The Max Approximation 381
- 14.3.2 Bounded Marginalisation 382
- 14.3.3 Missing Data ASR in the Cepstral Domain 384
- 14.3.4 Missing Data ASR with Dynamic Features 386
- 14.4 Meta-Missing Data: Dealing with Mask Uncertainty 388
- 14.4.1 Missing Data with Soft Masks 388
- 14.4.2 Sub-band Combination Approaches 391
- 14.4.3 Speech Fragment Decoding 393
- 14.5 Some Perspectives on Performance 395
- References 396
- 15 Missing-Data Techniques: Feature Reconstruction 399 / Jort Florent Gemmeke, Ulpu Remes
- 15.1 Introduction 399
- 15.2 Missing-Data Techniques 401
- 15.3 Correlation-Based Imputation 402
- 15.3.1 Fundamentals 402
- 15.3.2 Implementation 404
- 15.4 Cluster-Based Imputation 406
- 15.4.1 Fundamentals 406
- 15.4.2 Implementation 408
- 15.4.3 Advances 409
- 15.5 Class-Conditioned Imputation 411
- 15.5.1 Fundamentals 411
- 15.5.2 Implementation 412
- 15.5.3 Advances 413
- 15.6 Sparse Imputation 414
- 15.6.1 Fundamentals 414
- 15.6.2 Implementation 416
- 15.6.3 Advances 418
- 15.7 Other Feature-Reconstruction Methods 420
- 15.7.1 Parametric Approaches 420
- 15.7.2 Nonparametric Approaches 421
- 15.8 Experimental Results 421
- 15.8.1 Feature-Reconstruction Methods 422
- 15.8.2 Comparison with Other Methods 424
- 15.8.3 Advances 426
- 15.8.4 Combination with Other Methods 427
- 15.9 Discussion and Conclusion 428
- Acknowledgments 429
- References 430
- 16 Computational Auditory Scene Analysis and Automatic Speech Recognition 433 / Arun Narayanan, DeLiang Wang
- 16.1 Introduction 433
- 16.2 Auditory Scene Analysis 434
- 16.3 Computational Auditory Scene Analysis 435
- 16.3.1 Ideal Binary Mask 435
- 16.3.2 Typical CASA Architecture 438
- 16.4 CASA Strategies 440
- 16.4.1 IBM Estimation Based on Local SNR Estimates 440
- 16.4.2 IBM Estimation using ASA Cues 442
- 16.4.3 IBM Estimation as Binary Classification 448
- 16.4.4 Binaural Mask Estimation Strategies 451
- 16.5 Integrating CASA with ASR 452
- 16.5.1 Uncertainty Transform Model 454
- 16.6 Concluding Remarks 458
- Acknowledgment 458
- References 458
- 17 Uncertainty Decoding 463 / Hank Liao
- 17.1 Introduction 463
- 17.2 Observation Uncertainty 465
- 17.3 Uncertainty Decoding 466
- 17.4 Feature-Based Uncertainty Decoding 468
- 17.4.1 SPLICE with Uncertainty 470
- 17.4.2 Front-End Joint Uncertainty Decoding 471
- 17.4.3 Issues with Feature-Based Uncertainty Decoding 472
- 17.5 Model-Based Joint Uncertainty Decoding 473
- 17.5.1 Parameter Estimation 475
- 17.5.2 Comparisons with Other Methods 476
- 17.6 Noisy CMLLR 477
- 17.7 Uncertainty and Adaptive Training 480
- 17.7.1 Gradient-Based Methods 481
- 17.7.2 Factor Analysis Approaches 482
- 17.8 In Combination with Other Techniques 483
- 17.9 Conclusions 484
- References 485
- Index 487