Preface  6
Acknowledgments  7
Contents  8
Acronyms  11

Part I Introduction  14

1 Preliminaries  15
1.1 Introduction  15
1.1.1 Motivation  15
1.1.2 Before the Deep Learning Era  16
1.1.2.1 Feature Space Approaches  17
1.1.2.2 Model Space Approaches  18
1.2 Basic Formulation and Notations  18
1.2.1 General Notations (Tables 1.1 and 1.2)  19
1.2.2 Matrix and Vector Operations (Table 1.3)  20
1.2.3 Probability Distribution Functions (Table 1.4)  20
1.2.3.1 Expectation  21
1.2.3.2 Kullback–Leibler Divergence  21
1.2.4 Signal Processing  22
1.2.5 Automatic Speech Recognition  23
1.2.6 Hidden Markov Model  24
1.2.7 Gaussian Mixture Model  25
1.2.8 Neural Network  26
1.3 Book Organization  27
References  28

Part II Approaches to Robust Automatic Speech Recognition  30

2 Multichannel Speech Enhancement Approaches to DNN-Based Far-Field Speech Recognition  31
2.1 Introduction  31
2.1.1 Categories of Speech Enhancement  32
2.1.2 Problem Formulation  32
2.2 Dereverberation  34
2.2.1 Problem Description  34
2.2.2 Overview of Existing Dereverberation Approaches  36
2.2.3 Linear-Prediction-Based Dereverberation  37
2.3 Beamforming  39
2.3.1 Types of Beamformers  40
2.3.1.1 Delay-and-Sum Beamformer  40
2.3.1.2 Minimum Variance Distortionless Response Beamformer  42
2.3.1.3 Max-SNR Beamformer  43
2.3.1.4 Multichannel Wiener Filter  44
2.3.2 Parameter Estimation  45
2.3.2.1 TDOA Estimation  46
2.3.2.2 Steering-Vector Estimation  47
2.3.2.3 Time–Frequency-Masking-Based Spatial Correlation Matrix Estimation  48
2.4 Examples of Robust Front Ends  52
2.4.1 A Reverberation-Robust ASR System  53
2.4.1.1 Experimental Settings  53
2.4.1.2 Experimental Results  53
2.4.2 Robust ASR System for Mobile Devices  55
2.4.2.1 Experimental Settings  55
2.4.2.2 Experimental Results  56
2.5 Concluding Remarks and Discussion  56
References  57

3 Multichannel Spatial Clustering Using Model-Based Source Separation  60
3.1 Introduction  60
3.2 Multichannel Speech Signals  61
3.2.1 Binaural Cues Used by Human Listeners  62
3.2.2 Parameters for More than Two Channels  64
3.3 Spatial-Clustering Approaches  66
3.3.1 Binwise Clustering and Alignment  67
3.3.1.1 Cross-Frequency Source Alignment  68
3.3.2 Fuzzy c-Means Clustering of Direction of Arrival  69
3.3.3 Binaural Model-Based EM Source Separation and Localization (MESSL)  70
3.3.4 Multichannel MESSL  71
3.4 Mask-Smoothing Approaches  73
3.4.1 Fuzzy Clustering with Context Information  73
3.4.2 MESSL in a Markov Random Field  74
3.4.2.1 Pairwise Markov Random Fields  74
3.4.2.2 MESSL-MRF  75
3.5 Driving Beamforming from Spatial Clustering  76
3.6 Automatic Speech Recognition Experiments  78
3.6.1 Results  79
3.6.2 Example Separations  81
3.7 Conclusion  83
References  83

4 Discriminative Beamforming with Phase-Aware Neural Networks for Speech Enhancement and Recognition  87
4.1 Introduction  88
4.2 Beamforming for ASR  88
4.2.1 Geometric Beamforming  89
4.2.2 Statistical Methods  91
4.2.3 Learning-Based Methods  92
4.2.3.1 Maximum Likelihood Approach  92
4.2.3.2 Neural Network Approaches with Multichannel Inputs  93
4.2.3.3 Neural Networks for Better Spatial-Statistics Estimation  94
4.3 Beamforming Networks  95
4.3.1 Motivation  95
4.3.2 System Overview  95
4.3.3 Predicting Beamforming Weights by DNN  97
4.3.3.1 Extraction of GCC Features  98
4.3.3.2 Beamforming Weight Vector  100
4.3.4 Extraction of Log Mel Filterbanks  100
4.3.5 Training Procedure  102
4.4 Experiments  103
4.4.1 Settings  103
4.4.1.1 Corpus  103
4.4.1.2 Network Configurations  104
4.4.2 Beam Patterns  104
4.4.3 Speech Enhancement Results  107
4.4.4 Speech Recognition Results  107
4.5 Summary and Future Directions  109
References  110

5 Raw Multichannel Processing Using Deep Neural Networks  113
5.1 Introduction  114
5.2 Experimental Details  116
5.2.1 Data  116
5.2.2 Baseline Acoustic Model  117
5.3 Multichannel Raw-Waveform Neural Network  118
5.3.1 Motivation  118
5.3.2 Multichannel Filtering in the Time Domain  119
5.3.3 Filterbank Spatial Diversity  120
5.3.4 Comparison to Log Mel  123
5.3.5 Comparison to Oracle Knowledge of Speech TDOA  124
5.3.6 Summary  125
5.4 Factoring Spatial and Spectral Selectivity  125
5.4.1 Architecture  125
5.4.2 Number of Spatial Filters  127
5.4.3 Filter Analysis  127
5.4.4 Results Summary  129
5.5 Adaptive Beamforming  129
5.5.1 NAB Model  129
5.5.1.1 Adaptive Filters  130
5.5.1.2 Gated Feedback  131
5.5.1.3 Regularization with MTL  132
5.5.2 NAB Filter Analysis  132
5.5.3 Results Summary  133
5.6 Filtering in the Frequency Domain  134
5.6.1 Factored Model  134
5.6.1.1 Spatial Filtering  134
5.6.1.2 Spectral Filtering: Complex Linear Projection  134
5.6.2 NAB Model  135
5.6.3 Results: Factored Model  135
5.6.3.1 Performance  135
5.6.3.2 Comparison Between Learning in Time vs. Frequency  136
5.6.4 Results: Adaptive Model  138
5.7 Final Comparison, Rerecorded Data  138
5.8 Conclusions and Future Work  139
References  139

6 Novel Deep Architectures in Speech Processing  142
6.1 Introduction  143
6.1.1 Relationship to the Literature  144
6.2 General Formulation of Deep Unfolding  145
6.3 Unfolding Markov Random Fields  147
6.3.1 Mean-Field Inference  148
6.3.2 Belief Propagation  150
6.4 Deep Nonnegative Matrix Factorization  152
6.5 Multichannel Deep Unfolding  155
6.5.1 Source Separation Using Multichannel Gaussian Mixture Model  156
6.5.2 Unfolding the Multichannel Gaussian Mixture Model  158
6.5.3 MRF Extension of the MCGMM  159
6.5.4 Experiments and Discussion  161
6.6 End-to-End Deep Clustering  163
6.6.1 Deep-Clustering Model  164
6.6.2 Optimizing Signal Reconstruction  165
6.6.3 End-to-End Training  166
6.6.4 Experiments  167
6.6.4.1 ASR Performance  167
6.7 Conclusion  168
References  168

7 Deep Recurrent Networks for Separation and Recognition of Single-Channel Speech in Nonstationary Background Audio  172
7.1 Introduction  172
7.2 Problem Description  173
7.3 Learning-Free Methods  175
7.4 Nonnegative Matrix Factorization  176
7.5 Deep Learning for Source Separation  177
7.5.1 Recurrent and Long Short-Term Memory Networks  178
7.5.2 Mask Versus Signal Prediction  179
7.5.2.1 Ideal Masks and Phase-Sensitive Mask  179
7.5.2.2 Evaluating Ideal Masks  180
7.5.3 Loss Functions and Inputs  181
7.5.4 Phase-Sensitive Approximation Loss Function  182
7.5.5 Inputs to the Network  183
7.5.5.1 Spectral Features  183
7.5.5.2 Speech-State Information  183
7.5.5.3 Enhanced Features  184
7.6 Experiments and Results  185
7.6.1 Neural Network Training  185
7.6.2 Results on CHiME-2  186
7.6.3 Discussion of Results  191
7.7 Conclusion  191
References  191

8 Robust Features in Deep-Learning-Based Speech Recognition  194
8.1 Introduction  195
8.2 Background  197
8.3 Approaches  198
8.3.1 Speech Enhancement  199
8.3.2 Signal-Theoretic Techniques  200
8.3.3 Perceptually Motivated Features  200
8.3.3.1 TempoRAl PatternS (TRAPS)  202
8.3.3.2 Frequency-Domain Linear Prediction (FDLP)  203
8.3.3.3 Power-Normalized Cepstral Coefficients (PNCC)  204
8.3.3.4 Modulation Spectrum Features  204
8.3.3.5 Normalized Modulation Coefficient (NMC)  205
8.3.3.6 Modulation of Medium Duration Speech Amplitudes (MMeDuSA)  207
8.3.3.7 Two-Dimensional Modulation Extraction: Gabor Features  209
8.3.3.8 Damped Oscillator Coefficient (DOC)  210
8.3.4 Current Trends  212
8.4 Case Studies  214
8.4.1 Speech Processing for Noise- and Channel-Degraded Audio  214
8.4.2 Speech Processing Under Reverberated Conditions  215
8.5 Conclusion  217
References  218

9 Adaptation of Deep Neural Network Acoustic Models for Robust Automatic Speech Recognition  225
9.1 Introduction  225
9.1.1 DNN Adaptation Strategies  226
9.1.1.1 Test-Time Adaptation  227
9.1.1.2 Attribute-Aware Training  227
9.1.1.3 Adaptive Training  227
9.1.2 Overview of DNN Adaptation Methods  228
9.1.2.1 Constrained Adaptation  228
9.1.2.2 Feature Normalisation  228
9.1.2.3 Feature Augmentation  229
9.1.2.4 Structured DNN Parameterisation  229
9.1.3 Chapter Organisation  229
9.2 Feature Augmentation  230
9.2.1 Speaker-Aware Training  231
9.2.2 Noise-Aware Training  232
9.2.3 Room-Aware Training  233
9.2.4 Multiattribute-Aware Training  234
9.2.5 Refinement of Augmented Features  236
9.3 Structured DNN Parameterisation  237
9.3.1 Structured Bias Vectors  237
9.3.2 Structured Linear Transformation Adaptation  238
9.3.3 Learning Hidden Unit Contribution  239
9.3.4 SVD-Based Structure  239
9.3.5 Factorised Hidden Layer Adaptation  240
9.3.6 Cluster Adaptive Training for DNNs  241
9.4 Summary and Future Directions  243
References  244

10 Training Data Augmentation and Data Selection  250
10.1 Introduction  250
10.1.1 Data Augmentation in the Literature  251
10.1.2 Complementary Approaches  252
10.2 Data Augmentation in Mismatched Environments  253
10.2.1 Data Generation  253
10.2.2 Speech Enhancement  254
10.2.2.1 WPE-Based Dereverberation  254
10.2.2.2 Denoising Autoencoder  255
10.2.3 Results with Speech Enhancement on Test Data  255
10.2.4 Results with Training Data Augmentation  256
10.3 Data Selection  257
10.3.1 Introduction  257
10.3.2 Sequence-Summarizing Neural Network  258
10.3.3 Configuration of the Neural Network  260
10.3.4 Properties of the Extracted Vectors  261
10.3.5 Results with Data Selection  262
10.4 Conclusions  263
References  263

11 Advanced Recurrent Neural Networks for Automatic Speech Recognition  266
11.1 Introduction  266
11.2 Basic Deep Long Short-Term Memory RNNs  267
11.2.1 Long Short-Term Memory RNNs  267
11.2.2 Deep LSTM RNNs  268
11.3 Prediction–Adaptation–Correction Recurrent Neural Networks  268
11.4 Deep Long Short-Term Memory RNN Extensions  270
11.4.1 Highway RNNs  270
11.4.2 Bidirectional Highway LSTM RNNs  272
11.4.3 Latency-Controlled Bidirectional Highway LSTM RNNs  272
11.4.4 Grid LSTM RNNs  274
11.4.5 Residual LSTM RNNs  275
11.5 Experiment Setup  275
11.5.1 Corpus  275
11.5.1.1 IARPA-Babel Corpus  275
11.5.1.2 AMI Meeting Corpus  275
11.5.2 System Description  276
11.6 Evaluation  277
11.6.1 PAC-RNN  277
11.6.1.1 Low-Resource Language  277
11.6.1.2 Distant Speech Recognition  278
11.6.2 Highway LSTMP  279
11.6.2.1 Three-Layer Highway (B)LSTMP  279
11.6.2.2 Highway (B)LSTMP with Dropout  279
11.6.2.3 Deeper Highway LSTMP  280
11.6.2.4 Grid LSTMP  280
11.6.2.5 Residual LSTMP  281
11.6.2.6 Summary of Results  281
11.7 Conclusion  282
References  283

12 Sequence-Discriminative Training of Neural Networks  285
12.1 Introduction  285
12.2 Training Criteria  287
12.2.1 Maximum Mutual Information  287
12.2.2 Boosted Maximum Mutual Information  288
12.2.3 Minimum Phone Error/State-Level Minimum Bayes Risk  289
12.3 Practical Training Strategy  290
12.3.1 Criterion Selection  290
12.3.2 Frame-Smoothing  291
12.3.3 Lattice Generation  292
12.3.3.1 Numerator Lattice  292
12.3.3.2 Denominator Lattice  293
12.4 Two-Forward-Pass Method for Sequence Training  294
12.5 Experiment Setup  295
12.5.1 Corpus  296
12.5.2 System Description  296
12.6 Evaluation  297
12.6.1 Practical Strategy  297
12.6.2 Two-Forward-Pass Method  297
12.6.2.1 Speed  298
12.6.2.2 Performance  298
12.7 Conclusion  299
References  300

13 End-to-End Architectures for Speech Recognition  302
13.1 Introduction  302
13.1.1 Complexity and Suboptimality of the Conventional ASR Pipeline  303
13.1.2 Simplification of the Conventional ASR Pipeline  305
13.1.3 End-to-End Learning  306
13.2 End-to-End ASR Architectures  306
13.2.1 Connectionist Temporal Classification  307
13.2.2 Encoder–Decoder Paradigm  307
13.2.3 Learning the Front End  309
13.2.4 Other Ideas  310
13.3 The EESEN Framework  310
13.3.1 Model Structure  311
13.3.2 Model Training  312
13.3.3 Decoding  314
13.3.3.1 Grammar  315
13.3.3.2 Lexicon  315
13.3.3.3 Token  316
13.3.3.4 Search Graph  316
13.3.4 Experiments and Analysis  317
13.3.4.1 Wall Street Journal  317
13.3.4.2 Switchboard  319
13.3.4.3 HKUST Mandarin Chinese  320
13.4 Summary and Future Directions  321
References  322

Part III Resources  327

14 The CHiME Challenges: Robust Speech Recognition in Everyday Environments  328
14.1 Introduction  328
14.2 The 1st and 2nd CHiME Challenges (CHiME-1 and CHiME-2)  329
14.2.1 Domestic Noise Background  330
14.2.2 The Speech Recognition Task Design  330
14.2.2.1 CHiME-1: Small Vocabulary  331
14.2.2.2 CHiME-2 Track 1: Simulated Motion  331
14.2.2.3 CHiME-2 Track 2: Medium Vocabulary  332
14.2.3 Overview of System Performance  332
14.2.4 Interim Conclusions  333
14.3 The 3rd CHiME Challenge (CHiME-3)  334
14.3.1 The Mobile Tablet Recordings  334
14.3.2 The CHiME-3 Task Design: Real and Simulated Data  335
14.3.3 The CHiME-3 Baseline Systems  336
14.3.3.1 Simulation  336
14.3.3.2 Enhancement  336
14.3.3.3 ASR  337
14.4 The CHiME-3 Evaluations  337
14.4.1 An Overview of CHiME-3 System Performance  338
14.4.2 An Overview of Successful Strategies  338
14.4.2.1 Strategies for Improved Signal Enhancement  339
14.4.2.2 Strategies for Improved Statistical Modelling  339
14.4.2.3 Strategies for Improved System Training  340
14.4.3 Key Findings  340
14.5 Future Directions: CHiME-4 and Beyond  341
References  343

15 The REVERB Challenge: A Benchmark Task for Reverberation-Robust ASR Techniques  346
15.1 Introduction  347
15.2 Challenge Scenarios, Data, and Regulations  348
15.2.1 Scenarios Assumed in the Challenge  348
15.2.2 Data  348
15.2.2.1 Test Data: Dev and Eval Test Sets  348
15.2.2.2 Training Data  350
15.2.3 Regulations  350
15.3 Performance of Baseline and Top-Performing Systems  351
15.3.1 Benchmark Results with GMM-HMM and DNN-HMM Systems  351
15.3.2 Top-Performing 1-ch and 8-ch Systems  352
15.3.3 Current State-of-the-Art Performance  353
15.4 Summary and Remaining Challenges for Reverberant Speech Recognition  354
References  354

16 Distant Speech Recognition Experiments Using the AMI Corpus  356
16.1 Introduction  356
16.2 Meeting Corpora  357
16.3 Baseline Speech Recognition Experiments  359
16.4 Channel Concatenation Experiments  362
16.5 Convolutional Neural Networks  363
16.5.1 SDM Recordings  365
16.5.2 MDM Recordings  365
16.5.3 IHM Recordings  366
16.6 Discussion and Conclusions  367
References  367

17 Toolkits for Robust Speech Processing  370
17.1 Introduction  370
17.2 General Speech Recognition Toolkits  371
17.3 Language Model Toolkits  373
17.4 Speech Enhancement Toolkits  375
17.5 Deep Learning Toolkits  376
17.6 End-to-End Speech Recognition Toolkits  378
17.7 Other Resources for Speech Technology  380
17.8 Conclusion  380
References  381

Part IV Applications  384

18 Speech Research at Google to Enable Universal Speech Interfaces  385
18.1 Early Development  385
18.2 Voice Search  387
18.3 Text to Speech  387
18.4 Dictation/IME/Transcription  388
18.5 Internationalization  389
18.6 Neural-Network-Based Acoustic Modeling  391
18.7 Adaptive Language Modeling  392
18.8 Mobile-Device-Specific Technology  393
18.9 Robustness  395
References  396

19 Challenges in and Solutions to Deep Learning Network Acoustic Modeling in Speech Recognition Products at Microsoft  400
19.1 Introduction  401
19.2 Effective and Efficient DL Modeling  401
19.2.1 Reducing Run-Time Cost with SVD-Based Training  402
19.2.2 Speaker Adaptation on a Small Number of Parameters  402
19.2.2.1 SVD Bottleneck Adaptation  403
19.2.2.2 DNN Adaptation Through Activation Function  404
19.2.2.3 Low-Rank Plus Diagonal (LRPD) Adaptation  404
19.2.3 Improving the Accuracy of Small-Size DNNs with Teacher–Student Training  405
19.3 Invariance Modeling  406
19.3.1 Improving the Robustness to Accent/Dialect with Model Adaptation  406
19.3.2 Improving the Robustness to Acoustic Environment with Variable-Component DNN Modeling  408
19.3.3 Improving the Time and Frequency Invariance with Time–Frequency Long Short-Term Memory RNNs  409
19.3.4 Exploring the Generalization Capability to Unseen Data with Maximum Margin Sequence Training  409
19.4 Effective Training-Data Usage  411
19.4.1 Use of Unsupervised Data to Improve SR Accuracy  411
19.4.2 Expanded Language Capability by Reusing Speech-Training Material Across Languages  412
19.5 Conclusion  413
References  414

20 Advanced ASR Technologies for Mitsubishi Electric Speech Applications  417
20.1 Introduction  417
20.2 ASR for Car Navigation Systems  418
20.2.1 Introduction  418
20.2.2 ASR and Postprocessing Technologies  418
20.2.2.1 ASR Using Statistical LM  418
20.2.2.2 POI Name Search Using High-Speed Text Search Technique  419
20.2.2.3 Application to Commercial Car Navigation System  420
20.3 Dereverberation for Hands-Free Elevator  420
20.3.1 Introduction  420
20.3.2 A Dereverberation Method Using SS  421
20.3.3 Experiments  422
20.4 Discriminative Methods  423
20.4.1 Introduction  423
20.4.2 Discriminative Training for AMs  424
20.4.3 Discriminative Training for RNN-LM  425
20.5 Conclusion  426
References  427

Index  428