Preface  6
Acknowledgments  7
Contents  8
Acronyms  11

Part I Introduction  14

1 Preliminaries  15
1.1 Introduction  15
1.1.1 Motivation  15
1.1.2 Before the Deep Learning Era  16
1.1.2.1 Feature Space Approaches  17
1.1.2.2 Model Space Approaches  18
1.2 Basic Formulation and Notations  18
1.2.1 General Notations (Tables 1.1 and 1.2)  19
1.2.2 Matrix and Vector Operations (Table 1.3)  20
1.2.3 Probability Distribution Functions (Table 1.4)  20
1.2.3.1 Expectation  21
1.2.3.2 Kullback–Leibler Divergence  21
1.2.4 Signal Processing  22
1.2.5 Automatic Speech Recognition  23
1.2.6 Hidden Markov Model  24
1.2.7 Gaussian Mixture Model  25
1.2.8 Neural Network  26
1.3 Book Organization  27
References  28

Part II Approaches to Robust Automatic Speech Recognition  30

2 Multichannel Speech Enhancement Approaches to DNN-Based Far-Field Speech Recognition  31
2.1 Introduction  31
2.1.1 Categories of Speech Enhancement  32
2.1.2 Problem Formulation  32
2.2 Dereverberation  34
2.2.1 Problem Description  34
2.2.2 Overview of Existing Dereverberation Approaches  36
2.2.3 Linear-Prediction-Based Dereverberation  37
2.3 Beamforming  39
2.3.1 Types of Beamformers  40
2.3.1.1 Delay-and-Sum Beamformer  40
2.3.1.2 Minimum Variance Distortionless Response Beamformer  42
2.3.1.3 Max-SNR Beamformer  43
2.3.1.4 Multichannel Wiener Filter  44
2.3.2 Parameter Estimation  45
2.3.2.1 TDOA Estimation  46
2.3.2.2 Steering-Vector Estimation  47
2.3.2.3 Time–Frequency-Masking-Based Spatial Correlation Matrix Estimation  48
2.4 Examples of Robust Front Ends  52
2.4.1 A Reverberation-Robust ASR System  53
2.4.1.1 Experimental Settings  53
2.4.1.2 Experimental Results  53
2.4.2 Robust ASR System for Mobile Devices  55
2.4.2.1 Experimental Settings  55
2.4.2.2 Experimental Results  56
2.5 Concluding Remarks and Discussion  56
References  57

3 Multichannel Spatial Clustering Using Model-Based Source Separation  60
3.1 Introduction  60
3.2 Multichannel Speech Signals  61
3.2.1 Binaural Cues Used by Human Listeners  62
3.2.2 Parameters for More than Two Channels  64
3.3 Spatial-Clustering Approaches  66
3.3.1 Binwise Clustering and Alignment  67
3.3.1.1 Cross-Frequency Source Alignment  68
3.3.2 Fuzzy c-Means Clustering of Direction of Arrival  69
3.3.3 Binaural Model-Based EM Source Separation and Localization (MESSL)  70
3.3.4 Multichannel MESSL  71
3.4 Mask-Smoothing Approaches  73
3.4.1 Fuzzy Clustering with Context Information  73
3.4.2 MESSL in a Markov Random Field  74
3.4.2.1 Pairwise Markov Random Fields  74
3.4.2.2 MESSL-MRF  75
3.5 Driving Beamforming from Spatial Clustering  76
3.6 Automatic Speech Recognition Experiments  78
3.6.1 Results  79
3.6.2 Example Separations  81
3.7 Conclusion  83
References  83

4 Discriminative Beamforming with Phase-Aware Neural Networks for Speech Enhancement and Recognition  87
4.1 Introduction  88
4.2 Beamforming for ASR  88
4.2.1 Geometric Beamforming  89
4.2.2 Statistical Methods  91
4.2.3 Learning-Based Methods  92
4.2.3.1 Maximum Likelihood Approach  92
4.2.3.2 Neural Network Approaches with Multichannel Inputs  93
4.2.3.3 Neural Networks for Better Spatial-Statistics Estimation  94
4.3 Beamforming Networks  95
4.3.1 Motivation  95
4.3.2 System Overview  95
4.3.3 Predicting Beamforming Weights by DNN  97
4.3.3.1 Extraction of GCC Features  98
4.3.3.2 Beamforming Weight Vector  100
4.3.4 Extraction of Log Mel Filterbanks  100
4.3.5 Training Procedure  102
4.4 Experiments  103
4.4.1 Settings  103
4.4.1.1 Corpus  103
4.4.1.2 Network Configurations  104
4.4.2 Beam Patterns  104
4.4.3 Speech Enhancement Results  107
4.4.4 Speech Recognition Results  107
4.5 Summary and Future Directions  109
References  110

5 Raw Multichannel Processing Using Deep Neural Networks  113
5.1 Introduction  114
5.2 Experimental Details  116
5.2.1 Data  116
5.2.2 Baseline Acoustic Model  117
5.3 Multichannel Raw-Waveform Neural Network  118
5.3.1 Motivation  118
5.3.2 Multichannel Filtering in the Time Domain  119
5.3.3 Filterbank Spatial Diversity  120
5.3.4 Comparison to Log Mel  123
5.3.5 Comparison to Oracle Knowledge of Speech TDOA  124
5.3.6 Summary  125
5.4 Factoring Spatial and Spectral Selectivity  125
5.4.1 Architecture  125
5.4.2 Number of Spatial Filters  127
5.4.3 Filter Analysis  127
5.4.4 Results Summary  129
5.5 Adaptive Beamforming  129
5.5.1 NAB Model  129
5.5.1.1 Adaptive Filters  130
5.5.1.2 Gated Feedback  131
5.5.1.3 Regularization with MTL  132
5.5.2 NAB Filter Analysis  132
5.5.3 Results Summary  133
5.6 Filtering in the Frequency Domain  134
5.6.1 Factored Model  134
5.6.1.1 Spatial Filtering  134
5.6.1.2 Spectral Filtering: Complex Linear Projection  134
5.6.2 NAB Model  135
5.6.3 Results: Factored Model  135
5.6.3.1 Performance  135
5.6.3.2 Comparison Between Learning in Time vs. Frequency  136
5.6.4 Results: Adaptive Model  138
5.7 Final Comparison, Rerecorded Data  138
5.8 Conclusions and Future Work  139
References  139

6 Novel Deep Architectures in Speech Processing  142
6.1 Introduction  143
6.1.1 Relationship to the Literature  144
6.2 General Formulation of Deep Unfolding  145
6.3 Unfolding Markov Random Fields  147
6.3.1 Mean-Field Inference  148
6.3.2 Belief Propagation  150
6.4 Deep Nonnegative Matrix Factorization  152
6.5 Multichannel Deep Unfolding  155
6.5.1 Source Separation Using Multichannel Gaussian Mixture Model  156
6.5.2 Unfolding the Multichannel Gaussian Mixture Model  158
6.5.3 MRF Extension of the MCGMM  159
6.5.4 Experiments and Discussion  161
6.6 End-to-End Deep Clustering  163
6.6.1 Deep-Clustering Model  164
6.6.2 Optimizing Signal Reconstruction  165
6.6.3 End-to-End Training  166
6.6.4 Experiments  167
6.6.4.1 ASR Performance  167
6.7 Conclusion  168
References  168

7 Deep Recurrent Networks for Separation and Recognition of Single-Channel Speech in Nonstationary Background Audio  172
7.1 Introduction  172
7.2 Problem Description  173
7.3 Learning-Free Methods  175
7.4 Nonnegative Matrix Factorization  176
7.5 Deep Learning for Source Separation  177
7.5.1 Recurrent and Long Short-Term Memory Networks  178
7.5.2 Mask Versus Signal Prediction  179
7.5.2.1 Ideal Masks and Phase-Sensitive Mask  179
7.5.2.2 Evaluating Ideal Masks  180
7.5.3 Loss Functions and Inputs  181
7.5.4 Phase-Sensitive Approximation Loss Function  182
7.5.5 Inputs to the Network  183
7.5.5.1 Spectral Features  183
7.5.5.2 Speech-State Information  183
7.5.5.3 Enhanced Features  184
7.6 Experiments and Results  185
7.6.1 Neural Network Training  185
7.6.2 Results on CHiME-2  186
7.6.3 Discussion of Results  191
7.7 Conclusion  191
References  191

8 Robust Features in Deep-Learning-Based Speech Recognition  194
8.1 Introduction  195
8.2 Background  197
8.3 Approaches  198
8.3.1 Speech Enhancement  199
8.3.2 Signal-Theoretic Techniques  200
8.3.3 Perceptually Motivated Features  200
8.3.3.1 TempoRAl PatternS (TRAPS)  202
8.3.3.2 Frequency-Domain Linear Prediction (FDLP)  203
8.3.3.3 Power-Normalized Cepstral Coefficients (PNCC)  204
8.3.3.4 Modulation Spectrum Features  204
8.3.3.5 Normalized Modulation Coefficient (NMC)  205
8.3.3.6 Modulation of Medium Duration Speech Amplitudes (MMeDuSA)  207
8.3.3.7 Two-Dimensional Modulation Extraction: Gabor Features  209
8.3.3.8 Damped Oscillator Coefficient (DOC)  210
8.3.4 Current Trends  212
8.4 Case Studies  214
8.4.1 Speech Processing for Noise- and Channel-Degraded Audio  214
8.4.2 Speech Processing Under Reverberated Conditions  215
8.5 Conclusion  217
References  218

9 Adaptation of Deep Neural Network Acoustic Models for Robust Automatic Speech Recognition  225
9.1 Introduction  225
9.1.1 DNN Adaptation Strategies  226
9.1.1.1 Test-Time Adaptation  227
9.1.1.2 Attribute-Aware Training  227
9.1.1.3 Adaptive Training  227
9.1.2 Overview of DNN Adaptation Methods  228
9.1.2.1 Constrained Adaptation  228
9.1.2.2 Feature Normalisation  228
9.1.2.3 Feature Augmentation  229
9.1.2.4 Structured DNN Parameterisation  229
9.1.3 Chapter Organisation  229
9.2 Feature Augmentation  230
9.2.1 Speaker-Aware Training  231
9.2.2 Noise-Aware Training  232
9.2.3 Room-Aware Training  233
9.2.4 Multiattribute-Aware Training  234
9.2.5 Refinement of Augmented Features  236
9.3 Structured DNN Parameterisation  237
9.3.1 Structured Bias Vectors  237
9.3.2 Structured Linear Transformation Adaptation  238
9.3.3 Learning Hidden Unit Contribution  239
9.3.4 SVD-Based Structure  239
9.3.5 Factorised Hidden Layer Adaptation  240
9.3.6 Cluster Adaptive Training for DNNs  241
9.4 Summary and Future Directions  243
References  244

10 Training Data Augmentation and Data Selection  250
10.1 Introduction  250
10.1.1 Data Augmentation in the Literature  251
10.1.2 Complementary Approaches  252
10.2 Data Augmentation in Mismatched Environments  253
10.2.1 Data Generation  253
10.2.2 Speech Enhancement  254
10.2.2.1 WPE-Based Dereverberation  254
10.2.2.2 Denoising Autoencoder  255
10.2.3 Results with Speech Enhancement on Test Data  255
10.2.4 Results with Training Data Augmentation  256
10.3 Data Selection  257
10.3.1 Introduction  257
10.3.2 Sequence-Summarizing Neural Network  258
10.3.3 Configuration of the Neural Network  260
10.3.4 Properties of the Extracted Vectors  261
10.3.5 Results with Data Selection  262
10.4 Conclusions  263
References  263

11 Advanced Recurrent Neural Networks for Automatic Speech Recognition  266
11.1 Introduction  266
11.2 Basic Deep Long Short-Term Memory RNNs  267
11.2.1 Long Short-Term Memory RNNs  267
11.2.2 Deep LSTM RNNs  268
11.3 Prediction–Adaptation–Correction Recurrent Neural Networks  268
11.4 Deep Long Short-Term Memory RNN Extensions  270
11.4.1 Highway RNNs  270
11.4.2 Bidirectional Highway LSTM RNNs  272
11.4.3 Latency-Controlled Bidirectional Highway LSTM RNNs  272
11.4.4 Grid LSTM RNNs  274
11.4.5 Residual LSTM RNNs  275
11.5 Experiment Setup  275
11.5.1 Corpus  275
11.5.1.1 IARPA-Babel Corpus  275
11.5.1.2 AMI Meeting Corpus  275
11.5.2 System Description  276
11.6 Evaluation  277
11.6.1 PAC-RNN  277
11.6.1.1 Low-Resource Language  277
11.6.1.2 Distant Speech Recognition  278
11.6.2 Highway LSTMP  279
11.6.2.1 Three-Layer Highway (B)LSTMP  279
11.6.2.2 Highway (B)LSTMP with Dropout  279
11.6.2.3 Deeper Highway LSTMP  280
11.6.2.4 Grid LSTMP  280
11.6.2.5 Residual LSTMP  281
11.6.2.6 Summary of Results  281
11.7 Conclusion  282
References  283

12 Sequence-Discriminative Training of Neural Networks  285
12.1 Introduction  285
12.2 Training Criteria  287
12.2.1 Maximum Mutual Information  287
12.2.2 Boosted Maximum Mutual Information  288
12.2.3 Minimum Phone Error/State-Level Minimum Bayes Risk  289
12.3 Practical Training Strategy  290
12.3.1 Criterion Selection  290
12.3.2 Frame-Smoothing  291
12.3.3 Lattice Generation  292
12.3.3.1 Numerator Lattice  292
12.3.3.2 Denominator Lattice  293
12.4 Two-Forward-Pass Method for Sequence Training  294
12.5 Experiment Setup  295
12.5.1 Corpus  296
12.5.2 System Description  296
12.6 Evaluation  297
12.6.1 Practical Strategy  297
12.6.2 Two-Forward-Pass Method  297
12.6.2.1 Speed  298
12.6.2.2 Performance  298
12.7 Conclusion  299
References  300

13 End-to-End Architectures for Speech Recognition  302
13.1 Introduction  302
13.1.1 Complexity and Suboptimality of the Conventional ASR Pipeline  303
13.1.2 Simplification of the Conventional ASR Pipeline  305
13.1.3 End-to-End Learning  306
13.2 End-to-End ASR Architectures  306
13.2.1 Connectionist Temporal Classification  307
13.2.2 Encoder–Decoder Paradigm  307
13.2.3 Learning the Front End  309
13.2.4 Other Ideas  310
13.3 The EESEN Framework  310
13.3.1 Model Structure  311
13.3.2 Model Training  312
13.3.3 Decoding  314
13.3.3.1 Grammar  315
13.3.3.2 Lexicon  315
13.3.3.3 Token  316
13.3.3.4 Search Graph  316
13.3.4 Experiments and Analysis  317
13.3.4.1 Wall Street Journal  317
13.3.4.2 Switchboard  319
13.3.4.3 HKUST Mandarin Chinese  320
13.4 Summary and Future Directions  321
References  322

Part III Resources  327

14 The CHiME Challenges: Robust Speech Recognition in Everyday Environments  328
14.1 Introduction  328
14.2 The 1st and 2nd CHiME Challenges (CHiME-1 and CHiME-2)  329
14.2.1 Domestic Noise Background  330
14.2.2 The Speech Recognition Task Design  330
14.2.2.1 CHiME-1: Small Vocabulary  331
14.2.2.2 CHiME-2 Track 1: Simulated Motion  331
14.2.2.3 CHiME-2 Track 2: Medium Vocabulary  332
14.2.3 Overview of System Performance  332
14.2.4 Interim Conclusions  333
14.3 The 3rd CHiME Challenge (CHiME-3)  334
14.3.1 The Mobile Tablet Recordings  334
14.3.2 The CHiME-3 Task Design: Real and Simulated Data  335
14.3.3 The CHiME-3 Baseline Systems  336
14.3.3.1 Simulation  336
14.3.3.2 Enhancement  336
14.3.3.3 ASR  337
14.4 The CHiME-3 Evaluations  337
14.4.1 An Overview of CHiME-3 System Performance  338
14.4.2 An Overview of Successful Strategies  338
14.4.2.1 Strategies for Improved Signal Enhancement  339
14.4.2.2 Strategies for Improved Statistical Modelling  339
14.4.2.3 Strategies for Improved System Training  340
14.4.3 Key Findings  340
14.5 Future Directions: CHiME-4 and Beyond  341
References  343

15 The REVERB Challenge: A Benchmark Task for Reverberation-Robust ASR Techniques  346
15.1 Introduction  347
15.2 Challenge Scenarios, Data, and Regulations  348
15.2.1 Scenarios Assumed in the Challenge  348
15.2.2 Data  348
15.2.2.1 Test Data: Dev and Eval Test Sets  348
15.2.2.2 Training Data  350
15.2.3 Regulations  350
15.3 Performance of Baseline and Top-Performing Systems  351
15.3.1 Benchmark Results with GMM-HMM and DNN-HMM Systems  351
15.3.2 Top-Performing 1-ch and 8-ch Systems  352
15.3.3 Current State-of-the-Art Performance  353
15.4 Summary and Remaining Challenges for Reverberant Speech Recognition  354
References  354

16 Distant Speech Recognition Experiments Using the AMI Corpus  356
16.1 Introduction  356
16.2 Meeting Corpora  357
16.3 Baseline Speech Recognition Experiments  359
16.4 Channel Concatenation Experiments  362
16.5 Convolutional Neural Networks  363
16.5.1 SDM Recordings  365
16.5.2 MDM Recordings  365
16.5.3 IHM Recordings  366
16.6 Discussion and Conclusions  367
References  367

17 Toolkits for Robust Speech Processing  370
17.1 Introduction  370
17.2 General Speech Recognition Toolkits  371
17.3 Language Model Toolkits  373
17.4 Speech Enhancement Toolkits  375
17.5 Deep Learning Toolkits  376
17.6 End-to-End Speech Recognition Toolkits  378
17.7 Other Resources for Speech Technology  380
17.8 Conclusion  380
References  381

Part IV Applications  384

18 Speech Research at Google to Enable Universal Speech Interfaces  385
18.1 Early Development  385
18.2 Voice Search  387
18.3 Text to Speech  387
18.4 Dictation/IME/Transcription  388
18.5 Internationalization  389
18.6 Neural-Network-Based Acoustic Modeling  391
18.7 Adaptive Language Modeling  392
18.8 Mobile-Device-Specific Technology  393
18.9 Robustness  395
References  396

19 Challenges in and Solutions to Deep Learning Network Acoustic Modeling in Speech Recognition Products at Microsoft  400
19.1 Introduction  401
19.2 Effective and Efficient DL Modeling  401
19.2.1 Reducing Run-Time Cost with SVD-Based Training  402
19.2.2 Speaker Adaptation on a Small Number of Parameters  402
19.2.2.1 SVD Bottleneck Adaptation  403
19.2.2.2 DNN Adaptation Through Activation Function  404
19.2.2.3 Low-Rank Plus Diagonal (LRPD) Adaptation  404
19.2.3 Improving the Accuracy of Small-Size DNNs with Teacher–Student Training  405
19.3 Invariance Modeling  406
19.3.1 Improving the Robustness to Accent/Dialect with Model Adaptation  406
19.3.2 Improving the Robustness to Acoustic Environment with Variable-Component DNN Modeling  408
19.3.3 Improving the Time and Frequency Invariance with Time–Frequency Long Short-Term Memory RNNs  409
19.3.4 Exploring the Generalization Capability to Unseen Data with Maximum Margin Sequence Training  409
19.4 Effective Training-Data Usage  411
19.4.1 Use of Unsupervised Data to Improve SR Accuracy  411
19.4.2 Expanded Language Capability by Reusing Speech-Training Material Across Languages  412
19.5 Conclusion  413
References  414

20 Advanced ASR Technologies for Mitsubishi Electric Speech Applications  417
20.1 Introduction  417
20.2 ASR for Car Navigation Systems  418
20.2.1 Introduction  418
20.2.2 ASR and Postprocessing Technologies  418
20.2.2.1 ASR Using Statistical LM  418
20.2.2.2 POI Name Search Using High-Speed Text Search Technique  419
20.2.2.3 Application to Commercial Car Navigation System  420
20.3 Dereverberation for Hands-Free Elevator  420
20.3.1 Introduction  420
20.3.2 A Dereverberation Method Using SS  421
20.3.3 Experiments  422
20.4 Discriminative Methods  423
20.4.1 Introduction  423
20.4.2 Discriminative Training for AMs  424
20.4.3 Discriminative Training for RNN-LM  425
20.5 Conclusion  426
References  427

Index  428