Preface

Contents

Part I  Overview of Speech and Audio Coding

1  From “Harmonic Telegraph” to Cellular Phones
   1.1  Introduction
      1.1.1  The Multiple Telegraph “Harmonic Telegraph”
      1.1.2  Bell's Theory of Transmitting Speech
   1.2  Early History of the Telephone
      1.2.1  The Telephone Is Born
      1.2.2  Birth of the Telephone Company
         1.2.2.1  Research at Bell Company
         1.2.2.2  New York to San Francisco Telephone Service in 1915, Nobel Prize, and More
   1.3  Speech Bandwidth Compression at AT&T
      1.3.1  Early Research on “vocoders”
      1.3.2  Predictive Coding
      1.3.3  Efficient Encoding of Prediction Error
         1.3.3.1  Some Comments on the Nature of Prediction Error for Speech
         1.3.3.2  Information Rate of Gaussian Signals with Specified Fidelity Criterion
         1.3.3.3  Predictive Coding with Specified Error Spectrum
         1.3.3.4  Overcoming the Computational Complexity of Predictive Coders
   1.4  Cellular Telephone Service
      1.4.1  Digital Cellular Standards
         1.4.1.1  North American Digital Cellular Standards
         1.4.1.2  European Digital Cellular Standards
   1.5  The Future
   References

2  Challenges in Speech Coding Research
   2.1  Introduction
   2.2  Speech Coding
      2.2.1  Speech Coding Methods
         2.2.1.1  Waveform Coding [2]
         2.2.1.2  Subband and Transform Methods [2]
         2.2.1.3  Analysis-by-Synthesis Methods [2, 10]
         2.2.1.4  Postfiltering [11]
         2.2.1.5  Voice Activity Detection and Silence Coding
      2.2.2  Speech Coding Standards
         2.2.2.1  ITU-T Standards
         2.2.2.2  Digital Cellular Standards
         2.2.2.3  VoIP Standards
   2.3  Audio Coding [25, 26]
   2.4  Newer Standards
   2.5  Emerging Topics
   2.6  Conclusions and Future Research Directions
   References

3  Scalable and Multi-Rate Speech Coding for Voice-over-Internet Protocol (VoIP) Networks
   3.1  Introduction
   3.2  VoIP Networks
      3.2.1  Overview of VoIP Networks
      3.2.2  Robust Voice Communication
      3.2.3  Packet Loss Concealment (PLC)
   3.3  Analysis-by-Synthesis Speech Coding
      3.3.1  Analysis-by-Synthesis Principles
      3.3.2  CELP-Based Coders
         3.3.2.1  Perceptual Error Weighting
         3.3.2.2  Pitch Estimation
   3.4  Multi-Rate Speech Coding
      3.4.1  Basic Principles
      3.4.2  Adaptive Multi-Rate (AMR) Codec
   3.5  Scalable Speech Coding
      3.5.1  Basic Principles
      3.5.2  Standardized Scalable Speech Codecs
         3.5.2.1  ITU-T G.729.1
         3.5.2.2  ITU-T G.718
   3.6  Packet-Loss Robust Speech Coding
      3.6.1  Internet Low Bitrate Codec (iLBC)
      3.6.2  Scalable Multi-Rate Speech Codec
         3.6.2.1  Narrowband Codec
         3.6.2.2  Wideband Codec
   3.7  Conclusions
   References

4  Recent Speech Coding Technologies and Standards
   4.1  Recent Speech Codec Technologies and Features
      4.1.1  Active Speech Source-Controlled Variable Bit Rate, Constant Bit Rate Operation and Voice Activity Detectors
         4.1.1.1  Source-Controlled Variable Bit Rate (SC-VBR) Versus Constant/Fixed Bit Rate (CBR) Vocoders
      4.1.2  Layered Coding
      4.1.3  Bandwidth Extension of Speech
         4.1.3.1  Harmonic Bandwidth Extension Architecture
         4.1.3.2  Spectral Band Replication (SBR)
      4.1.4  Blind Bandwidth Extension
         4.1.4.1  High Band Model and Prediction Methods
         4.1.4.2  BBE for Speech Coding
         4.1.4.3  BBE for Bandwidth Increase
         4.1.4.4  Quality Evaluation
         4.1.4.5  Encoder Based BBE
      4.1.5  Packet Loss Concealment
         4.1.5.1  Code Excited Linear Prediction Coders
         4.1.5.2  Adaptive Differential Pulse Code Modulation (ADPCM) Based Coders
      4.1.6  Voice Over Internet Protocol (VoIP)
         4.1.6.1  Management of Time Varying Delay
         4.1.6.2  Packet Loss Concealment for VoIP
   4.2  Recent Speech Coding Standards
      4.2.1  Advanced Standards in ITU-T
         4.2.1.1  G.729.1: Scalable Extension of G.729
         4.2.1.2  G.718: Layered Coder with Interoperable Modes
         4.2.1.3  Super-Wideband Extensions: G.729.1 Annex E and G.718 Annex B
         4.2.1.4  G.711.1: Scalable Wideband Extension of G.711
         4.2.1.5  Super-Wideband and Stereo Extensions of G.711.1 and G.722
         4.2.1.6  Full-Band Coding in G.719
         4.2.1.7  G.711.0 Lossless Coding
         4.2.1.8  Packet Loss Concealment Algorithms for G.711 and G.722
      4.2.2  IETF Codecs and Transport Protocols
         4.2.2.1  Opus Codec
            Audio Bandwidths and Bit Rate Sweet Spots
            Variable and Constant Bit Rate Modes of Operation
            Mono and Stereo Coding
            Packet Loss Resilience
            Forward Error Correction (Low Bit Rate Redundancy)
         4.2.2.2  RTP Payload Formats
      4.2.3  3GPP and the Enhanced Voice Services (EVS) Codec
      4.2.4  Recent Codec Development in 3GPP2
      4.2.5  Conversational Codecs in MPEG
   References

Part II  Review and Challenges in Speech, Speaker and Emotion Recognition

5  Ensemble Learning Approaches in Speech Recognition
   5.1  Introduction
   5.2  Background of Ensemble Methods in Machine Learning
      5.2.1  Ensemble Learning
      5.2.2  Boosting
      5.2.3  Bagging
      5.2.4  Random Forest
      5.2.5  Classifier Combination
      5.2.6  Ensemble Error Analyses
         5.2.6.1  Added Error of an Ensemble Classifier
         5.2.6.2  Bias–Variance–Covariance Decomposition
         5.2.6.3  Error-Ambiguity Decomposition
      5.2.7  Diversity Measures
      5.2.8  Ensemble Pruning
      5.2.9  Ensemble Clustering
   5.3  Background of Speech Recognition
      5.3.1  State-of-the-Art Speech Recognition System Architecture
      5.3.2  Front-End Processing
      5.3.3  Lexicon
      5.3.4  Acoustic Model
      5.3.5  Language Model
      5.3.6  Decoding Search
   5.4  Generating and Combining Diversity in Speech Recognition
      5.4.1  System Places for Generating Diversity
         5.4.1.1  Front End Processing
         5.4.1.2  Acoustic Model
         5.4.1.3  Language Model
      5.4.2  System Levels for Utilizing Diversity
         5.4.2.1  Utterance Level Combination
         5.4.2.2  Word Level Combination
         5.4.2.3  Subword Level Combination
         5.4.2.4  State Level Combination
         5.4.2.5  Feature Level Combination
   5.5  Ensemble Learning Techniques for Acoustic Modeling
      5.5.1  Explicit Diversity Generation
         5.5.1.1  Boosting
         5.5.1.2  Minimum Bayes Risk Leveraging (MBRL)
         5.5.1.3  Directed Decision Trees
         5.5.1.4  Deep Stacking Network
      5.5.2  Implicit Diversity Generation
         5.5.2.1  Multiple Systems and Multiple Models
         5.5.2.2  Random Forest
         5.5.2.3  Data Sampling
   5.6  Ensemble Learning Techniques for Language Modeling
   5.7  Performance Enhancing Mechanism of Ensemble Learning
      5.7.1  Classification Margin
      5.7.2  Diversity
      5.7.3  Bias and Variance
   5.8  Compacting Ensemble Models to Improve Efficiency
      5.8.1  Model Clustering
      5.8.2  Density Matching
   5.9  Conclusion
   References

6  Deep Dynamic Models for Learning Hidden Representations of Speech Features
   6.1  Introduction
   6.2  Generative Deep-Structured Speech Dynamics: Model Formulation
      6.2.1  Generative Learning in Speech Recognition
      6.2.2  A Hidden Dynamic Model with Nonlinear Observation Equation
      6.2.3  A Linear Hidden Dynamic Model Amenable to Variational EM Training
   6.3  Generative Deep-Structured Speech Dynamics: Model Estimation
      6.3.1  Learning a Hidden Dynamic Model Using the Extended Kalman Filter
         6.3.1.1  E-Step
         6.3.1.2  M-Step
      6.3.2  Learning a Hidden Dynamic Model Using Variational EM
         6.3.2.1  Model Inference and Learning
         6.3.2.2  The GMM Posterior
         6.3.2.3  The HMM Posterior
   6.4  Discriminative Deep Neural Networks Aided by Generative Pre-training
      6.4.1  Restricted Boltzmann Machines
      6.4.2  Stacking Up RBMs to Form a DBN
      6.4.3  Interfacing the DNN with an HMM to Incorporate Sequential Dynamics
   6.5  Recurrent Neural Networks for Discriminative Modeling of Speech Dynamics
      6.5.1  RNNs Expressed in the State-Space Formalism
      6.5.2  The BPTT Learning Algorithm
      6.5.3  The EKF Learning Algorithm
   6.6  Comparing Two Types of Dynamic Models
      6.6.1  Top-Down Versus Bottom-Up
         6.6.1.1  Top-Down Generative Hidden Dynamic Modeling
         6.6.1.2  Bottom-Up Discriminative Recurrent Neural Networks and the “Generative” Counterpart
      6.6.2  Localist Versus Distributed Representations
      6.6.3  Latent Explanatory Variables Versus End-to-End Discriminative Learning
      6.6.4  Parsimonious Versus Massive Parameters
      6.6.5  Comparing Recognition Accuracy of the Two Types of Models
   6.7  Summary and Discussions on Future Directions
   References

7  Speech Based Emotion Recognition
   7.1  Introduction
      7.1.1  What Are Emotions?
      7.1.2  Emotion Labels
      7.1.3  The Emotion Recognition Task
   7.2  Emotion Classification Systems
      7.2.1  Short-Term Features
         7.2.1.1  Pitch
         7.2.1.2  Loudness/Energy
         7.2.1.3  Spectral Features
         7.2.1.4  Cepstral Features
      7.2.2  High Dimensional Representation
         7.2.2.1  Functional Approach to a High-Dimensional Representation
         7.2.2.2  GMM Supervector Approach to High-Dimensional Representation
      7.2.3  Modelling Emotions
         7.2.3.1  Emotion Models: Linear Support Vector Machines
         7.2.3.2  Emotion Models: Nonlinear Support Vector Machines
      7.2.4  Alternative Emotion Modelling Methodologies
         7.2.4.1  Supra-Frame Level Feature
         7.2.4.2  Dynamic Emotion Models
   7.3  Dealing with Variability
      7.3.1  Phonetic Variability in Emotion Recognition Systems
      7.3.2  Speaker Variability
         7.3.2.1  Speaker Normalisation
         7.3.2.2  Speaker Adaptation
   7.4  Comparing Systems
   7.5  Conclusions
   References

8  Speaker Diarization: An Emerging Research
   8.1  Overview
   8.2  Signal Processing
      8.2.1  Wiener Filtering
      8.2.2  Acoustic Beamforming
   8.3  Feature Extraction
      8.3.1  Acoustic Features
         8.3.1.1  Short-Term Spectral Features
         8.3.1.2  Prosodic Features
      8.3.2  Sound Source Features
      8.3.3  Feature Normalization Techniques
         8.3.3.1  RASTA Filtering
         8.3.3.2  Cepstral Mean Normalization
         8.3.3.3  Feature Warping
   8.4  Speech Activity Detection
      8.4.1  Energy-Based Speech Detection
      8.4.2  Model Based Speech Detection
      8.4.3  Hybrid Speech Detection
      8.4.4  Multi-Channel Speech Detection
   8.5  Clustering Architecture
      8.5.1  Speaker Modeling
         8.5.1.1  Gaussian Mixture Model
         8.5.1.2  Hidden Markov Model
         8.5.1.3  Total Factor Vector
         8.5.1.4  Other Modeling Approaches
      8.5.2  Distance Measures
         8.5.2.1  Symmetric Kullback-Leibler Distance
         8.5.2.2  Divergence Shape Distance
         8.5.2.3  Arithmetic Harmonic Sphericity
         8.5.2.4  Generalized Likelihood Ratio
         8.5.2.5  Bayesian Information Criterion
         8.5.2.6  Cross Likelihood Ratio
         8.5.2.7  Normalized Cross Likelihood Ratio
         8.5.2.8  Other Distance Measures
      8.5.3  Speaker Segmentation
         8.5.3.1  Silence Detection Based Methods
         8.5.3.2  Metric-Based Segmentation
         8.5.3.3  Hybrid Segmentation
         8.5.3.4  Segmentation Evaluation
      8.5.4  Speaker Clustering
         8.5.4.1  Agglomerative Hierarchical Clustering
         8.5.4.2  Divisive Hierarchical Clustering
         8.5.4.3  Other Approaches
         8.5.4.4  Multiple Systems Combination
      8.5.5  Online Speaker Clustering
            Segmentation
            Novelty Detection
            Speaker Modeling
         8.5.5.1  Speaker Clustering Evaluation
   8.6  Speaker Diarization Evaluation
   8.7  Databases for Speaker Diarization in Meeting
   8.8  Related Projects in Meeting Room
   8.9  NIST Rich Transcription Benchmarks
   8.10  Summary
   References

Part III  Current Trends in Speech Enhancement

9  Maximum A Posteriori Spectral Estimation with Source Log-Spectral Priors for Multichannel Speech Enhancement
   9.1  Introduction
   9.2  Signal Representation and Modeling for Multichannel Speech Enhancement
      9.2.1  General Speech Capture Scenario for Multichannel Speech Enhancement
      9.2.2  Time-Frequency Domain Representation of Signals
      9.2.3  Generative Model of Desired Signals
      9.2.4  Generative Model of Interference
   9.3  Speech Enhancement Based on Maximum Likelihood Spectral Estimation (MLSE)
      9.3.1  Maximum Likelihood Spectral Estimation (MLSE)
      9.3.2  Processing Flow of MLSE Based Speech Enhancement
   9.4  Speech Enhancement Based on Maximum A Posteriori Spectral Estimation (MAPSE)
      9.4.1  Maximum A Posteriori Spectral Estimation (MAPSE)
      9.4.2  Log-Spectral Prior of Speech
      9.4.3  Expectation Maximization (EM) Algorithm
      9.4.4  Update of n,f Based on Newton–Raphson Method
      9.4.5  Processing Flow
   9.5  Application to Blind Source Separation (BSS)
      9.5.1  MLSE for BSS (ML-BSS)
         9.5.1.1  Generative Models for ML-BSS
         9.5.1.2  MLSE Based on EM Algorithm
         9.5.1.3  Processing Flow of ML-BSS Based on EM Algorithm
      9.5.2  MAPSE for BSS (MAP-BSS)
         9.5.2.1  Generative Models for MAP-BSS
         9.5.2.2  MAPSE Based on EM Algorithm
         9.5.2.3  Processing Flow of MAP-BSS Based on EM Algorithm
         9.5.2.4  Initialization of and (or )
   9.6  Experiments
      9.6.1  Evaluation 1 with Aurora-2 Speech Database
      9.6.2  Evaluation 2 with SiSEC Database
   9.7  Concluding Remarks
   References

10  Modulation Processing for Speech Enhancement
   10.1  Introduction
   10.2  Methods
      10.2.1  Modulation AMS-Based Framework
      10.2.2  Modulation Spectral Subtraction
      10.2.3  MMSE Modulation Magnitude Estimation
         10.2.3.1  MMSE Modulation Magnitude Estimation with SPU
         10.2.3.2  MMSE Log-Modulation Magnitude Estimation
         10.2.3.3  MME Parameters
   10.3  Speech Quality Assessment
   10.4  Evaluation of Short-Time Modulation-Domain Based Methods with Respect to Quality
   10.5  Conclusion
   References