Preface

Contents

Part I  Overview of Speech and Audio Coding

1  From “Harmonic Telegraph” to Cellular Phones
   1.1  Introduction
      1.1.1  The Multiple Telegraph “Harmonic Telegraph”
      1.1.2  Bell's Theory of Transmitting Speech
   1.2  Early History of the Telephone
      1.2.1  The Telephone Is Born
      1.2.2  Birth of the Telephone Company
         1.2.2.1  Research at Bell Company
         1.2.2.2  New York to San Francisco Telephone Service in 1915, Nobel Prize, and More
   1.3  Speech Bandwidth Compression at AT&T
      1.3.1  Early Research on “vocoders”
      1.3.2  Predictive Coding
      1.3.3  Efficient Encoding of Prediction Error
         1.3.3.1  Some Comments on the Nature of Prediction Error for Speech
         1.3.3.2  Information Rate of Gaussian Signals with Specified Fidelity Criterion
         1.3.3.3  Predictive Coding with Specified Error Spectrum
         1.3.3.4  Overcoming the Computational Complexity of Predictive Coders
   1.4  Cellular Telephone Service
      1.4.1  Digital Cellular Standards
         1.4.1.1  North American Digital Cellular Standards
         1.4.1.2  European Digital Cellular Standards
   1.5  The Future
   References

2  Challenges in Speech Coding Research
   2.1  Introduction
   2.2  Speech Coding
      2.2.1  Speech Coding Methods
         2.2.1.1  Waveform Coding [2]
         2.2.1.2  Subband and Transform Methods [2]
         2.2.1.3  Analysis-by-Synthesis Methods [2, 10]
         2.2.1.4  Postfiltering [11]
         2.2.1.5  Voice Activity Detection and Silence Coding
      2.2.2  Speech Coding Standards
         2.2.2.1  ITU-T Standards
         2.2.2.2  Digital Cellular Standards
         2.2.2.3  VoIP Standards
   2.3  Audio Coding [25, 26]
   2.4  Newer Standards
   2.5  Emerging Topics
   2.6  Conclusions and Future Research Directions
   References

3  Scalable and Multi-Rate Speech Coding for Voice-over-Internet Protocol (VoIP) Networks
   3.1  Introduction
   3.2  VoIP Networks
      3.2.1  Overview of VoIP Networks
      3.2.2  Robust Voice Communication
      3.2.3  Packet Loss Concealment (PLC)
   3.3  Analysis-by-Synthesis Speech Coding
      3.3.1  Analysis-by-Synthesis Principles
      3.3.2  CELP-Based Coders
         3.3.2.1  Perceptual Error Weighting
         3.3.2.2  Pitch Estimation
   3.4  Multi-Rate Speech Coding
      3.4.1  Basic Principles
      3.4.2  Adaptive Multi-Rate (AMR) Codec
   3.5  Scalable Speech Coding
      3.5.1  Basic Principles
      3.5.2  Standardized Scalable Speech Codecs
         3.5.2.1  ITU-T G.729.1
         3.5.2.2  ITU-T G.718
   3.6  Packet-Loss Robust Speech Coding
      3.6.1  Internet Low Bitrate Codec (iLBC)
      3.6.2  Scalable Multi-Rate Speech Codec
         3.6.2.1  Narrowband Codec
         3.6.2.2  Wideband Codec
   3.7  Conclusions
   References

4  Recent Speech Coding Technologies and Standards
   4.1  Recent Speech Codec Technologies and Features
      4.1.1  Active Speech Source-Controlled Variable Bit Rate, Constant Bit Rate Operation and Voice Activity Detectors
         4.1.1.1  Source-Controlled Variable Bit Rate (SC-VBR) Versus Constant/Fixed Bit Rate (CBR) Vocoders
      4.1.2  Layered Coding
      4.1.3  Bandwidth Extension of Speech
         4.1.3.1  Harmonic Bandwidth Extension Architecture
         4.1.3.2  Spectral Band Replication (SBR)
      4.1.4  Blind Bandwidth Extension
         4.1.4.1  High Band Model and Prediction Methods
         4.1.4.2  BBE for Speech Coding
         4.1.4.3  BBE for Bandwidth Increase
         4.1.4.4  Quality Evaluation
         4.1.4.5  Encoder Based BBE
      4.1.5  Packet Loss Concealment
         4.1.5.1  Code Excited Linear Prediction Coders
         4.1.5.2  Adaptive Differential Pulse Code Modulation (ADPCM) Based Coders
      4.1.6  Voice Over Internet Protocol (VoIP)
         4.1.6.1  Management of Time Varying Delay
         4.1.6.2  Packet Loss Concealment for VoIP
   4.2  Recent Speech Coding Standards
      4.2.1  Advanced Standards in ITU-T
         4.2.1.1  G.729.1: Scalable Extension of G.729
         4.2.1.2  G.718: Layered Coder with Interoperable Modes
         4.2.1.3  Super-Wideband Extensions: G.729.1 Annex E and G.718 Annex B
         4.2.1.4  G.711.1: Scalable Wideband Extension of G.711
         4.2.1.5  Super-Wideband and Stereo Extensions of G.711.1 and G.722
         4.2.1.6  Full-Band Coding in G.719
         4.2.1.7  G.711.0 Lossless Coding
         4.2.1.8  Packet Loss Concealment Algorithms for G.711 and G.722
      4.2.2  IETF Codecs and Transport Protocols
         4.2.2.1  Opus Codec
            Audio Bandwidths and Bit Rate Sweet Spots
            Variable and Constant Bit Rate Modes of Operation
            Mono and Stereo Coding
            Packet Loss Resilience
            Forward Error Correction (Low Bit Rate Redundancy)
         4.2.2.2  RTP Payload Formats
      4.2.3  3GPP and the Enhanced Voice Services (EVS) Codec
      4.2.4  Recent Codec Development in 3GPP2
      4.2.5  Conversational Codecs in MPEG
   References

Part II  Review and Challenges in Speech, Speaker and Emotion Recognition

5  Ensemble Learning Approaches in Speech Recognition
   5.1  Introduction
   5.2  Background of Ensemble Methods in Machine Learning
      5.2.1  Ensemble Learning
      5.2.2  Boosting
      5.2.3  Bagging
      5.2.4  Random Forest
      5.2.5  Classifier Combination
      5.2.6  Ensemble Error Analyses
         5.2.6.1  Added Error of an Ensemble Classifier
         5.2.6.2  Bias–Variance–Covariance Decomposition
         5.2.6.3  Error-Ambiguity Decomposition
      5.2.7  Diversity Measures
      5.2.8  Ensemble Pruning
      5.2.9  Ensemble Clustering
   5.3  Background of Speech Recognition
      5.3.1  State-of-the-Art Speech Recognition System Architecture
      5.3.2  Front-End Processing
      5.3.3  Lexicon
      5.3.4  Acoustic Model
      5.3.5  Language Model
      5.3.6  Decoding Search
   5.4  Generating and Combining Diversity in Speech Recognition
      5.4.1  System Places for Generating Diversity
         5.4.1.1  Front End Processing
         5.4.1.2  Acoustic Model
         5.4.1.3  Language Model
      5.4.2  System Levels for Utilizing Diversity
         5.4.2.1  Utterance Level Combination
         5.4.2.2  Word Level Combination
         5.4.2.3  Subword Level Combination
         5.4.2.4  State Level Combination
         5.4.2.5  Feature Level Combination
   5.5  Ensemble Learning Techniques for Acoustic Modeling
      5.5.1  Explicit Diversity Generation
         5.5.1.1  Boosting
         5.5.1.2  Minimum Bayes Risk Leveraging (MBRL)
         5.5.1.3  Directed Decision Trees
         5.5.1.4  Deep Stacking Network
      5.5.2  Implicit Diversity Generation
         5.5.2.1  Multiple Systems and Multiple Models
         5.5.2.2  Random Forest
         5.5.2.3  Data Sampling
   5.6  Ensemble Learning Techniques for Language Modeling
   5.7  Performance Enhancing Mechanism of Ensemble Learning
      5.7.1  Classification Margin
      5.7.2  Diversity
      5.7.3  Bias and Variance
   5.8  Compacting Ensemble Models to Improve Efficiency
      5.8.1  Model Clustering
      5.8.2  Density Matching
   5.9  Conclusion
   References

6  Deep Dynamic Models for Learning Hidden Representations of Speech Features
   6.1  Introduction
   6.2  Generative Deep-Structured Speech Dynamics: Model Formulation
      6.2.1  Generative Learning in Speech Recognition
      6.2.2  A Hidden Dynamic Model with Nonlinear Observation Equation
      6.2.3  A Linear Hidden Dynamic Model Amenable to Variational EM Training
   6.3  Generative Deep-Structured Speech Dynamics: Model Estimation
      6.3.1  Learning a Hidden Dynamic Model Using the Extended Kalman Filter
         6.3.1.1  E-Step
         6.3.1.2  M-Step
      6.3.2  Learning a Hidden Dynamic Model Using Variational EM
         6.3.2.1  Model Inference and Learning
         6.3.2.2  The GMM Posterior
         6.3.2.3  The HMM Posterior
   6.4  Discriminative Deep Neural Networks Aided by Generative Pre-training
      6.4.1  Restricted Boltzmann Machines
      6.4.2  Stacking Up RBMs to Form a DBN
      6.4.3  Interfacing the DNN with an HMM to Incorporate Sequential Dynamics
   6.5  Recurrent Neural Networks for Discriminative Modeling of Speech Dynamics
      6.5.1  RNNs Expressed in the State-Space Formalism
      6.5.2  The BPTT Learning Algorithm
      6.5.3  The EKF Learning Algorithm
   6.6  Comparing Two Types of Dynamic Models
      6.6.1  Top-Down Versus Bottom-Up
         6.6.1.1  Top-Down Generative Hidden Dynamic Modeling
         6.6.1.2  Bottom-Up Discriminative Recurrent Neural Networks and the “Generative” Counterpart
      6.6.2  Localist Versus Distributed Representations
      6.6.3  Latent Explanatory Variables Versus End-to-End Discriminative Learning
      6.6.4  Parsimonious Versus Massive Parameters
      6.6.5  Comparing Recognition Accuracy of the Two Types of Models
   6.7  Summary and Discussions on Future Directions
   References

7  Speech Based Emotion Recognition
   7.1  Introduction
      7.1.1  What Are Emotions?
      7.1.2  Emotion Labels
      7.1.3  The Emotion Recognition Task
   7.2  Emotion Classification Systems
      7.2.1  Short-Term Features
         7.2.1.1  Pitch
         7.2.1.2  Loudness/Energy
         7.2.1.3  Spectral Features
         7.2.1.4  Cepstral Features
      7.2.2  High Dimensional Representation
         7.2.2.1  Functional Approach to a High-Dimensional Representation
         7.2.2.2  GMM Supervector Approach to High-Dimensional Representation
      7.2.3  Modelling Emotions
         7.2.3.1  Emotion Models: Linear Support Vector Machines
         7.2.3.2  Emotion Models: Nonlinear Support Vector Machines
      7.2.4  Alternative Emotion Modelling Methodologies
         7.2.4.1  Supra-Frame Level Feature
         7.2.4.2  Dynamic Emotion Models
   7.3  Dealing with Variability
      7.3.1  Phonetic Variability in Emotion Recognition Systems
      7.3.2  Speaker Variability
         7.3.2.1  Speaker Normalisation
         7.3.2.2  Speaker Adaptation
   7.4  Comparing Systems
   7.5  Conclusions
   References

8  Speaker Diarization: An Emerging Research
   8.1  Overview
   8.2  Signal Processing
      8.2.1  Wiener Filtering
      8.2.2  Acoustic Beamforming
   8.3  Feature Extraction
      8.3.1  Acoustic Features
         8.3.1.1  Short-Term Spectral Features
         8.3.1.2  Prosodic Features
      8.3.2  Sound Source Features
      8.3.3  Feature Normalization Techniques
         8.3.3.1  RASTA Filtering
         8.3.3.2  Cepstral Mean Normalization
         8.3.3.3  Feature Warping
   8.4  Speech Activity Detection
      8.4.1  Energy-Based Speech Detection
      8.4.2  Model Based Speech Detection
      8.4.3  Hybrid Speech Detection
      8.4.4  Multi-Channel Speech Detection
   8.5  Clustering Architecture
      8.5.1  Speaker Modeling
         8.5.1.1  Gaussian Mixture Model
         8.5.1.2  Hidden Markov Model
         8.5.1.3  Total Factor Vector
         8.5.1.4  Other Modeling Approaches
      8.5.2  Distance Measures
         8.5.2.1  Symmetric Kullback-Leibler Distance
         8.5.2.2  Divergence Shape Distance
         8.5.2.3  Arithmetic Harmonic Sphericity
         8.5.2.4  Generalized Likelihood Ratio
         8.5.2.5  Bayesian Information Criterion
         8.5.2.6  Cross Likelihood Ratio
         8.5.2.7  Normalized Cross Likelihood Ratio
         8.5.2.8  Other Distance Measures
      8.5.3  Speaker Segmentation
         8.5.3.1  Silence Detection Based Methods
         8.5.3.2  Metric-Based Segmentation
         8.5.3.3  Hybrid Segmentation
         8.5.3.4  Segmentation Evaluation
      8.5.4  Speaker Clustering
         8.5.4.1  Agglomerative Hierarchical Clustering
         8.5.4.2  Divisive Hierarchical Clustering
         8.5.4.3  Other Approaches
         8.5.4.4  Multiple Systems Combination
      8.5.5  Online Speaker Clustering
            Segmentation
            Novelty Detection
            Speaker Modeling
         8.5.5.1  Speaker Clustering Evaluation
   8.6  Speaker Diarization Evaluation
   8.7  Databases for Speaker Diarization in Meeting
   8.8  Related Projects in Meeting Room
   8.9  NIST Rich Transcription Benchmarks
   8.10  Summary
   References

Part III  Current Trends in Speech Enhancement

9  Maximum A Posteriori Spectral Estimation with Source Log-Spectral Priors for Multichannel Speech Enhancement
   9.1  Introduction
   9.2  Signal Representation and Modeling for Multichannel Speech Enhancement
      9.2.1  General Speech Capture Scenario for Multichannel Speech Enhancement
      9.2.2  Time-Frequency Domain Representation of Signals
      9.2.3  Generative Model of Desired Signals
      9.2.4  Generative Model of Interference
   9.3  Speech Enhancement Based on Maximum Likelihood Spectral Estimation (MLSE)
      9.3.1  Maximum Likelihood Spectral Estimation (MLSE)
      9.3.2  Processing Flow of MLSE Based Speech Enhancement
   9.4  Speech Enhancement Based on Maximum A Posteriori Spectral Estimation (MAPSE)
      9.4.1  Maximum A Posteriori Spectral Estimation (MAPSE)
      9.4.2  Log-Spectral Prior of Speech
      9.4.3  Expectation Maximization (EM) Algorithm
      9.4.4  Update of n,f Based on Newton–Raphson Method
      9.4.5  Processing Flow
   9.5  Application to Blind Source Separation (BSS)
      9.5.1  MLSE for BSS (ML-BSS)
         9.5.1.1  Generative Models for ML-BSS
         9.5.1.2  MLSE Based on EM Algorithm
         9.5.1.3  Processing Flow of ML-BSS Based on EM Algorithm
      9.5.2  MAPSE for BSS (MAP-BSS)
         9.5.2.1  Generative Models for MAP-BSS
         9.5.2.2  MAPSE Based on EM Algorithm
         9.5.2.3  Processing Flow of MAP-BSS Based on EM Algorithm
         9.5.2.4  Initialization of and (or )
   9.6  Experiments
      9.6.1  Evaluation 1 with Aurora-2 Speech Database
      9.6.2  Evaluation 2 with SiSEC Database
   9.7  Concluding Remarks
   References

10  Modulation Processing for Speech Enhancement
   10.1  Introduction
   10.2  Methods
      10.2.1  Modulation AMS-Based Framework
      10.2.2  Modulation Spectral Subtraction
      10.2.3  MMSE Modulation Magnitude Estimation
         10.2.3.1  MMSE Modulation Magnitude Estimation with SPU
         10.2.3.2  MMSE Log-Modulation Magnitude Estimation
         10.2.3.3  MME Parameters
   10.3  Speech Quality Assessment
   10.4  Evaluation of Short-Time Modulation-Domain Based Methods with Respect to Quality
   10.5  Conclusion
   References