Speech and Audio Processing for Coding, Enhancement and Recognition
by: Tokunbo Ogunfunmi, Roberto Togneri, Madihally Sim Narasimha
Springer-Verlag, 2014
ISBN: 9781493914562
347 pages, download: 6254 KB

Format: PDF
suitable for: Apple iPad, Android tablet PCs; online reading on PC, Mac, laptop

Type: B (parallel access)

 

 
Table of Contents

  Preface 6  
  Contents 10  
  Part I Overview of Speech and Audio Coding 12  
     1 From “Harmonic Telegraph” to Cellular Phones 13  
        1.1 Introduction 13  
           1.1.1 The Multiple Telegraph “Harmonic Telegraph” 14  
           1.1.2 Bell's Theory of Transmitting Speech 14  
        1.2 Early History of the Telephone 15  
           1.2.1 The Telephone Is Born 15  
           1.2.2 Birth of the Telephone Company 15  
              1.2.2.1 Research at Bell Company 16  
              1.2.2.2 New York to San Francisco Telephone Service in 1915, Nobel Prize, and More 16  
        1.3 Speech Bandwidth Compression at AT&T 17  
           1.3.1 Early Research on “vocoders” 17  
           1.3.2 Predictive Coding 18  
           1.3.3 Efficient Encoding of Prediction Error 19  
              1.3.3.1 Some Comments on the Nature of Prediction Error for Speech 19  
              1.3.3.2 Information Rate of Gaussian Signals with Specified Fidelity Criterion 20  
              1.3.3.3 Predictive Coding with Specified Error Spectrum 20  
              1.3.3.4 Overcoming the Computational Complexity of Predictive Coders 22  
        1.4 Cellular Telephone Service 24  
           1.4.1 Digital Cellular Standards 25  
              1.4.1.1 North American Digital Cellular Standards 25  
              1.4.1.2 European Digital Cellular Standards 25  
        1.5 The Future 26  
        References 26  
     2 Challenges in Speech Coding Research 28  
        2.1 Introduction 28  
        2.2 Speech Coding 29  
           2.2.1 Speech Coding Methods 30  
              2.2.1.1 Waveform Coding [2] 30  
              2.2.1.2 Subband and Transform Methods [2] 31  
              2.2.1.3 Analysis-by-Synthesis Methods [2, 10] 32  
              2.2.1.4 Postfiltering [11] 35  
              2.2.1.5 Voice Activity Detection and Silence Coding 35  
           2.2.2 Speech Coding Standards 35  
              2.2.2.1 ITU-T Standards 36  
              2.2.2.2 Digital Cellular Standards 38  
              2.2.2.3 VoIP Standards 40  
        2.3 Audio Coding [25, 26] 40  
        2.4 Newer Standards 41  
        2.5 Emerging Topics 45  
        2.6 Conclusions and Future Research Directions 46  
        References 46  
     3 Scalable and Multi-Rate Speech Coding for Voice-over-Internet Protocol (VoIP) Networks 49  
        3.1 Introduction 49  
        3.2 VoIP Networks 50  
           3.2.1 Overview of VoIP Networks 51  
           3.2.2 Robust Voice Communication 51  
           3.2.3 Packet Loss Concealment (PLC) 51  
        3.3 Analysis-by-Synthesis Speech Coding 53  
           3.3.1 Analysis-by-Synthesis Principles 53  
           3.3.2 CELP-Based Coders 53  
              3.3.2.1 Perceptual Error Weighting 56  
              3.3.2.2 Pitch Estimation 56  
        3.4 Multi-Rate Speech Coding 57  
           3.4.1 Basic Principles 57  
           3.4.2 Adaptive Multi-Rate (AMR) Codec 59  
        3.5 Scalable Speech Coding 60  
           3.5.1 Basic Principles 60  
           3.5.2 Standardized Scalable Speech Codecs 60  
              3.5.2.1 ITU-T G.729.1 61  
              3.5.2.2 ITU-T G.718 62  
        3.6 Packet-Loss Robust Speech Coding 65  
           3.6.1 Internet Low Bitrate Codec (iLBC) 67  
           3.6.2 Scalable Multi-Rate Speech Codec 68  
              3.6.2.1 Narrowband Codec 68  
              3.6.2.2 Wideband Codec 74  
        3.7 Conclusions 79  
        References 79  
     4 Recent Speech Coding Technologies and Standards 83  
        4.1 Recent Speech Codec Technologies and Features 84  
           4.1.1 Active Speech Source-Controlled Variable Bit Rate, Constant Bit Rate Operation and Voice Activity Detectors 84  
              4.1.1.1 Source-Controlled Variable Bit Rate (SC-VBR) Versus Constant/Fixed Bit Rate (CBR) Vocoders 85  
           4.1.2 Layered Coding 86  
           4.1.3 Bandwidth Extension of Speech 87  
              4.1.3.1 Harmonic Bandwidth Extension Architecture 88  
              4.1.3.2 Spectral Band Replication (SBR) 89  
           4.1.4 Blind Bandwidth Extension 90  
              4.1.4.1 High Band Model and Prediction Methods 91  
              4.1.4.2 BBE for Speech Coding 91  
              4.1.4.3 BBE for Bandwidth Increase 92  
              4.1.4.4 Quality Evaluation 92  
              4.1.4.5 Encoder Based BBE 93  
           4.1.5 Packet Loss Concealment 95  
              4.1.5.1 Code Excited Linear Prediction Coders 96  
              4.1.5.2 Adaptive Differential Pulse Code Modulation (ADPCM) Based Coders 97  
           4.1.6 Voice Over Internet Protocol (VoIP) 97  
              4.1.6.1 Management of Time Varying Delay 98  
              4.1.6.2 Packet Loss Concealment for VoIP 99  
        4.2 Recent Speech Coding Standards 102  
           4.2.1 Advanced Standards in ITU-T 102  
              4.2.1.1 G.729.1: Scalable Extension of G.729 103  
              4.2.1.2 G.718: Layered Coder with Interoperable Modes 104  
              4.2.1.3 Super-Wideband Extensions: G.729.1 Annex E and G.718 Annex B 104  
              4.2.1.4 G.711.1: Scalable Wideband Extension of G.711 105  
              4.2.1.5 Super-Wideband and Stereo Extensions of G.711.1 and G.722 105  
              4.2.1.6 Full-Band Coding in G.719 107  
              4.2.1.7 G.711.0 Lossless Coding 107  
              4.2.1.8 Packet Loss Concealment Algorithms for G.711 and G.722 108  
           4.2.2 IETF Codecs and Transport Protocols 108  
              4.2.2.1 Opus Codec 108  
                 Audio Bandwidths and Bit Rate Sweet Spots 109  
                 Variable and Constant Bit Rate Modes of Operation 109  
                 Mono and Stereo Coding 110  
                 Packet Loss Resilience 110  
                 Forward Error Correction (Low Bit Rate Redundancy) 110  
              4.2.2.2 RTP Payload Formats 110  
           4.2.3 3GPP and the Enhanced Voice Services (EVS) Codec 111  
           4.2.4 Recent Codec Development in 3GPP2 112  
           4.2.5 Conversational Codecs in MPEG 113  
        References 115  
  Part II Review and Challenges in Speech, Speaker and Emotion Recognition 118  
     5 Ensemble Learning Approaches in Speech Recognition 119  
        5.1 Introduction 119  
        5.2 Background of Ensemble Methods in Machine Learning 120  
           5.2.1 Ensemble Learning 120  
           5.2.2 Boosting 121  
           5.2.3 Bagging 122  
           5.2.4 Random Forest 122  
           5.2.5 Classifier Combination 123  
           5.2.6 Ensemble Error Analyses 124  
              5.2.6.1 Added Error of an Ensemble Classifier 124  
              5.2.6.2 Bias–Variance–Covariance Decomposition 124  
              5.2.6.3 Error-Ambiguity Decomposition 125  
           5.2.7 Diversity Measures 126  
           5.2.8 Ensemble Pruning 126  
           5.2.9 Ensemble Clustering 127  
        5.3 Background of Speech Recognition 127  
           5.3.1 State-of-the-Art Speech Recognition System Architecture 127  
           5.3.2 Front-End Processing 128  
           5.3.3 Lexicon 129  
           5.3.4 Acoustic Model 129  
           5.3.5 Language Model 130  
           5.3.6 Decoding Search 131  
        5.4 Generating and Combining Diversity in Speech Recognition 132  
           5.4.1 System Places for Generating Diversity 132  
              5.4.1.1 Front End Processing 132  
              5.4.1.2 Acoustic Model 132  
              5.4.1.3 Language Model 133  
           5.4.2 System Levels for Utilizing Diversity 133  
              5.4.2.1 Utterance Level Combination 134  
              5.4.2.2 Word Level Combination 134  
              5.4.2.3 Subword Level Combination 136  
              5.4.2.4 State Level Combination 136  
              5.4.2.5 Feature Level Combination 139  
        5.5 Ensemble Learning Techniques for Acoustic Modeling 139  
           5.5.1 Explicit Diversity Generation 140  
              5.5.1.1 Boosting 140  
              5.5.1.2 Minimum Bayes Risk Leveraging (MBRL) 143  
              5.5.1.3 Directed Decision Trees 144  
              5.5.1.4 Deep Stacking Network 144  
           5.5.2 Implicit Diversity Generation 145  
              5.5.2.1 Multiple Systems and Multiple Models 145  
              5.5.2.2 Random Forest 146  
              5.5.2.3 Data Sampling 147  
        5.6 Ensemble Learning Techniques for Language Modeling 148  
        5.7 Performance Enhancing Mechanism of Ensemble Learning 149  
           5.7.1 Classification Margin 149  
           5.7.2 Diversity 150  
           5.7.3 Bias and Variance 151  
        5.8 Compacting Ensemble Models to Improve Efficiency 152  
           5.8.1 Model Clustering 153  
           5.8.2 Density Matching 153  
        5.9 Conclusion 154  
        References 155  
     6 Deep Dynamic Models for Learning Hidden Representations of Speech Features 159  
        6.1 Introduction 160  
        6.2 Generative Deep-Structured Speech Dynamics: Model Formulation 161  
           6.2.1 Generative Learning in Speech Recognition 161  
           6.2.2 A Hidden Dynamic Model with Nonlinear Observation Equation 166  
           6.2.3 A Linear Hidden Dynamic Model Amenable to Variational EM Training 167  
        6.3 Generative Deep-Structured Speech Dynamics: Model Estimation 169  
           6.3.1 Learning a Hidden Dynamic Model Using the Extended Kalman Filter 169  
              6.3.1.1 E-Step 169  
              6.3.1.2 M-Step 170  
           6.3.2 Learning a Hidden Dynamic Model Using Variational EM 172  
              6.3.2.1 Model Inference and Learning 172  
              6.3.2.2 The GMM Posterior 172  
              6.3.2.3 The HMM Posterior 173  
        6.4 Discriminative Deep Neural Networks Aided by Generative Pre-training 175  
           6.4.1 Restricted Boltzmann Machines 176  
           6.4.2 Stacking Up RBMs to Form a DBN 178  
           6.4.3 Interfacing the DNN with an HMM to Incorporate Sequential Dynamics 180  
        6.5 Recurrent Neural Networks for Discriminative Modeling of Speech Dynamics 181  
           6.5.1 RNNs Expressed in the State-Space Formalism 182  
           6.5.2 The BPTT Learning Algorithm 183  
           6.5.3 The EKF Learning Algorithm 186  
        6.6 Comparing Two Types of Dynamic Models 187  
           6.6.1 Top-Down Versus Bottom-Up 187  
              6.6.1.1 Top-Down Generative Hidden Dynamic Modeling 187  
              6.6.1.2 Bottom-Up Discriminative Recurrent Neural Networks and the “Generative” Counterpart 188  
           6.6.2 Localist Versus Distributed Representations 190  
           6.6.3 Latent Explanatory Variables Versus End-to-End Discriminative Learning 191  
           6.6.4 Parsimonious Versus Massive Parameters 192  
           6.6.5 Comparing Recognition Accuracy of the Two Types of Models 194  
        6.7 Summary and Discussions on Future Directions 194  
        References 196  
     7 Speech Based Emotion Recognition 202  
        7.1 Introduction 202  
           7.1.1 What Are Emotions? 203  
           7.1.2 Emotion Labels 205  
           7.1.3 The Emotion Recognition Task 207  
        7.2 Emotion Classification Systems 208  
           7.2.1 Short-Term Features 209  
              7.2.1.1 Pitch 209  
              7.2.1.2 Loudness/Energy 209  
              7.2.1.3 Spectral Features 210  
              7.2.1.4 Cepstral Features 210  
           7.2.2 High Dimensional Representation 210  
              7.2.2.1 Functional Approach to a High-Dimensional Representation 211  
              7.2.2.2 GMM Supervector Approach to High-Dimensional Representation 212  
           7.2.3 Modelling Emotions 213  
              7.2.3.1 Emotion Models: Linear Support Vector Machines 214  
              7.2.3.2 Emotion Models: Nonlinear Support Vector Machines 215  
           7.2.4 Alternative Emotion Modelling Methodologies 216  
              7.2.4.1 Supra-Frame Level Feature 217  
              7.2.4.2 Dynamic Emotion Models 218  
        7.3 Dealing with Variability 219  
           7.3.1 Phonetic Variability in Emotion Recognition Systems 219  
           7.3.2 Speaker Variability 221  
              7.3.2.1 Speaker Normalisation 222  
              7.3.2.2 Speaker Adaptation 223  
        7.4 Comparing Systems 224  
        7.5 Conclusions 226  
        References 228  
     8 Speaker Diarization: An Emerging Research 234  
        8.1 Overview 234  
        8.2 Signal Processing 235  
           8.2.1 Wiener Filtering 236  
           8.2.2 Acoustic Beamforming 236  
        8.3 Feature Extraction 237  
           8.3.1 Acoustic Features 237  
              8.3.1.1 Short-Term Spectral Features 238  
              8.3.1.2 Prosodic Features 239  
           8.3.2 Sound Source Features 239  
           8.3.3 Feature Normalization Techniques 241  
              8.3.3.1 RASTA Filtering 241  
              8.3.3.2 Cepstral Mean Normalization 242  
              8.3.3.3 Feature Warping 242  
        8.4 Speech Activity Detection 242  
           8.4.1 Energy-Based Speech Detection 243  
           8.4.2 Model Based Speech Detection 243  
           8.4.3 Hybrid Speech Detection 243  
           8.4.4 Multi-Channel Speech Detection 245  
        8.5 Clustering Architecture 245  
           8.5.1 Speaker Modeling 247  
              8.5.1.1 Gaussian Mixture Model 247  
              8.5.1.2 Hidden Markov Model 248  
              8.5.1.3 Total Factor Vector 249  
              8.5.1.4 Other Modeling Approaches 250  
           8.5.2 Distance Measures 251  
              8.5.2.1 Symmetric Kullback-Leibler Distance 252  
              8.5.2.2 Divergence Shape Distance 253  
              8.5.2.3 Arithmetic Harmonic Sphericity 253  
              8.5.2.4 Generalized Likelihood Ratio 253  
              8.5.2.5 Bayesian Information Criterion 254  
              8.5.2.6 Cross Likelihood Ratio 256  
              8.5.2.7 Normalized Cross Likelihood Ratio 256  
              8.5.2.8 Other Distance Measures 256  
           8.5.3 Speaker Segmentation 257  
              8.5.3.1 Silence Detection Based Methods 257  
              8.5.3.2 Metric-Based Segmentation 258  
              8.5.3.3 Hybrid Segmentation 260  
              8.5.3.4 Segmentation Evaluation 260  
           8.5.4 Speaker Clustering 261  
              8.5.4.1 Agglomerative Hierarchical Clustering 261  
              8.5.4.2 Divisive Hierarchical Clustering 266  
              8.5.4.3 Other Approaches 267  
              8.5.4.4 Multiple Systems Combination 268  
           8.5.5 Online Speaker Clustering 268  
              Segmentation 268  
              Novelty Detection 269  
              Speaker Modeling 270  
              8.5.5.1 Speaker Clustering Evaluation 270  
        8.6 Speaker Diarization Evaluation 272  
        8.7 Databases for Speaker Diarization in Meeting 272  
        8.8 Related Projects in Meeting Room 273  
        8.9 NIST Rich Transcription Benchmarks 273  
        8.10 Summary 274  
        References 275  
  Part III Current Trends in Speech Enhancement 283  
     9 Maximum A Posteriori Spectral Estimation with Source Log-Spectral Priors for Multichannel Speech Enhancement 284  
        9.1 Introduction 285  
        9.2 Signal Representation and Modeling for Multichannel Speech Enhancement 287  
           9.2.1 General Speech Capture Scenario for Multichannel Speech Enhancement 287  
           9.2.2 Time-Frequency Domain Representation of Signals 289  
           9.2.3 Generative Model of Desired Signals 290  
           9.2.4 Generative Model of Interference 292  
        9.3 Speech Enhancement Based on Maximum Likelihood Spectral Estimation (MLSE) 293  
           9.3.1 Maximum Likelihood Spectral Estimation (MLSE) 293  
           9.3.2 Processing Flow of MLSE Based Speech Enhancement 294  
        9.4 Speech Enhancement Based on Maximum A Posteriori Spectral Estimation (MAPSE) 295  
           9.4.1 Maximum A Posteriori Spectral Estimation (MAPSE) 296  
           9.4.2 Log-Spectral Prior of Speech 297  
           9.4.3 Expectation Maximization (EM) Algorithm 299  
           9.4.4 Update of n,f Based on Newton–Raphson Method 301  
           9.4.5 Processing Flow 302  
        9.5 Application to Blind Source Separation (BSS) 303  
           9.5.1 MLSE for BSS (ML-BSS) 303  
              9.5.1.1 Generative Models for ML-BSS 304  
              9.5.1.2 MLSE Based on EM Algorithm 305  
              9.5.1.3 Processing Flow of ML-BSS Based on EM Algorithm 307  
           9.5.2 MAPSE for BSS (MAP-BSS) 308  
              9.5.2.1 Generative Models for MAP-BSS 308  
              9.5.2.2 MAPSE Based on EM Algorithm 309  
              9.5.2.3 Processing Flow of MAP-BSS Based on EM Algorithm 311  
              9.5.2.4 Initialization of and (or ) 312  
        9.6 Experiments 313  
           9.6.1 Evaluation 1 with Aurora-2 Speech Database 313  
           9.6.2 Evaluation 2 with SiSEC Database 316  
        9.7 Concluding Remarks 318  
        References 318  
     10 Modulation Processing for Speech Enhancement 321  
        10.1 Introduction 322  
        10.2 Methods 324  
           10.2.1 Modulation AMS-Based Framework 324  
           10.2.2 Modulation Spectral Subtraction 327  
           10.2.3 MMSE Modulation Magnitude Estimation 330  
              10.2.3.1 MMSE Modulation Magnitude Estimation with SPU 333  
              10.2.3.2 MMSE Log-Modulation Magnitude Estimation 333  
              10.2.3.3 MME Parameters 334  
        10.3 Speech Quality Assessment 334  
        10.4 Evaluation of Short-Time Modulation-Domain Based Methods with Respect to Quality 335  
        10.5 Conclusion 342  
        References 344  

