New Era for Robust Speech Recognition - Exploiting Deep Learning
by: Shinji Watanabe, Marc Delcroix, Florian Metze, John R. Hershey
Springer-Verlag, 2017
ISBN: 9783319646800
433 pages, Download: 9135 KB
 
Format:  PDF
suitable for: Apple iPad, Android tablet PCs, online reading, PC, Mac, laptop

Type: B (parallel access)

 

 
Table of Contents

  Preface 6  
  Acknowledgments 7  
  Contents 8  
  Acronyms 11  
  Part I Introduction 14  
     1 Preliminaries 15  
        1.1 Introduction 15  
           1.1.1 Motivation 15  
           1.1.2 Before the Deep Learning Era 16  
              1.1.2.1 Feature Space Approaches 17  
              1.1.2.2 Model Space Approaches 18  
        1.2 Basic Formulation and Notations 18  
           1.2.1 General Notations (Tables 1.1 and 1.2) 19  
           1.2.2 Matrix and Vector Operations (Table 1.3) 20  
           1.2.3 Probability Distribution Functions (Table 1.4) 20  
              1.2.3.1 Expectation 21  
              1.2.3.2 Kullback–Leibler Divergence 21  
           1.2.4 Signal Processing 22  
           1.2.5 Automatic Speech Recognition 23  
           1.2.6 Hidden Markov Model 24  
           1.2.7 Gaussian Mixture Model 25  
           1.2.8 Neural Network 26  
        1.3 Book Organization 27  
        References 28  
  Part II Approaches to Robust Automatic Speech Recognition 30  
     2 Multichannel Speech Enhancement Approaches to DNN-Based Far-Field Speech Recognition 31  
        2.1 Introduction 31  
           2.1.1 Categories of Speech Enhancement 32  
           2.1.2 Problem Formulation 32  
        2.2 Dereverberation 34  
           2.2.1 Problem Description 34  
           2.2.2 Overview of Existing Dereverberation Approaches 36  
           2.2.3 Linear-Prediction-Based Dereverberation 37  
        2.3 Beamforming 39  
           2.3.1 Types of Beamformers 40  
              2.3.1.1 Delay-and-Sum Beamformer 40  
              2.3.1.2 Minimum Variance Distortionless Response Beamformer 42  
              2.3.1.3 Max-SNR Beamformer 43  
              2.3.1.4 Multichannel Wiener Filter 44  
           2.3.2 Parameter Estimation 45  
              2.3.2.1 TDOA Estimation 46  
              2.3.2.2 Steering-Vector Estimation 47  
              2.3.2.3 Time–Frequency-Masking-Based Spatial Correlation Matrix Estimation 48  
        2.4 Examples of Robust Front Ends 52  
           2.4.1 A Reverberation-Robust ASR System 53  
              2.4.1.1 Experimental Settings 53  
              2.4.1.2 Experimental Results 53  
           2.4.2 Robust ASR System for Mobile Devices 55  
              2.4.2.1 Experimental Settings 55  
              2.4.2.2 Experimental Results 56  
        2.5 Concluding Remarks and Discussion 56  
        References 57  
     3 Multichannel Spatial Clustering Using Model-Based Source Separation 60  
        3.1 Introduction 60  
        3.2 Multichannel Speech Signals 61  
           3.2.1 Binaural Cues Used by Human Listeners 62  
           3.2.2 Parameters for More than Two Channels 64  
        3.3 Spatial-Clustering Approaches 66  
           3.3.1 Binwise Clustering and Alignment 67  
              3.3.1.1 Cross-Frequency Source Alignment 68  
           3.3.2 Fuzzy c-Means Clustering of Direction of Arrival 69  
           3.3.3 Binaural Model-Based EM Source Separation and Localization (MESSL) 70  
           3.3.4 Multichannel MESSL 71  
        3.4 Mask-Smoothing Approaches 73  
           3.4.1 Fuzzy Clustering with Context Information 73  
           3.4.2 MESSL in a Markov Random Field 74  
              3.4.2.1 Pairwise Markov Random Fields 74  
              3.4.2.2 MESSL-MRF 75  
        3.5 Driving Beamforming from Spatial Clustering 76  
        3.6 Automatic Speech Recognition Experiments 78  
           3.6.1 Results 79  
           3.6.2 Example Separations 81  
        3.7 Conclusion 83  
        References 83  
     4 Discriminative Beamforming with Phase-Aware Neural Networks for Speech Enhancement and Recognition 87  
        4.1 Introduction 88  
        4.2 Beamforming for ASR 88  
           4.2.1 Geometric Beamforming 89  
           4.2.2 Statistical Methods 91  
           4.2.3 Learning-Based Methods 92  
              4.2.3.1 Maximum Likelihood Approach 92  
              4.2.3.2 Neural Network Approaches with Multichannel Inputs 93  
              4.2.3.3 Neural Networks for Better Spatial-Statistics Estimation 94  
        4.3 Beamforming Networks 95  
           4.3.1 Motivation 95  
           4.3.2 System Overview 95  
           4.3.3 Predicting Beamforming Weights by DNN 97  
              4.3.3.1 Extraction of GCC Features 98  
              4.3.3.2 Beamforming Weight Vector 100  
           4.3.4 Extraction of Log Mel Filterbanks 100  
           4.3.5 Training Procedure 102  
        4.4 Experiments 103  
           4.4.1 Settings 103  
              4.4.1.1 Corpus 103  
              4.4.1.2 Network Configurations 104  
           4.4.2 Beam Patterns 104  
           4.4.3 Speech Enhancement Results 107  
           4.4.4 Speech Recognition Results 107  
        4.5 Summary and Future Directions 109  
        References 110  
     5 Raw Multichannel Processing Using Deep Neural Networks 113  
        5.1 Introduction 114  
        5.2 Experimental Details 116  
           5.2.1 Data 116  
           5.2.2 Baseline Acoustic Model 117  
        5.3 Multichannel Raw-Waveform Neural Network 118  
           5.3.1 Motivation 118  
           5.3.2 Multichannel Filtering in the Time Domain 119  
           5.3.3 Filterbank Spatial Diversity 120  
           5.3.4 Comparison to Log Mel 123  
           5.3.5 Comparison to Oracle Knowledge of Speech TDOA 124  
           5.3.6 Summary 125  
        5.4 Factoring Spatial and Spectral Selectivity 125  
           5.4.1 Architecture 125  
           5.4.2 Number of Spatial Filters 127  
           5.4.3 Filter Analysis 127  
           5.4.4 Results Summary 129  
        5.5 Adaptive Beamforming 129  
           5.5.1 NAB Model 129  
              5.5.1.1 Adaptive Filters 130  
              5.5.1.2 Gated Feedback 131  
              5.5.1.3 Regularization with MTL 132  
           5.5.2 NAB Filter Analysis 132  
           5.5.3 Results Summary 133  
        5.6 Filtering in the Frequency Domain 134  
           5.6.1 Factored Model 134  
              5.6.1.1 Spatial Filtering 134  
              5.6.1.2 Spectral Filtering: Complex Linear Projection 134  
           5.6.2 NAB Model 135  
           5.6.3 Results: Factored Model 135  
              5.6.3.1 Performance 135  
              5.6.3.2 Comparison Between Learning in Time vs. Frequency 136  
           5.6.4 Results: Adaptive Model 138  
        5.7 Final Comparison, Rerecorded Data 138  
        5.8 Conclusions and Future Work 139  
        References 139  
     6 Novel Deep Architectures in Speech Processing 142  
        6.1 Introduction 143  
           6.1.1 Relationship to the Literature 144  
        6.2 General Formulation of Deep Unfolding 145  
        6.3 Unfolding Markov Random Fields 147  
           6.3.1 Mean-Field Inference 148  
           6.3.2 Belief Propagation 150  
        6.4 Deep Nonnegative Matrix Factorization 152  
        6.5 Multichannel Deep Unfolding 155  
           6.5.1 Source Separation Using Multichannel Gaussian Mixture Model 156  
           6.5.2 Unfolding the Multichannel Gaussian Mixture Model 158  
           6.5.3 MRF Extension of the MCGMM 159  
           6.5.4 Experiments and Discussion 161  
        6.6 End-to-End Deep Clustering 163  
           6.6.1 Deep-Clustering Model 164  
           6.6.2 Optimizing Signal Reconstruction 165  
           6.6.3 End-to-End Training 166  
           6.6.4 Experiments 167  
              6.6.4.1 ASR Performance 167  
        6.7 Conclusion 168  
        References 168  
     7 Deep Recurrent Networks for Separation and Recognition of Single-Channel Speech in Nonstationary Background Audio 172  
        7.1 Introduction 172  
        7.2 Problem Description 173  
        7.3 Learning-Free Methods 175  
        7.4 Nonnegative Matrix Factorization 176  
        7.5 Deep Learning for Source Separation 177  
           7.5.1 Recurrent and Long Short-Term Memory Networks 178  
           7.5.2 Mask Versus Signal Prediction 179  
              7.5.2.1 Ideal Masks and Phase-Sensitive Mask 179  
              7.5.2.2 Evaluating Ideal Masks 180  
           7.5.3 Loss Functions and Inputs 181  
           7.5.4 Phase-Sensitive Approximation Loss Function 182  
           7.5.5 Inputs to the Network 183  
              7.5.5.1 Spectral Features 183  
              7.5.5.2 Speech-State Information 183  
              7.5.5.3 Enhanced Features 184  
        7.6 Experiments and Results 185  
           7.6.1 Neural Network Training 185  
           7.6.2 Results on CHiME-2 186  
           7.6.3 Discussion of Results 191  
        7.7 Conclusion 191  
        References 191  
     8 Robust Features in Deep-Learning-Based Speech Recognition 194  
        8.1 Introduction 195  
        8.2 Background 197  
        8.3 Approaches 198  
           8.3.1 Speech Enhancement 199  
           8.3.2 Signal-Theoretic Techniques 200  
           8.3.3 Perceptually Motivated Features 200  
              8.3.3.1 TempoRAl PatternS (TRAPS) 202  
              8.3.3.2 Frequency-Domain Linear Prediction (FDLP) 203  
              8.3.3.3 Power-Normalized Cepstral Coefficients (PNCC) 204  
              8.3.3.4 Modulation Spectrum Features 204  
              8.3.3.5 Normalized Modulation Coefficient (NMC) 205  
              8.3.3.6 Modulation of Medium Duration Speech Amplitudes (MMeDuSA) 207  
              8.3.3.7 Two Dimensional Modulation Extraction: Gabor Features 209  
              8.3.3.8 Damped Oscillator Coefficient (DOC) 210  
           8.3.4 Current Trends 212  
        8.4 Case Studies 214  
           8.4.1 Speech Processing for Noise- and Channel-Degraded Audio 214  
           8.4.2 Speech Processing Under Reverberated Conditions 215  
        8.5 Conclusion 217  
        References 218  
     9 Adaptation of Deep Neural Network Acoustic Models for Robust Automatic Speech Recognition 225  
        9.1 Introduction 225  
           9.1.1 DNN Adaptation Strategies 226  
              9.1.1.1 Test-Time Adaptation 227  
              9.1.1.2 Attribute-Aware Training 227  
              9.1.1.3 Adaptive Training 227  
           9.1.2 Overview of DNN Adaptation Methods 228  
              9.1.2.1 Constrained Adaptation 228  
              9.1.2.2 Feature Normalisation 228  
              9.1.2.3 Feature Augmentation 229  
              9.1.2.4 Structured DNN Parameterisation 229  
           9.1.3 Chapter Organisation 229  
        9.2 Feature Augmentation 230  
           9.2.1 Speaker-Aware Training 231  
           9.2.2 Noise-Aware Training 232  
           9.2.3 Room-Aware Training 233  
           9.2.4 Multiattribute-Aware Training 234  
           9.2.5 Refinement of Augmented Features 236  
        9.3 Structured DNN Parameterisation 237  
           9.3.1 Structured Bias Vectors 237  
           9.3.2 Structured Linear Transformation Adaptation 238  
           9.3.3 Learning Hidden Unit Contribution 239  
           9.3.4 SVD-Based Structure 239  
           9.3.5 Factorised Hidden Layer Adaptation 240  
           9.3.6 Cluster Adaptive Training for DNNs 241  
        9.4 Summary and Future Directions 243  
        References 244  
     10 Training Data Augmentation and Data Selection 250  
        10.1 Introduction 250  
           10.1.1 Data Augmentation in the Literature 251  
           10.1.2 Complementary Approaches 252  
        10.2 Data Augmentation in Mismatched Environments 253  
           10.2.1 Data Generation 253  
           10.2.2 Speech Enhancement 254  
              10.2.2.1 WPE-Based Dereverberation 254  
              10.2.2.2 Denoising Autoencoder 255  
           10.2.3 Results with Speech Enhancement on Test Data 255  
           10.2.4 Results with Training Data Augmentation 256  
        10.3 Data Selection 257  
           10.3.1 Introduction 257  
           10.3.2 Sequence-Summarizing Neural Network 258  
           10.3.3 Configuration of the Neural Network 260  
           10.3.4 Properties of the Extracted Vectors 261  
           10.3.5 Results with Data Selection 262  
        10.4 Conclusions 263  
        References 263  
     11 Advanced Recurrent Neural Networks for Automatic Speech Recognition 266  
        11.1 Introduction 266  
        11.2 Basic Deep Long Short-Term Memory RNNs 267  
           11.2.1 Long Short-Term Memory RNNs 267  
           11.2.2 Deep LSTM RNNs 268  
        11.3 Prediction–Adaptation–Correction Recurrent Neural Networks 268  
        11.4 Deep Long Short-Term Memory RNN Extensions 270  
           11.4.1 Highway RNNs 270  
           11.4.2 Bidirectional Highway LSTM RNNs 272  
           11.4.3 Latency-Controlled Bidirectional Highway LSTM RNNs 272  
           11.4.4 Grid LSTM RNNs 274  
           11.4.5 Residual LSTM RNNs 275  
        11.5 Experiment Setup 275  
           11.5.1 Corpus 275  
              11.5.1.1 IARPA-Babel Corpus 275  
              11.5.1.2 AMI Meeting Corpus 275  
           11.5.2 System Description 276  
        11.6 Evaluation 277  
           11.6.1 PAC-RNN 277  
              11.6.1.1 Low-Resource Language 277  
              11.6.1.2 Distant Speech Recognition 278  
           11.6.2 Highway LSTMP 279  
              11.6.2.1 Three-Layer Highway (B)LSTMP 279  
              11.6.2.2 Highway (B)LSTMP with Dropout 279  
              11.6.2.3 Deeper Highway LSTMP 280  
              11.6.2.4 Grid LSTMP 280  
              11.6.2.5 Residual LSTMP 281  
              11.6.2.6 Summary of Results 281  
        11.7 Conclusion 282  
        References 283  
     12 Sequence-Discriminative Training of Neural Networks 285  
        12.1 Introduction 285  
        12.2 Training Criteria 287  
           12.2.1 Maximum Mutual Information 287  
           12.2.2 Boosted Maximum Mutual Information 288  
           12.2.3 Minimum Phone Error/State-Level Minimum Bayes Risk 289  
        12.3 Practical Training Strategy 290  
           12.3.1 Criterion Selection 290  
           12.3.2 Frame-Smoothing 291  
           12.3.3 Lattice Generation 292  
              12.3.3.1 Numerator Lattice 292  
              12.3.3.2 Denominator Lattice 293  
        12.4 Two-Forward-Pass Method for Sequence Training 294  
        12.5 Experiment Setup 295  
           12.5.1 Corpus 296  
           12.5.2 System Description 296  
        12.6 Evaluation 297  
           12.6.1 Practical Strategy 297  
           12.6.2 Two-Forward-Pass Method 297  
              12.6.2.1 Speed 298  
              12.6.2.2 Performance 298  
        12.7 Conclusion 299  
        References 300  
     13 End-to-End Architectures for Speech Recognition 302  
        13.1 Introduction 302  
           13.1.1 Complexity and Suboptimality of the Conventional ASR Pipeline 303  
           13.1.2 Simplification of the Conventional ASR Pipeline 305  
           13.1.3 End-to-End Learning 306  
        13.2 End-to-End ASR Architectures 306  
           13.2.1 Connectionist Temporal Classification 307  
           13.2.2 Encoder–Decoder Paradigm 307  
           13.2.3 Learning the Front End 309  
           13.2.4 Other Ideas 310  
        13.3 The EESEN Framework 310  
           13.3.1 Model Structure 311  
           13.3.2 Model Training 312  
           13.3.3 Decoding 314  
              13.3.3.1 Grammar 315  
              13.3.3.2 Lexicon 315  
              13.3.3.3 Token 316  
              13.3.3.4 Search Graph 316  
           13.3.4 Experiments and Analysis 317  
              13.3.4.1 Wall Street Journal 317  
              13.3.4.2 Switchboard 319  
              13.3.4.3 HKUST Mandarin Chinese 320  
        13.4 Summary and Future Directions 321  
        References 322  
  Part III Resources 327  
     14 The CHiME Challenges: Robust Speech Recognition in Everyday Environments 328  
        14.1 Introduction 328  
        14.2 The 1st and 2nd CHiME Challenges (CHiME-1 and CHiME-2) 329  
           14.2.1 Domestic Noise Background 330  
           14.2.2 The Speech Recognition Task Design 330  
              14.2.2.1 CHiME-1: Small Vocabulary 331  
              14.2.2.2 CHiME-2 Track 1: Simulated Motion 331  
              14.2.2.3 CHiME-2 Track 2: Medium Vocabulary 332  
           14.2.3 Overview of System Performance 332  
           14.2.4 Interim Conclusions 333  
        14.3 The 3rd CHiME Challenge (CHiME-3) 334  
           14.3.1 The Mobile Tablet Recordings 334  
           14.3.2 The CHiME-3 Task Design: Real and Simulated Data 335  
           14.3.3 The CHiME-3 Baseline Systems 336  
              14.3.3.1 Simulation 336  
              14.3.3.2 Enhancement 336  
              14.3.3.3 ASR 337  
        14.4 The CHiME-3 Evaluations 337  
           14.4.1 An Overview of CHiME-3 System Performance 338  
           14.4.2 An Overview of Successful Strategies 338  
              14.4.2.1 Strategies for Improved Signal Enhancement 339  
              14.4.2.2 Strategies for Improved Statistical Modelling 339  
              14.4.2.3 Strategies for Improved System Training 340  
           14.4.3 Key Findings 340  
        14.5 Future Directions: CHiME-4 and Beyond 341  
        References 343  
     15 The REVERB Challenge: A Benchmark Task for Reverberation-Robust ASR Techniques 346  
        15.1 Introduction 347  
        15.2 Challenge Scenarios, Data, and Regulations 348  
           15.2.1 Scenarios Assumed in the Challenge 348  
           15.2.2 Data 348  
              15.2.2.1 Test Data: Dev and Eval Test Sets 348  
              15.2.2.2 Training Data 350  
           15.2.3 Regulations 350  
        15.3 Performance of Baseline and Top-Performing Systems 351  
           15.3.1 Benchmark Results with GMM-HMM and DNN-HMM Systems 351  
           15.3.2 Top-Performing 1-ch and 8-ch Systems 352  
           15.3.3 Current State-of-the-Art Performance 353  
        15.4 Summary and Remaining Challenges for Reverberant Speech Recognition 354  
        References 354  
     16 Distant Speech Recognition Experiments Using the AMI Corpus 356  
        16.1 Introduction 356  
        16.2 Meeting Corpora 357  
        16.3 Baseline Speech Recognition Experiments 359  
        16.4 Channel Concatenation Experiments 362  
        16.5 Convolutional Neural Networks 363  
           16.5.1 SDM Recordings 365  
           16.5.2 MDM Recordings 365  
           16.5.3 IHM Recordings 366  
        16.6 Discussion and Conclusions 367  
        References 367  
     17 Toolkits for Robust Speech Processing 370  
        17.1 Introduction 370  
        17.2 General Speech Recognition Toolkits 371  
        17.3 Language Model Toolkits 373  
        17.4 Speech Enhancement Toolkits 375  
        17.5 Deep Learning Toolkits 376  
        17.6 End-to-End Speech Recognition Toolkits 378  
        17.7 Other Resources for Speech Technology 380  
        17.8 Conclusion 380  
        References 381  
  Part IV Applications 384  
     18 Speech Research at Google to Enable Universal Speech Interfaces 385  
        18.1 Early Development 385  
        18.2 Voice Search 387  
        18.3 Text to Speech 387  
        18.4 Dictation/IME/Transcription 388  
        18.5 Internationalization 389  
        18.6 Neural-Network-Based Acoustic Modeling 391  
        18.7 Adaptive Language Modeling 392  
        18.8 Mobile-Device-Specific Technology 393  
        18.9 Robustness 395  
        References 396  
     19 Challenges in and Solutions to Deep Learning Network Acoustic Modeling in Speech Recognition Products at Microsoft 400  
        19.1 Introduction 401  
        19.2 Effective and Efficient DL Modeling 401  
           19.2.1 Reducing Run-Time Cost with SVD-Based Training 402  
           19.2.2 Speaker Adaptation on Small Amount of Parameters 402  
              19.2.2.1 SVD Bottleneck Adaptation 403  
              19.2.2.2 DNN Adaptation Through Activation Function 404  
              19.2.2.3 Low-Rank Plus Diagonal (LRPD) Adaptation 404  
           19.2.3 Improving the Accuracy of Small-Size DNNs with Teacher–Student Training 405  
        19.3 Invariance Modeling 406  
           19.3.1 Improving the Robustness to Accent/Dialect with Model Adaptation 406  
           19.3.2 Improving the Robustness to Acoustic Environment with Variable-Component DNN Modeling 408  
           19.3.3 Improving the Time and Frequency Invariance with Time–Frequency Long Short-Term Memory RNNs 409  
           19.3.4 Exploring the Generalization Capability to Unseen Data with Maximum Margin Sequence Training 409  
        19.4 Effective Training-Data Usage 411  
           19.4.1 Use of Unsupervised Data to Improve SR Accuracy 411  
           19.4.2 Expanded Language Capability by Reusing Speech-Training Material Across Languages 412  
        19.5 Conclusion 413  
        References 414  
     20 Advanced ASR Technologies for Mitsubishi Electric Speech Applications 417  
        20.1 Introduction 417  
        20.2 ASR for Car Navigation Systems 418  
           20.2.1 Introduction 418  
           20.2.2 ASR and Postprocessing Technologies 418  
              20.2.2.1 ASR Using Statistical LM 418  
              20.2.2.2 POI Name Search Using High-Speed Text Search Technique 419  
              20.2.2.3 Application to Commercial Car Navigation System 420  
        20.3 Dereverberation for Hands-Free Elevator 420  
           20.3.1 Introduction 420  
           20.3.2 A Dereverberation Method Using SS 421  
           20.3.3 Experiments 422  
        20.4 Discriminative Methods 423  
           20.4.1 Introduction 423  
           20.4.2 Discriminative Training for AMs 424  
           20.4.3 Discriminative Training for RNN-LM 425  
        20.5 Conclusion 426  
        References 427  
  Index 428  

