ELEC-E5510 - Speech Recognition D, 28.10.2020-11.12.2020
Section description
-
Good reading material for the ASR course:
X. D. Huang: Spoken Language Processing: A Guide to Theory, Algorithm, and System Development. Prentice-Hall, 2001
The HTK Book: http://htk.eng.cam.ac.uk/docs/docs.shtml
M. Gales and S. Young: The Application of Hidden Markov Models in Speech Recognition. In Foundations and Trends in Signal Processing Vol. 1, No. 3 (2007) 195-304. http://dx.doi.org/10.1561/2000000004
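All three of these references center on hidden Markov models. As a toy companion to the reading, here is a minimal Viterbi decoder for a discrete-observation HMM. The model, its states, and all probabilities below are invented for illustration (they follow the classic textbook weather example, not any specific reference above):

```python
# Toy Viterbi decoding for a discrete-observation HMM.
# All model numbers are invented for illustration only.

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most likely state sequence for an observation sequence."""
    # V[t][s] = probability of the best path ending in state s at time t
    V = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]
    back = [{}]  # backpointers: back[t][s] = best predecessor of s at time t
    for t in range(1, len(obs)):
        V.append({})
        back.append({})
        for s in states:
            # Best previous state and the probability of reaching s through it
            prev, p = max(
                ((ps, V[t - 1][ps] * trans_p[ps][s]) for ps in states),
                key=lambda x: x[1],
            )
            V[t][s] = p * emit_p[s][obs[t]]
            back[t][s] = prev
    # Trace back from the best final state
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))

states = ("Rainy", "Sunny")
start_p = {"Rainy": 0.6, "Sunny": 0.4}
trans_p = {"Rainy": {"Rainy": 0.7, "Sunny": 0.3},
           "Sunny": {"Rainy": 0.4, "Sunny": 0.6}}
emit_p = {"Rainy": {"walk": 0.1, "shop": 0.4, "clean": 0.5},
          "Sunny": {"walk": 0.6, "shop": 0.3, "clean": 0.1}}

print(viterbi(("walk", "shop", "clean"), states, start_p, trans_p, emit_p))
# prints ['Sunny', 'Rainy', 'Rainy']
```

The Forward algorithm (the other half of Forward-Backward) has the same recursion with the max over previous states replaced by a sum.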
Additional reading for some special topics in ASR:
Speaker recognition
Campbell, J. P., Jr.: Speaker recognition: a tutorial. Proceedings of the IEEE, vol. 85, no. 9, Sept. 1997, pp. 1437-1462. DOI 10.1109/5.628714
Sadaoki Furui: Recent advances in speaker recognition. Pattern Recognition Letters 18(9): 859-872 (1997)
http://cslu.cse.ogi.edu/HLTsurvey/ch1node9.html

Speaker adaptation
Gales, M. J., Young, S. J. The Application of Hidden Markov Models in Speech Recognition. Foundations and Trends in Signal Processing, 2007, vol. 1, no. 3, pp. 241-255.
http://dx.doi.org/10.1561/2000000004
Woodland, P. C. Speaker Adaptation for Continuous Density HMMs: A Review. ITRW on Adaptation Methods for Speech Automatic Recognition, 2001, pp. 11-18.
http://www.isca-speech.org/archive/adaptation/adap_011.html
Wang Zhirong, Schultz Tanja, Waibel Alex, Comparison of Acoustic Model Adaptation Techniques on Non-native Speech. Acoustics, Speech, and Signal Processing, 2003. Proceedings. (ICASSP '03). 2003 IEEE International Conference.
Koichi Shinoda: "Speaker Adaptation Techniques for Automatic Speech Recognition". APSIPA ASC 2011, Xi'an. Tokyo Institute of Technology, Tokyo, Japan.

Speech Adaptation
X. Zhu, G. T. Beauregard, and L. L. Wyse, "Real-time signal estimation from modified short-time Fourier transform magnitude spectra," IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 5, pp. 1645-1653, July 2007.
H. K. Kathania, W. Ahmad, S. Shahnawazuddin, and A. B. Samaddar, "Explicit pitch mapping for improved children's speech recognition," Circuits, Systems, and Signal Processing, September 2017.
H. K. Kathania, S. Shahnawazuddin, W. Ahmad, N. Adiga, S. K. Jana, and A. B. Samaddar, "Improving children's speech recognition through time scale modification based speaking rate adaptation," in 2018 International Conference on Signal Processing and Communications (SPCOM), July 2018.

Audio indexing
Steve Renals, Dave Abberley, David Kirby, Tony Robinson, "Indexing and retrieval of broadcast news", Speech Communication, Volume 32, Issues 1-2, September 2000, pp. 5-20.
John S. Garofolo, Cedric G. P. Auzanne, Ellen M. Voorhees, "The TREC Spoken Document Retrieval Track: A Success Story", in 8th Text Retrieval Conference, pages 107-129, Washington, 2000.
Chelba, C.; Hazen, T. J.; Saraclar, M., "Retrieval and browsing of spoken content", IEEE Signal Processing Magazine, vol. 25, no. 3, pp. 39-49, May 2008.
http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=4490200&isnumber=4490183

DNNs for acoustic modeling
- Kaldi tutorial: Kaldi is currently the most popular ASR toolkit in research. We recommend using it for this project. You don't have to complete the tutorial, but reading it can help you understand some basics.
- Recent Progresses in Deep Learning based Acoustic Models: A very recent survey on acoustic modeling for ASR
- Deep Neural Networks for Acoustic Modeling in Speech Recognition: An older survey on acoustic modeling for ASR
- Purely sequence-trained neural networks for ASR based on lattice-free MMI: A popular sequence discriminative criterion

Pytorch-Kaldi for Acoustic Modeling on Finnish Speech
Suggested Reading:
- THE PYTORCH-KALDI SPEECH RECOGNITION TOOLKIT: The toolkit to be used in the project
- Recent Progresses in Deep Learning based Acoustic Models: A very recent survey on acoustic modeling for ASR
- Deep Neural Networks for Acoustic Modeling in Speech Recognition: An older survey on acoustic modeling for ASR

Language Modeling for Indian Languages
Suggested Reading:
- BUT system for low resource Indian language ASR: This paper uses the same dataset as used in the project.
- Recurrent neural network based language model: A very popular contemporary neural network language model
- IMPROVED BACKING-OFF FOR M-GRAM LANGUAGE MODELING: A popular n-gram language modeling technique
- LSTM Neural Networks for Language Modeling: Another popular contemporary neural network language model
- Exploring the Limits of Language Modeling: A survey on existing language modeling techniques
- TheanoLM - An extensible toolkit for neural network language modeling: Neural network language modeling toolkit you will use in your work
- On Growing and Pruning Kneser-Ney Smoothed N-Gram Models: n-gram language modeling toolkit you will use in your work

Features for ASR
Hynek Hermansky, "Should recognizers have ears?", Speech Communication 25 (1998) 3-27.
Hynek Hermansky's original PLP article in the Journal of the Acoustical Society of America 87 (4), 1990.
Confidence measure for ASR
H. Jiang. Confidence measures for speech recognition: A survey. Speech Communication, 45(4):455-470, 2005.
T. Schaaf and T. Kemp. Confidence measures for spontaneous speech recognition. In IEEE International Conference on Acoustics, Speech, and Signal Processing ICASSP-97, pages 875-878, IEEE, 1997.
J. G. A. Dolfing and A. Wendemuth. Combination of confidence measures in isolated word recognition. In Proceedings of the International Conference on Spoken Language Processing, pages 3237-3240, 1998.
F. Wessel, R. Schlüter, K. Macherey, and H. Ney. Confidence measures for large vocabulary continuous speech recognition. IEEE Transactions on Speech and Audio Processing, 9(3):288-298, 2001.
T. Fabian. Confidence Measurement Techniques in Automatic Speech Recognition and Dialog Management. Der Andere Verlag, 2008.

Language recognition
Li, H. et al. (2013). Spoken Language Recognition: From Fundamentals to Practice. Thorough overview of language recognition.
Tang, Z. et al. (2018). Phonetic Temporal Neural Model for Language Identification. Sections I.A and I.B provide another short overview.
Gonzalez-Dominguez, Javier et al. (2014). Automatic language identification using long short-term memory recurrent neural networks. Deep learning approach with DNNs and LSTMs.
Martínez, David et al. (2011). Language Recognition in iVectors Space. Statistical approach.
Zissman, M. A. (1996). Comparison of four approaches to automatic language identification of telephone speech. Comparing GMMs and PRLM variants.
Muthusamy, Y. K. (1994). Reviewing automatic language identification.
Castaldo, F. et al. (2008). Politecnico di Torino System for the 2007 NIST Language Recognition Evaluation.
Examples of state-of-the-art models:
Shon, Suwon et al. (2018). Convolutional Neural Network and Language Embeddings for End-to-End Dialect Recognition. Model: https://github.com/swshon/dialectID_e2e.
Ma, Zhanyu et al. (2019). Short Utterance Based Speech Language Identification in Intelligent Vehicles With Time-Scale Modifications and Deep Bottleneck Features.

Speech compression for ASR
http://www.data-compression.com/speech.html
B. T. Lilly and K. K. Paliwal, Effect of speech coders on speech recognition performance, Griffith University, Australia.
Juan M. Huerta and Richard M. Stern, Speech recognition from GSM codec parameters, Carnegie Mellon University, USA.
Dan Chazan, Gilad Cohen, Ron Hoory and Meir Zibulski, Low bit rate speech compression for playback in speech recognition systems, in Proc. Eur. Signal Processing Conf.
L. Besacier, C. Bergamini, D. Vaufreydaz and E. Castelli, The effect of speech and audio compression on speech recognition performance, in Proc. IEEE Multimedia Signal Processing Workshop.

Speech recognition in noise
Gales, M. J., Young, S. J. The Application of Hidden Markov Models in Speech Recognition. Foundations and Trends in Signal Processing, 2007, vol. 1, no. 3, pp. 241-268. Chapters 5-6.
http://dx.doi.org/10.1561/2000000004
Woodland, P. C. Speaker Adaptation for Continuous Density HMMs: A Review. ITRW on Adaptation Methods for Speech Automatic Recognition, 2001, pp. 11-18.
http://www.isca-speech.org/archive/adaptation/adap_011.html
Voice activity detection
J. Ramírez, J. M. Górriz and J. C. Segura. Voice Activity Detection. Fundamentals and Speech Recognition System Robustness. https://www.intechopen.com/chapter/pdf-download/104
Christian Uhle: Voice Activity Detection. Chapter 13 in Speech Coding (book editor: Tom Bäckström). https://aalto.finna.fi/Record/alli.781859 (available as e-book)

Speech synthesis
Speech synthesis resources:
- Prof. Simon King's tutorial at Interspeech 2017, includes videos (recommended): http://www.speech.zone/courses/one-off/merlin-interspeech2017/
- Paul Taylor's book "Text-to-Speech Synthesis". Comprehensive reference for HMM-based parametric TTS (but has no neural nets): http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.118.5905&rep=rep1&type=pdf
- Simon King's (less comprehensive) introduction to HMM-based parametric synthesis: http://www.cstr.ed.ac.uk/downloads/publications/2010/king_hmm_tutorial.pdf

Code for projects:
Open source DNN-based parametric speech synthesis:
https://github.com/CSTR-Edinburgh/merlin/tree/master/src
Current state-of-the-art: Tacotron and WaveNet (GPU required)
https://google.github.io/tacotron/publications/tacotron2/
Code:
https://github.com/keithito/tacotron
https://github.com/r9y9/wavenet_vocoder
Multichannel ASR
D. V. Compernolle, "DSP techniques for speech enhancement", in Proc. ESCA Workshop on Speech Processing in Adverse Conditions, pp. 21-30 (1992)
http://www.esat.kuleuven.be/psi/spraak/cgi-bin/get_file.cgi%3F/ps_reconstruct/compi_netrw92.ps%26ps%26dvc:wspac92
J. F. Cardoso, "Blind signal separation: statistical principles", Proc. IEEE, vol. 86, no. 10, pp. 2009-2025 (1998)
http://perso.telecom-paristech.fr/~cardoso/Papers.PDF/ProcIEEE.pdf

Animation of HMMs
This topic has no external material, but you should study the basic HMM algorithms (Forward-Backward and Viterbi) carefully. Check them in several sources, e.g.:
- my lecture slides from the second Wednesday's lecture
- Manning & Schütze: pp. 326-327
- Jurafsky & Martin (1st edition): pp. 843-849
- Gales & Young: pp. 205-206 (DOI 10.1561/2000000004)
- Rabiner: "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition", pp. 262-263

Pronunciation model adaptation
Bisani, M., & Ney, H. (2008). Joint-sequence models for grapheme-to-phoneme conversion. Speech Communication, 50(5), 434-451.
Maas, A., Xie, Z., Jurafsky, D., & Ng, A. (2015). Lexicon-free conversational speech recognition with neural networks. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (pp. 345-354).
Tools: Sequitur G2P
Materials: WSJ_5k (from exercise 4)
Wang, W. Y., Biadsy, F., Rosenberg, A., & Hirschberg, J. (2013). Automatic detection of speaker state: Lexical, prosodic, and phonetic approaches to level-of-interest and intoxication classification. Computer Speech & Language, 27(1), 168-189.

Automatic detection of alcohol intoxication
Schuller, B., Steidl, S., Batliner, A., Schiel, F., Krajewski, J., Weninger, F., & Eyben, F. (2014). Medium-term speaker states: A review on intoxication, sleepiness and the first challenge. Computer Speech & Language, 28(2), 346-374.
Tools: OpenSMILE, Anaconda?
Materials: Alcohol Language Corpus

Comparing subword language models
Hirsimäki, T., Creutz, M., Siivola, V., Kurimo, M., Virpioja, S., & Pylkkönen, J. (2006). Unlimited vocabulary speech recognition with morph language models applied to Finnish. Computer Speech & Language, 20(4), 515-541.
Tools: SRILM, VariKN, Kaldi, Morfessor
Materials: STT, Kielipankki, Wikipedia
Peter Smit: Modern subword-based models for automatic speech recognition (2019). PhD thesis. Aalto University. https://aaltodoc.aalto.fi/handle/123456789/38073
Sami Virpioja: Learning Constructions of Natural Language: Statistical Models and Evaluations (2012). PhD thesis. Aalto University. https://aaltodoc.aalto.fi/handle/123456789/7294
https://medium.com/huggingface/how-to-build-a-state-of-the-art-conversational-ai-with-transfer-learning-2d818ac26313

Chatbots
Zhang, Saizheng, et al. "Personalizing Dialogue Agents: I have a dog, do you have pets too?" Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2018.
Dinan, Emily, et al. "The second conversational intelligence challenge (ConvAI2)." arXiv preprint arXiv:1902.00098 (2019).
Tools: ParlAI, Hugging Face Transformers
Materials: PersonaChat, OpenSubtitles
https://parl.ai/
https://convai.io/

Connectionist temporal classification
Alex Graves et al., "Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks", https://www.cs.toronto.edu/~graves/icml_2006.pdf
Awni Hannun, "Sequence Modeling With CTC", https://distill.pub/2017/ctc/

Attention-based encoder-decoder end-to-end ASR
Chan et al. Listen, Attend and Spell
Chorowski et al. Attention-based models for speech recognition
Chiu et al. State-of-the-Art Speech Recognition with Sequence-to-Sequence Models. https://ieeexplore.ieee.org/abstract/document/8462105
Lüscher et al. RWTH ASR Systems for LibriSpeech: Hybrid vs Attention. https://www.isca-speech.org/archive/Interspeech_2019/pdfs/1780.pdf
Watanabe et al. ESPnet: End-to-end speech processing toolkit

Data Augmentation
Tom Ko, Vijayaditya Peddinti, Daniel Povey, Sanjeev Khudanpur, "Audio Augmentation for Speech Recognition", INTERSPEECH 2015.
Mengjie Qian, Ian McLoughlin, Wu Guo, Lirong Dai, "Mismatched training data enhancement for automatic recognition of children's speech using DNN-HMM", 10th International Symposium on Chinese Spoken Language Processing (ISCSLP) 2016. https://ieeexplore.ieee.org/document/7918386

Native Language Recognition
Baseline paper: The INTERSPEECH 2016 Computational Paralinguistics Challenge: Deception, Sincerity & Native Language https://www.isca-speech.org/archive/Interspeech_2016/pdfs/0129.PDF
Winner paper: Exploiting phone log-likelihood ratio features for the detection of the native language of non-native English speakers (https://www.isca-speech.org/archive/Interspeech_2016/pdfs/1491.PDF)
Other papers available here: https://www.isca-speech.org/archive/Interspeech_2016/ (check the Special Session: Interspeech 2016 Computational Paralinguistics Challenge (ComParE): Deception, Sincerity & Native Language blocks)

Deep Denoising Autoencoder for Speech Enhancement
Speech Enhancement Based on Deep Denoising Autoencoder, https://www.isca-speech.org/archive/archive_papers/interspeech_2013/i13_0436.pdf
Exploring multi-channel features for denoising-autoencoder-based speech enhancement (if you decide to use multichannel data)
Joint training of front-end and back-end deep neural networks for robust speech recognition
Tensorflow implementation: https://github.com/jonlu0602/DeepDenoisingAutoencoder

Language model adaptation
N-gram model adaptation techniques:
TKK Master's thesis by Simo Broman pp. 9-41
http://cis.legacy.ics.tkk.fi/sbroman/dippa.pdf
Master's thesis by Andre Mansikkaniemi pp. 47-53
http://lib.tkk.fi/Dipl/2010/urn100143.pdf
Statistical language model adaptation: review and perspectives
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.91.4893&rep=rep1&type=pdf

Neural language model adaptation techniques:
Approaches for Neural-Network Language Model Adaptation
https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/46439.pdf
Recurrent Neural Network Language Model Adaptation for Conversational Speech Recognition
http://www.danielpovey.com/files/2018_interspeech_lm_adapt.pdf

-
Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition, by Daniel Jurafsky and James H. Martin
-