ELEC-E5510 - Speech Recognition D, 28.10.2020-11.12.2020
Section description
-
Good reading material for the ASR course:
X. D. Huang: Spoken Language Processing: A Guide to Theory, Algorithm, and System Development. Prentice-Hall, 2001
The HTK Book: http://htk.eng.cam.ac.uk/docs/docs.shtml
M. Gales and S. Young: The Application of Hidden Markov Models in Speech Recognition. In Foundations and Trends in Signal Processing Vol. 1, No. 3 (2007) 195-304. http://dx.doi.org/10.1561/2000000004
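All three of these references center on hidden Markov models. As a toy companion to the reading, here is a minimal Viterbi decoder for a discrete-observation HMM. The model, its states, and all probabilities below are invented for illustration (they follow the classic textbook weather example, not any specific reference above):

```python
# Toy Viterbi decoding for a discrete-observation HMM.
# All model numbers are invented for illustration only.

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most likely state sequence for an observation sequence."""
    # V[t][s] = probability of the best path ending in state s at time t
    V = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]
    back = [{}]  # backpointers: back[t][s] = best predecessor of s at time t
    for t in range(1, len(obs)):
        V.append({})
        back.append({})
        for s in states:
            # Best previous state and the probability of reaching s through it
            prev, p = max(
                ((ps, V[t - 1][ps] * trans_p[ps][s]) for ps in states),
                key=lambda x: x[1],
            )
            V[t][s] = p * emit_p[s][obs[t]]
            back[t][s] = prev
    # Trace back from the best final state
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))

states = ("Rainy", "Sunny")
start_p = {"Rainy": 0.6, "Sunny": 0.4}
trans_p = {"Rainy": {"Rainy": 0.7, "Sunny": 0.3},
           "Sunny": {"Rainy": 0.4, "Sunny": 0.6}}
emit_p = {"Rainy": {"walk": 0.1, "shop": 0.4, "clean": 0.5},
          "Sunny": {"walk": 0.6, "shop": 0.3, "clean": 0.1}}

print(viterbi(("walk", "shop", "clean"), states, start_p, trans_p, emit_p))
# prints ['Sunny', 'Rainy', 'Rainy']
```

The Forward algorithm (the other half of Forward-Backward) has the same recursion with the max over previous states replaced by a sum.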
Additional reading for some special topics in ASR:
Speaker recognition
Campbell, J. P., Jr.: Speaker recognition: a tutorial. Proceedings of the IEEE, vol. 85, no. 9, Sept. 1997, pp. 1437-1462. DOI 10.1109/5.628714
Sadaoki Furui: Recent advances in speaker recognition. Pattern Recognition Letters 18(9): 859-872 (1997)
http://cslu.cse.ogi.edu/HLTsurvey/ch1node9.html

Speaker adaptation
Gales, M. J., Young, S. J. The Application of Hidden Markov Models in Speech Recognition. Foundations and Trends in Signal Processing, 2007, vol. 1, no. 3, pp. 241-255.
http://dx.doi.org/10.1561/2000000004
Woodland, P. C. Speaker Adaptation for Continuous Density HMMs: A Review. ITRW on Adaptation Methods for Speech Automatic Recognition, 2001, pp. 11-18.
http://www.isca-speech.org/archive/adaptation/adap_011.html
Wang Zhirong, Schultz Tanja, Waibel Alex, Comparison of Acoustic Model Adaptation Techniques on Non-native Speech. Acoustics, Speech, and Signal Processing, 2003. Proceedings. (ICASSP '03). 2003 IEEE International Conference.
Koichi Shinoda: "Speaker Adaptation Techniques for Automatic Speech Recognition". APSIPA ASC 2011, Xi'an. Tokyo Institute of Technology, Tokyo, Japan.

Speech Adaptation
X. Zhu, G. T. Beauregard, and L. L. Wyse, "Real-time signal estimation from modified short-time Fourier transform magnitude spectra," IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 5, pp. 1645-1653, July 2007.
H. K. Kathania, W. Ahmad, S. Shahnawazuddin, and A. B. Samaddar, "Explicit pitch mapping for improved children's speech recognition," Circuits, Systems, and Signal Processing, September 2017.
H. K. Kathania, S. Shahnawazuddin, W. Ahmad, N. Adiga, S. K. Jana, and A. B. Samaddar, "Improving children's speech recognition through time scale modification based speaking rate adaptation," in 2018 International Conference on Signal Processing and Communications (SPCOM), July 2018.

Audio indexing
Steve Renals, Dave Abberley, David Kirby, Tony Robinson, "Indexing and retrieval of broadcast news", Speech Communication, Volume 32, Issues 1-2, September 2000, pp. 5-20.
John S. Garofolo, Cedric G. P. Auzanne, Ellen M. Voorhees, "The TREC Spoken Document Retrieval Track: A Success Story", in 8th Text Retrieval Conference, pages 107-129, Washington, 2000.
Chelba, C.; Hazen, T. J.; Saraclar, M., "Retrieval and browsing of spoken content", IEEE Signal Processing Magazine, vol. 25, no. 3, pp. 39-49, May 2008.
http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=4490200&isnumber=4490183

DNNs for acoustic modeling
- Kaldi tutorial: Kaldi is currently the most popular ASR toolkit in research. We recommend using it for this project. You don't have to complete the tutorial, but reading it can help you understand some basics.
- Recent Progresses in Deep Learning based Acoustic Models: A very recent survey on acoustic modeling for ASR
- Deep Neural Networks for Acoustic Modeling in Speech Recognition: An older survey on acoustic modeling for ASR
- Purely sequence-trained neural networks for ASR based on lattice-free MMI: A popular sequence discriminative criterion

Pytorch-Kaldi for Acoustic Modeling on Finnish Speech
Suggested Reading:
- THE PYTORCH-KALDI SPEECH RECOGNITION TOOLKIT: The toolkit to be used in the project
- Recent Progresses in Deep Learning based Acoustic Models: A very recent survey on acoustic modeling for ASR
- Deep Neural Networks for Acoustic Modeling in Speech Recognition: An older survey on acoustic modeling for ASR

Language Modeling for Indian Languages
Suggested Reading:
- BUT system for low resource Indian language ASR: This paper uses the same dataset as used in the project.
- Recurrent neural network based language model: A very popular contemporary neural network language model
- IMPROVED BACKING-OFF FOR M-GRAM LANGUAGE MODELING: A popular n-gram language modeling technique
- LSTM Neural Networks for Language Modeling: Another popular contemporary neural network language model
- Exploring the Limits of Language Modeling: A survey on existing language modeling techniques
- TheanoLM - An extensible toolkit for neural network language modeling: Neural network language modeling toolkit you will use in your work
- On Growing and Pruning Kneser-Ney Smoothed N-Gram Models: n-gram language modeling toolkit you will use in your work

Features for ASR
Hynek Hermansky, "Should recognizers have ears?", Speech Communication 25 (1998) 3-27.
Hynek Hermansky's original PLP article in the Journal of the Acoustical Society of America 87 (4), 1990.
Confidence measure for ASR
H. Jiang. Confidence measures for speech recognition: A survey. Speech Communication, 45(4):455-470, 2005.
T. Schaaf and T. Kemp. Confidence measures for spontaneous speech recognition. In IEEE International Conference on Acoustics, Speech, and Signal Processing ICASSP-97, pages 875-878, IEEE, 1997.
J. G. A. Dolfing and A. Wendemuth. Combination of confidence measures in isolated word recognition. In Proceedings of the International Conference on Spoken Language Processing, pages 3237-3240, 1998.
F. Wessel, R. Schlüter, K. Macherey, and H. Ney. Confidence measures for large vocabulary continuous speech recognition. IEEE Transactions on Speech and Audio Processing, 9(3):288-298, 2001.
T. Fabian. Confidence Measurement Techniques in Automatic Speech Recognition and Dialog Management. Der Andere Verlag, 2008.

Language recognition
Li, H. et al. (2013). Spoken Language Recognition: From Fundamentals to Practice. Thorough overview of language recognition.
Tang, Z. et al. (2018). Phonetic Temporal Neural Model for Language Identification. Sections I.A and I.B provide another short overview.
Gonzalez-Dominguez, Javier et al. (2014). Automatic language identification using long short-term memory recurrent neural networks. Deep learning approach with DNNs and LSTMs.
Martínez, David et al. (2011). Language Recognition in iVectors Space. Statistical approach.
Zissman, M. A. (1996). Comparison of four approaches to automatic language identification of telephone speech. Comparing GMMs and PRLM variants.
Muthusamy, Y. K. (1994). Reviewing automatic language identification.
Castaldo, F. et al. (2008). Politecnico di Torino System for the 2007 NIST Language Recognition Evaluation.
Examples of state-of-the-art models:
Shon, Suwon et al. (2018). Convolutional Neural Network and Language Embeddings for End-to-End Dialect Recognition. Model: https://github.com/swshon/dialectID_e2e.
Ma, Zhanyu et al. (2019). Short Utterance Based Speech Language Identification in Intelligent Vehicles With Time-Scale Modifications and Deep Bottleneck Features.

Speech compression for ASR
http://www.data-compression.com/speech.html
B. T. Lilly and K. K. Paliwal, Effect of speech coders on speech recognition performance, Griffith University, Australia.
Juan M. Huerta and Richard M. Stern, Speech recognition from GSM codec parameters, Carnegie Mellon University, USA.
Dan Chazan, Gilad Cohen, Ron Hoory and Meir Zibulski, Low bit rate speech compression for playback in speech recognition systems, in Proc. Eur. Signal Processing Conf.
L. Besacier, C. Bergamini, D. Vaufreydaz and E. Castelli, The effect of speech and audio compression on speech recognition performance, in Proc. IEEE Multimedia Signal Processing Workshop.

Speech recognition in noise
Gales, M. J., Young, S. J. The Application of Hidden Markov Models in Speech Recognition. Foundations and Trends in Signal Processing, 2007, vol. 1, no. 3, pp. 241-268. Chapters 5-6.
http://dx.doi.org/10.1561/2000000004
Woodland, P. C. Speaker Adaptation for Continuous Density HMMs: A Review. ITRW on Adaptation Methods for Speech Automatic Recognition, 2001, pp. 11-18.
http://www.isca-speech.org/archive/adaptation/adap_011.html
Voice activity detection
J. Ramírez, J. M. Górriz and J. C. Segura. Voice Activity Detection. Fundamentals and Speech Recognition System Robustness. https://www.intechopen.com/chapter/pdf-download/104
Christian Uhle: Voice Activity Detection. Chapter 13 in Speech Coding (book editor: Tom Bäckström). https://aalto.finna.fi/Record/alli.781859 (available as e-book)

Speech synthesis
Speech synthesis resources:
- Prof. Simon King's tutorial at Interspeech 2017, includes videos (recommended): http://www.speech.zone/courses/one-off/merlin-interspeech2017/
- Paul Taylor's book "Text-to-Speech Synthesis". Comprehensive reference for HMM-based parametric TTS (but has no neural nets): http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.118.5905&rep=rep1&type=pdf
- Simon King's (less comprehensive) introduction to HMM-based parametric synthesis: http://www.cstr.ed.ac.uk/downloads/publications/2010/king_hmm_tutorial.pdf

Code for projects:
Open source DNN-based parametric speech synthesis:
https://github.com/CSTR-Edinburgh/merlin/tree/master/src
Current state-of-the-art: Tacotron and WaveNet (GPU required)
https://google.github.io/tacotron/publications/tacotron2/
Code:
https://github.com/keithito/tacotron
https://github.com/r9y9/wavenet_vocoder
Multichannel ASR
D. V. Compernolle, "DSP techniques for speech enhancement", in Proc. ESCA Workshop on Speech Processing in Adverse Conditions, pp. 21-30 (1992)
http://www.esat.kuleuven.be/psi/spraak/cgi-bin/get_file.cgi%3F/ps_reconstruct/compi_netrw92.ps%26ps%26dvc:wspac92
J. F. Cardoso, "Blind signal separation: statistical principles", Proc. IEEE, vol. 86, no. 10, pp. 2009-2025 (1998)
http://perso.telecom-paristech.fr/~cardoso/Papers.PDF/ProcIEEE.pdf

Animation of HMMs
This topic has no external material, but you should study the basic HMM algorithms (Forward-Backward and Viterbi) carefully. Check them in several sources, e.g.:
- my lecture slides from the second Wednesday's lecture
- Manning & Schütze: pp. 326-327
- Jurafsky & Martin (1st edition): pp. 843-849
- Gales & Young: pp. 205-206 (DOI 10.1561/2000000004)
- Rabiner: "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition", pp. 262-263

Pronunciation model adaptation
Bisani, M., & Ney, H. (2008). Joint-sequence models for grapheme-to-phoneme conversion. Speech Communication, 50(5), 434-451.
Maas, A., Xie, Z., Jurafsky, D., & Ng, A. (2015). Lexicon-free conversational speech recognition with neural networks. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (pp. 345-354).
Tools: Sequitur G2P
Materials: WSJ_5k (from exercise 4)
Wang, W. Y., Biadsy, F., Rosenberg, A., & Hirschberg, J. (2013). Automatic detection of speaker state: Lexical, prosodic, and phonetic approaches to level-of-interest and intoxication classification. Computer Speech & Language, 27(1), 168-189.

Automatic detection of alcohol intoxication
Schuller, B., Steidl, S., Batliner, A., Schiel, F., Krajewski, J., Weninger, F., & Eyben, F. (2014). Medium-term speaker states: A review on intoxication, sleepiness and the first challenge. Computer Speech & Language, 28(2), 346-374.
Tools: OpenSMILE, Anaconda?
Materials: Alcohol Language Corpus

Comparing subword language models
Hirsimäki, T., Creutz, M., Siivola, V., Kurimo, M., Virpioja, S., & Pylkkönen, J. (2006). Unlimited vocabulary speech recognition with morph language models applied to Finnish. Computer Speech & Language, 20(4), 515-541.
Tools: SRILM, VariKN, Kaldi, Morfessor
Materials: STT, Kielipankki, Wikipedia
Peter Smit: Modern subword-based models for automatic speech recognition (2019). PhD thesis. Aalto University. https://aaltodoc.aalto.fi/handle/123456789/38073
Sami Virpioja: Learning Constructions of Natural Language: Statistical Models and Evaluations (2012). PhD thesis. Aalto University. https://aaltodoc.aalto.fi/handle/123456789/7294
https://medium.com/huggingface/how-to-build-a-state-of-the-art-conversational-ai-with-transfer-learning-2d818ac26313

Chatbots
Zhang, Saizheng, et al. "Personalizing Dialogue Agents: I have a dog, do you have pets too?" Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2018.
Dinan, Emily, et al. "The second conversational intelligence challenge (ConvAI2)." arXiv preprint arXiv:1902.00098 (2019).
Tools: ParlAI, Hugging Face Transformers
Materials: PersonaChat, OpenSubtitles
https://parl.ai/
https://convai.io/

Connectionist temporal classification
Alex Graves et al., "Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks", https://www.cs.toronto.edu/~graves/icml_2006.pdf
Awni Hannun, "Sequence Modeling With CTC", https://distill.pub/2017/ctc/

Attention-based encoder-decoder end-to-end ASR
Chan et al. Listen, Attend and Spell
Chorowski et al. Attention-based models for speech recognition
Chiu et al. State-of-the-Art Speech Recognition with Sequence-to-Sequence Models. https://ieeexplore.ieee.org/abstract/document/8462105
Lüscher et al. RWTH ASR Systems for LibriSpeech: Hybrid vs Attention. https://www.isca-speech.org/archive/Interspeech_2019/pdfs/1780.pdf
Watanabe et al. ESPnet: End-to-end speech processing toolkit

Data Augmentation
Tom Ko, Vijayaditya Peddinti, Daniel Povey, Sanjeev Khudanpur, "Audio Augmentation for Speech Recognition", INTERSPEECH 2015.
Mengjie Qian, Ian McLoughlin, Wu Guo, Lirong Dai, "Mismatched training data enhancement for automatic recognition of children's speech using DNN-HMM", 10th International Symposium on Chinese Spoken Language Processing (ISCSLP) 2016. https://ieeexplore.ieee.org/document/7918386

Native Language Recognition
Baseline paper: The INTERSPEECH 2016 Computational Paralinguistics Challenge: Deception, Sincerity & Native Language https://www.isca-speech.org/archive/Interspeech_2016/pdfs/0129.PDF
Winner paper: Exploiting phone log-likelihood ratio features for the detection of the native language of non-native English speakers (https://www.isca-speech.org/archive/Interspeech_2016/pdfs/1491.PDF)
Other papers available here: https://www.isca-speech.org/archive/Interspeech_2016/ (check the Special Session: Interspeech 2016 Computational Paralinguistics Challenge (ComParE): Deception, Sincerity & Native Language blocks)

Deep Denoising Autoencoder for Speech Enhancement
Speech Enhancement Based on Deep Denoising Autoencoder, https://www.isca-speech.org/archive/archive_papers/interspeech_2013/i13_0436.pdf
Exploring multi-channel features for denoising-autoencoder-based speech enhancement (if you decide to use multichannel data)
Joint training of front-end and back-end deep neural networks for robust speech recognition
Tensorflow implementation: https://github.com/jonlu0602/DeepDenoisingAutoencoder

Language model adaptation
N-gram model adaptation techniques:
TKK Master's thesis by Simo Broman pp. 9-41
http://cis.legacy.ics.tkk.fi/sbroman/dippa.pdf
Master's thesis by Andre Mansikkaniemi pp. 47-53
http://lib.tkk.fi/Dipl/2010/urn100143.pdf
Statistical language model adaptation: review and perspectives
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.91.4893&rep=rep1&type=pdf

Neural language model adaptation techniques:
Approaches for Neural-Network Language Model Adaptation
https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/46439.pdf
Recurrent Neural Network Language Model Adaptation for Conversational Speech Recognition
http://www.danielpovey.com/files/2018_interspeech_lm_adapt.pdf

-
Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition, by Daniel Jurafsky and James H. Martin
-