Reading materials

Site:	MyCourses
Course:	ELEC-E5510 - Speech Recognition D, Lecture, 25.10.2023-8.12.2023
Book:	Reading materials

Date:	Thursday, 3 April 2025, 10:26 PM

Description

Collection of reading materials by topic.

1. Course readings
2. Material on project topics

1. Course readings

Good reading material for the ASR course:

X. D. Huang: Spoken language processing : a guide to theory, algorithm, and system development. Prentice-Hall, 2001
The HTK Book
M. Gales and S. Young: The Application of Hidden Markov Models in Speech Recognition. In Foundations and Trends in Signal Processing Vol. 1, No. 3 (2007) 195-304.

2. Material on project topics

Reading materials for project work organized by topic.

2.1. Language model adaptation

N-gram model adaptation techniques:

TKK Master's thesis by Simo Broman pp. 9-41
Master's thesis by Andre Mansikkaniemi pp. 47-53
Bellegarda, J. R. (2004). Statistical language model adaptation: review and perspectives. Speech communication, 42(1), 93-108.

Neural language model adaptation techniques:

Ma, M., Nirschl, M., Biadsy, F., & Kumar, S. (2017). Approaches for Neural-Network Language Model Adaptation. In INTERSPEECH (pp. 259-263).
Li, K., Xu, H., Wang, Y., Povey, D., & Khudanpur, S. (2018). Recurrent Neural Network Language Model Adaptation for Conversational Speech Recognition. In Interspeech (Vol. 2018, pp. 3373-3377).

2.2. Audio event tagging

Gemmeke, J. F., Ellis, D. P., Freedman, D., Jansen, A., Lawrence, W., Moore, R. C., ... & Ritter, M. (2017). Audio set: An ontology and human-labeled dataset for audio events. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 776-780). IEEE.
Dang, A., Vu, T. H., & Wang, J. C. (2017). A survey of deep learning for polyphonic sound event detection. In 2017 International Conference on Orange Technologies (ICOT) (pp. 75-78). IEEE.
Babaee, E., Anuar, N. B., Abdul Wahab, A. W., Shamshirband, S., & Chronopoulos, A. T. (2017). An overview of audio event detection methods from feature extraction to classification. Applied Artificial Intelligence, 31(9-10), 661-714.
Guo, G., & Li, S. Z. (2003). Content-based audio classification and retrieval by support vector machines. IEEE transactions on Neural Networks, 14(1), 209-215.
Ma, L., Milner, B., & Smith, D. (2006). Acoustic environment classification. ACM Transactions on Speech and Language Processing (TSLP), 3(2), 1-22.
Wolfram tutorial - Audio Analysis with Neural Networks

2.3. Language recognition

Li, H. et al. (2013). Spoken Language Recognition: From Fundamentals to Practice. Thorough overview of language recognition.
Tang, Z. et al. (2018). Phonetic Temporal Neural Model for Language Identification. Sections I.A and I.B provide another short overview.
Gonzalez-Dominguez, Javier et al. (2014). Automatic language identification using long short-term memory recurrent neural networks. Deep learning approach with DNNs and LSTMs.
Martínez, David et al. (2011). Language Recognition in iVectors Space. Statistical approach.
Zissman, M. A. (1996). Comparison of four approaches to automatic language identification of telephone speech. Comparing GMMs and PRLM variants.
Muthusamy, Y. K. (1994). Reviewing automatic language identification.
Castaldo, F. et al. (2008). Politecnico di Torino System for the 2007 NIST Language Recognition Evaluation.

Examples of state-of-the-art models:

Shon, Suwon et al. (2018). Convolutional Neural Network and Language Embeddings for End-to-End Dialect Recognition. Model: https://github.com/swshon/dialectID_e2e.
Ma, Zhanyu et al. (2019). Short Utterance Based Speech Language Identification in Intelligent Vehicles With Time-Scale Modifications and Deep Bottleneck Features.

2.4. Speech command recognition

Warden, P. (2018). Speech commands: A dataset for limited-vocabulary speech recognition. arXiv preprint arXiv:1804.03209.
Matlab Tutorial - Speech Command Recognition Using Deep Learning
Reich, D., Putze, F., Heger, D., Ijsselmuiden, J., Stiefelhagen, R., & Schultz, T. (2011). A real-time speech command detector for a smart control room. In Twelfth Annual Conference of the International Speech Communication Association.
Furui, S. (1986). Speaker-independent isolated word recognition using dynamic features of speech spectrum. IEEE Transactions on Acoustics, Speech, and Signal Processing, 34(1), 52-59.
Prabhavalkar, R., Keshet, J., Livescu, K., & Fosler-Lussier, E. (2012). Discriminative spoken term detection with limited data. In Symposium on Machine Learning in Speech and Language Processing.
Fernández, S., Graves, A., & Schmidhuber, J. (2007). An application of recurrent neural networks to discriminative keyword spotting. In International Conference on Artificial Neural Networks (pp. 220-229). Springer, Berlin, Heidelberg.

2.5. Language Modeling for Indian Languages

BUT system for low resource Indian language ASR: This paper uses the same dataset as the project.
LSTM Neural Networks for Language Modeling: Another popular contemporary neural network language model
Exploring the Limits of Language Modeling: A survey on existing language modeling techniques
TheanoLM - An extensible toolkit for neural network language modeling: Neural network language modeling toolkit you will use in your work
On Growing and Pruning Kneser–Ney Smoothed N-Gram Models: n-gram language modeling toolkit you will use in your work
Simplified Description of the above technique
IMPROVED BACKING-OFF FOR M-GRAM LANGUAGE MODELING: A popular n-gram language modeling technique
Recurrent neural network based language model: A very popular contemporary neural network language model

2.6. Automatic detection of alcohol intoxication

Wang, W. Y., Biadsy, F., Rosenberg, A., & Hirschberg, J. (2013). Automatic detection of speaker state: Lexical, prosodic, and phonetic approaches to level-of-interest and intoxication classification. Computer Speech & Language, 27(1), 168-189.
Schuller, B., Steidl, S., Batliner, A., Schiel, F., Krajewski, J., Weninger, F., & Eyben, F. (2014). Medium-term speaker states—A review on intoxication, sleepiness and the first challenge. Computer Speech & Language, 28(2), 346-374.
Tools: OpenSMILE, Anaconda?
Materials: Alcohol Language Corpus

2.7. Speech adaptation

X. Zhu, G. T. Beauregard, and L. L. Wyse: Real-time signal estimation from modified short-time fourier transform magnitude spectra. IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 5, pp. 1645–1653, July 2007.
H. K. Kathania, S. Shahnawazuddin, W. Ahmad, N. Adiga, S. K. Jana, and A. B. Samaddar: Improving children’s speech recognition through time scale modification based speaking rate adaptation. In 2018 International Conference on Signal Processing and Communications (SPCOM), July 2018.
H. K. Kathania, W. Ahmad, S. Shahnawazuddin, and A. B. Samaddar: Explicit pitch mapping for improved children’s speech recognition. Circuits, Systems, and Signal Processing, September 2017.

2.8. Speaker adaptation

Gales, M. J., Young, S. J. The Application of Hidden Markov Models in Speech Recognition. Foundations and Trends in Signal Processing, 2007, vol. 1, no. 3, pp. 241-255.
Woodland, P. C. Speaker Adaptation for Continuous Density HMMs: A Review. ITRW on Adaptation Methods for Speech Automatic Recognition, 2001, pp. 11-18.
Wang Zhirong, Schultz Tanja, Waibel Alex, Comparison of Acoustic Model Adaptation Techniques on Non-native Speech. Acoustics, Speech, and Signal Processing, 2003. Proceedings. (ICASSP '03). 2003 IEEE International Conference.
Koichi Shinoda: Speaker Adaptation Techniques for Automatic Speech Recognition. APSIPA ASC 2011, Xi’an. Tokyo Institute of Technology, Tokyo, Japan.

2.9. Deep denoising autoencoder for speech enhancement

Lu, X., Tsao, Y., Matsuda, S., & Hori, C. (2013). Speech enhancement based on deep denoising autoencoder. In Interspeech (Vol. 2013, pp. 436-440).
Araki, S., Hayashi, T., Delcroix, M., Fujimoto, M., Takeda, K., & Nakatani, T. (2015). Exploring multi-channel features for denoising-autoencoder-based speech enhancement. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 116-120). IEEE. (if you decide to use multichannel data)
Gao, T., Du, J., Dai, L. R., & Lee, C. H. (2015). Joint training of front-end and back-end deep neural networks for robust speech recognition. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 4375-4379). IEEE.
Tensorflow implementation

2.10. Native language recognition

Schuller, Björn, et al. The Interspeech 2016 computational paralinguistics challenge: Deception, sincerity & native language. 17th Annual Conference of the International Speech Communication Association (Interspeech 2016), Vols 1-5. 2016. (baseline paper)
Abad, Alberto, et al. Exploiting Phone Log-Likelihood Ratio Features for the Detection of the Native Language of Non-Native English Speakers. INTERSPEECH. 2016. (winning paper)
Other papers available here. Check the special session: Interspeech 2016 Computational Paralinguistics Challenge (ComParE): Deception, Sincerity & Native Language blocks

2.11. Chatbots

Zhang, Saizheng, et al. Personalizing Dialogue Agents: I have a dog, do you have pets too?. Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2018.
Dinan, Emily, et al. The second conversational intelligence challenge (convai2). arXiv preprint arXiv:1902.00098 (2019).
Tools: ParlAI, Hugging Face Transformers, Rasa
Materials: PersonaChat, OpenSubtitles
https://medium.com/huggingface/how-to-build-a-state-of-the-art-conversational-ai-with-transfer-learning-2d818ac26313
https://parl.ai/
https://convai.io/

2.12. Comparing subword language models

Hirsimäki, T., Creutz, M., Siivola, V., Kurimo, M., Virpioja, S., & Pylkkönen, J. (2006). Unlimited vocabulary speech recognition with morph language models applied to Finnish. Computer Speech & Language, 20(4), 515-541.
Peter Smit: Modern subword-based models for automatic speech recognition (2019). PhD thesis. Aalto University.
Sami Virpioja: Learning Constructions of Natural Language: Statistical Models and Evaluations (2012). PhD thesis. Aalto University.
Tools: SRILM, VariKN, Kaldi, Morfessor
Materials: STT, Kielipankki, Wikipedia

2.13. Speaker recognition

Campbell, J. P., Jr.: Speaker recognition: a tutorial. Proceedings of the IEEE, vol. 85, no. 9, Sept. 1997, pp. 1437-1462.
Sadaoki Furui: Recent advances in speaker recognition. Pattern Recognition Letters 18(9): 859-872 (1997)
Dehak, N., Kenny, P. J., Dehak, R., Dumouchel, P., and Ouellet, P. (2010). Front-end factor analysis for speaker verification. IEEE Transactions on Audio, Speech, and Language Processing, 19(4):788–798.
Snyder, D., Garcia-Romero, D., Povey, D., & Khudanpur, S. (2017, August). Deep neural network embeddings for text-independent speaker verification. In Interspeech (pp. 999-1003).
Snyder, D., Garcia-Romero, D., Sell, G., Povey, D., and Khudanpur, S. (2018). X-vectors: Robust dnn embeddings for speaker recognition. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5329–5333. IEEE.

2.14. Voice activity detection

2.15. DNNs for acoustic modeling

Kaldi tutorial: Kaldi is currently the most popular ASR toolkit in research. We recommend using it for this project. You don't have to complete the tutorial, but reading it can help you understand some basics.
Ravanelli, Mirco, Titouan Parcollet, and Yoshua Bengio. The pytorch-kaldi speech recognition toolkit. ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019.
Yu, Dong, and Jinyu Li. Recent progresses in deep learning based acoustic models. IEEE/CAA Journal of Automatica Sinica 4.3 (2017): 396-409.
Hinton, Geoffrey, et al. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal processing magazine 29.6 (2012): 82-97.
Povey, Daniel, et al. Purely sequence-trained neural networks for ASR based on lattice-free MMI. Interspeech. 2016.

2.16. Connectionist temporal classification

Awni Hannun, "Sequence Modeling With CTC" Distill 2.11 (2017): e8.
Graves, Alex, et al. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. Proceedings of the 23rd international conference on Machine learning. 2006.

2.17. Attention-based ASR

Chan, W., Jaitly, N., Le, Q. V., & Vinyals, O. (2015). Listen, attend and spell. arXiv preprint arXiv:1508.01211.
Chorowski, J., Bahdanau, D., Serdyuk, D., Cho, K., & Bengio, Y. (2015). Attention-based models for speech recognition. arXiv preprint arXiv:1506.07503.
Chiu, C. C., Sainath, T. N., Wu, Y., Prabhavalkar, R., Nguyen, P., Chen, Z., ... & Bacchiani, M. (2018). State-of-the-art speech recognition with sequence-to-sequence models. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 4774-4778). IEEE.
Lüscher, C., Beck, E., Irie, K., Kitza, M., Michel, W., Zeyer, A., ... & Ney, H. (2019). RWTH ASR Systems for LibriSpeech: Hybrid vs Attention - w/o Data Augmentation. arXiv preprint arXiv:1905.03072.
Watanabe, S., Hori, T., Karita, S., Hayashi, T., Nishitoba, J., Unno, Y., ... & Ochiai, T. (2018). Espnet: End-to-end speech processing toolkit. arXiv preprint arXiv:1804.00015. (The Espnet toolkit is another toolkit you could use besides Speechbrain.)
Ravanelli, M., Parcollet, T., Plantinga, P., Rouhe, A., Cornell, S., Lugosch, L., ... & Bengio, Y. (2021). SpeechBrain: A general-purpose speech toolkit. arXiv preprint arXiv:2106.04624.

2.18. Data augmentation

Tom Ko, Vijayaditya Peddinti, Daniel Povey, Sanjeev Khudanpur: Audio Augmentation for Speech Recognition. INTERSPEECH 2015.
Mengjie Qian, Ian McLoughlin, Wu Quo, Lirong Dai: Mismatched training data enhancement for automatic recognition of children's speech using DNN-HMM. 10th International Symposium on Chinese Spoken Language Processing (ISCSLP) 2016.

2.19. Speech synthesis

Prof. Simon King’s tutorial at Interspeech 2017, includes videos (recommended)
Paul Taylor’s book “Text-to-Speech Synthesis”. Comprehensive reference for HMM-based parametric TTS (but has no neural nets)
Simon King’s (less comprehensive) introduction to HMM-based parametric synthesis

Code for projects:

2.20. Speech compression

http://www.data-compression.com/speech.html
B.T.Lilly and K.K. Paliwal, Effect of speech coders on speech recognition performance, Griffith University, Australia.
Juan M. Huerta and Richard M. Stern, Speech recognition from GSM codec parameters, Carnegie Mellon University, USA.
Dan Chazan, Gilad Cohen, Ron Hoory and Meir Zibulski, Low bit rate speech compression for playback in speech recognition systems, in Proc. Eur. Signal Processing Conf
L. Besacier, C. Bergamini, D. Vaufreydaz and E. Castelli, The effect of speech and audio compression on speech recognition performance, In: Proc. IEEE Multimedia Signal Processing Workshop

2.21. Speech recognition in noise

Gales, M. J., Young, S. J. The Application of Hidden Markov Models in Speech Recognition. Foundations and Trends in Signal Processing, 2007, vol. 1, no. 3, pp. 241-268. Chapters 5-6.
Woodland, P. C. Speaker Adaptation for Continuous Density HMMs: A Review. ITRW on Adaptation Methods for Speech Automatic Recognition, 2001, pp. 11-18.

2.22. Confidence measures for ASR

H. Jiang. Confidence measures for speech recognition: A survey. Speech communication, 45(4):455-470, 2005.
T. Schaaf and T. Kemp. Confidence measures for spontaneous speech recognition. In IEEE International Conference on Acoustics, Speech, and Signal Processing ICASSP-97, pages 875-878, IEEE, 1997.
JGA Dolfing and A. Wendemuth. Combination of confidence measures in isolated word recognition. In Proceedings of the International Conference on Spoken Language Processing, pages 3237-3240, 1998.
F. Wessel, R. Schluter, K. Macherey, and H. Ney. Confidence measures for large vocabulary continuous speech recognition. IEEE Transactions on Speech and Audio Processing, 9(3):288-298, 2001.
T. Fabian. Confidence Measurement Techniques in Automatic Speech Recognition and Dialog Management. Der Andere Verlag, 2008.

2.23. Features for ASR

Hynek Hermansky: Should recognizers have ears? Speech Communication 25 (1998) 3-27.
Hynek Hermansky's original article in Journal of the Acoustical Society of America 87 (4), 1990.

2.24. Multichannel ASR

D. V. Compernolle, "DSP techniques for speech enhancement", in Proc. ESCA Workshop on Speech Processing in Adverse Conditions, 21-30 (1992)
J. F. Cardoso, "Blind signal separation: statistical principles", Proc. IEEE 9, 2009-2025 (1998)

2.25. Pronunciation model adaptation

Bisani, M., & Ney, H. (2008). Joint-sequence models for grapheme-to-phoneme conversion. Speech communication, 50(5), 434-451.
Maas, A., Xie, Z., Jurafsky, D., & Ng, A. (2015). Lexicon-free conversational speech recognition with neural networks. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (pp. 345-354).
Tools: Sequitur G2P
Materials: WSJ_5k (from exercise 4)

2.26. Audio indexing

Steve Renals, Dave Abberley, David Kirby, Tony Robinson: Indexing and retrieval of broadcast news. Speech Communication, Volume 32, Issues 1-2, September 2000, Pages 5-20.
John S. Garofolo, Cedric GP Auzanne, Ellen M. Voorhees: The TREC Spoken Document Retrieval Track: A Success Story. In 8th Text Retrieval Conference, pages 107-129, Washington, 2000.
Chelba, C.; Hazen, T.J.; Saraclar, M.: Retrieval and browsing of spoken content. Signal Processing Magazine, IEEE, vol.25, no.3, pp.39-49, May 2008.