Section description

  • Good reading material for the ASR course:


    X. D. Huang: Spoken language processing: a guide to theory, algorithm, and system development. Prentice-Hall, 2001

    The HTK Book: http://htk.eng.cam.ac.uk/docs/docs.shtml

    M. Gales and S. Young: The Application of Hidden Markov Models in Speech Recognition. In Foundations and Trends in Signal Processing Vol. 1, No. 3 (2007) 195-304. http://dx.doi.org/10.1561/2000000004


    Additional reading for some special topics in ASR:



    Speaker recognition


    J. P. Campbell, Jr., "Speaker recognition: a tutorial," Proceedings of the IEEE, vol. 85, no. 9, Sept. 1997, pp. 1437-1462.
    DOI 10.1109/5.628714

    Sadaoki Furui: Recent advances in speaker recognition. Pattern Recognition Letters 18(9): 859-872 (1997)

    http://cslu.cse.ogi.edu/HLTsurvey/ch1node9.html


    Speaker adaptation


    Gales, M. J., Young, S. J. The Application of Hidden Markov Models in Speech Recognition. Foundations and Trends in Signal Processing, 2007, vol. 1, no. 3, pp. 241-255.
    http://dx.doi.org/10.1561/2000000004

    Woodland, P. C. Speaker Adaptation for Continuous Density HMMs: A Review. ITRW on Adaptation Methods for Speech Automatic Recognition, 2001, pp. 11-18.
    http://www.isca-speech.org/archive/adaptation/adap_011.html

    Zhirong Wang, Tanja Schultz, and Alex Waibel, "Comparison of Acoustic Model Adaptation Techniques on Non-native Speech," in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '03), 2003.

    Koichi Shinoda, "Speaker Adaptation Techniques for Automatic Speech Recognition," APSIPA ASC 2011, Xi'an. (Tokyo Institute of Technology, Tokyo, Japan.)

    Speech Adaptation

    X. Zhu, G. T. Beauregard, and L. L. Wyse, "Real-time signal estimation from modified short-time Fourier transform magnitude spectra," IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 5, pp. 1645-1653, July 2007.

    H. K. Kathania, W. Ahmad, S. Shahnawazuddin, and A. B. Samaddar, "Explicit pitch mapping for improved children's speech recognition," Circuits, Systems, and Signal Processing, September 2017.

    H. K. Kathania, S. Shahnawazuddin, W. Ahmad, N. Adiga, S. K. Jana, and A. B. Samaddar, "Improving children's speech recognition through time scale modification based speaking rate adaptation," in 2018 International Conference on Signal Processing and Communications (SPCOM), July 2018.

    Audio indexing


    Steve Renals, Dave Abberley, David Kirby, Tony Robinson, "Indexing and retrieval of broadcast news," Speech Communication, Volume 32, Issues 1-2, September 2000, Pages 5-20.

    John S. Garofolo, Cedric G. P. Auzanne, Ellen M. Voorhees, "The TREC Spoken Document Retrieval Track: A Success Story," in 8th Text Retrieval Conference, pages 107-129, Washington, 2000.

    Chelba, C.; Hazen, T. J.; Saraclar, M., "Retrieval and browsing of spoken content," IEEE Signal Processing Magazine, vol. 25, no. 3, pp. 39-49, May 2008.

    http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=4490200&isnumber=4490183

    DNNs for acoustic modeling

    - Kaldi tutorial: Kaldi is currently the most popular ASR toolkit in research. We recommend using it for this project. You don't have to complete the tutorial, but reading it can help you understand some basics.
    - Recent Progresses in Deep Learning based Acoustic Models: A very recent survey on acoustic modeling for ASR
    - Deep Neural Networks for Acoustic Modeling in Speech Recognition: An older survey on acoustic modeling for ASR


    Pytorch-Kaldi for Acoustic Modeling on Finnish Speech


    Suggested Reading:
    - THE PYTORCH-KALDI SPEECH RECOGNITION TOOLKIT: The toolkit to be used in the project
    - Recent Progresses in Deep Learning based Acoustic Models: A very recent survey on acoustic modeling for ASR
    - Deep Neural Networks for Acoustic Modeling in Speech Recognition: An older survey on acoustic modeling for ASR

    Language Modeling for Indian Languages


    Suggested Reading:
    - BUT system for low resource Indian language ASR : This paper uses the same dataset as used in the project.
    - Recurrent neural network based language model: A very popular contemporary neural network language model
    - IMPROVED BACKING-OFF FOR M-GRAM LANGUAGE MODELING: A popular n-gram language modeling technique
    - LSTM Neural Networks for Language Modeling: Another popular contemporary neural network language model
    - Exploring the Limits of Language Modeling: A survey on existing language modeling techniques
    - TheanoLM - An extensible toolkit for neural network language modeling: Neural network language modeling toolkit you will use in your work
    - On Growing and Pruning Kneser–Ney Smoothed N-Gram Models: n-gram language modeling toolkit you will use in your work

    Features for ASR


    Hynek Hermansky, "Should recognizers have ears?", Speech Communication 25 (1998) 3-27.

    Hynek Hermansky, "Perceptual linear predictive (PLP) analysis of speech," Journal of the Acoustical Society of America 87 (4), 1990.


    Confidence measure for ASR


    H. Jiang. Confidence measures for speech recognition: A survey. Speech Communication, 45(4):455-470, 2005.
    T. Schaaf and T. Kemp. Confidence measures for spontaneous speech recognition. In IEEE International Conference on Acoustics, Speech, and Signal Processing ICASSP-97, pages 875-878, IEEE, 1997.

    J. G. A. Dolfing and A. Wendemuth. Combination of confidence measures in isolated word recognition. In Proceedings of the International Conference on Spoken Language Processing, pages 3237-3240, 1998.

    F. Wessel, R. Schlüter, K. Macherey, and H. Ney. Confidence measures for large vocabulary continuous speech recognition. IEEE Transactions on Speech and Audio Processing, 9(3):288-298, 2001.
    T. Fabian. Confidence Measurement Techniques in Automatic Speech Recognition and Dialog Management. Der Andere Verlag, 2008.


    Language recognition


    Li, H. et al. (2013). Spoken Language Recognition: From Fundamentals to Practice. Thorough overview of language recognition.
    Tang, Z. et al. (2018). Phonetic Temporal Neural Model for Language Identification. Sections I.A and I.B provide another short overview.
    Gonzalez-Dominguez, Javier et al. (2014). Automatic language identification using long short-term memory recurrent neural networks. Deep learning approach with DNNs and LSTMs.
    Martínez, David et al. (2011). Language Recognition in iVectors Space. Statistical approach.
    Zissman, M. A. (1996). Comparison of four approaches to automatic language identification of telephone speech. Comparing GMMs and PRLM variants.

    Speech compression for ASR


    http://www.data-compression.com/speech.html

    B. T. Lilly and K. K. Paliwal, Effect of speech coders on speech recognition performance, Griffith University, Australia.

    Juan M. Huerta and Richard M. Stern, Speech recognition from GSM codec parameters, Carnegie Mellon University, USA.

    Dan Chazan, Gilad Cohen, Ron Hoory and Meir Zibulski, Low bit rate speech compression for playback in speech recognition systems, in Proc. Eur. Signal Processing Conf.

    L. Besacier, C. Bergamini, D. Vaufreydaz and E. Castelli, The effect of speech and audio compression on speech recognition performance, In: Proc. IEEE Multimedia Signal Processing Workshop


    Speech recognition in noise


    Gales, M. J., Young, S. J. The Application of Hidden Markov Models in Speech Recognition. Foundations and Trends in Signal Processing, 2007, vol. 1, no. 3, pp. 241-268.  Chapters 5-6.
    http://dx.doi.org/10.1561/2000000004

    Woodland, P. C. Speaker Adaptation for Continuous Density HMMs: A Review. ITRW on Adaptation Methods for Speech Automatic Recognition, 2001, pp. 11-18.
    http://www.isca-speech.org/archive/adaptation/adap_011.html


    Voice activity detection


    J. Ramírez, J. M. Górriz and J. C. Segura. Voice Activity Detection. Fundamentals and Speech Recognition System Robustness
    https://www.intechopen.com/chapter/pdf-download/104

    Christian Uhle, Voice Activity Detection. Chapter 13 in Speech Coding (Tom Bäckström, ed.)
    https://aalto.finna.fi/Record/alli.781859 (available as E-book)

    Speech synthesis

    Speech synthesis resources:
    Prof. Simon King’s tutorial at Interspeech 2017, includes videos (recommended)
    http://www.speech.zone/courses/one-off/merlin-interspeech2017/

    Paul Taylor’s book “Text-to-Speech Synthesis”. Comprehensive reference for HMM-based parametric TTS (but has no neural nets)
    http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.118.5905&rep=rep1&type=pdf

    Simon King’s (less comprehensive) introduction to HMM-based parametric synthesis:
    http://www.cstr.ed.ac.uk/downloads/publications/2010/king_hmm_tutorial.pdf

    Code for projects:
    Open source DNN-based parametric speech synthesis:
    https://github.com/CSTR-Edinburgh/merlin/tree/master/src

    Current state-of-the-art: Tacotron and WaveNet (GPU required)
    https://google.github.io/tacotron/publications/tacotron2/
    Code:
    https://github.com/keithito/tacotron
    https://github.com/r9y9/wavenet_vocoder


    Multichannel ASR


    D. V. Compernolle, "DSP techniques for speech enhancement", in Proc. ESCA Workshop on Speech Processing in Adverse Conditions, 21-30 (1992)

    http://www.esat.kuleuven.be/psi/spraak/cgi-bin/get_file.cgi?/ps_reconstruct/compi_netrw92.ps&ps&dvc:wspac92

    J. F. Cardoso, "Blind signal separation: statistical principles", Proceedings of the IEEE, vol. 86, no. 10, pp. 2009-2025 (1998)
    http://perso.telecom-paristech.fr/~cardoso/Papers.PDF/ProcIEEE.pdf


    Animation of HMMs


    This topic has no external material, but you should study the basic HMM algorithms (Forward-Backward and Viterbi) carefully. Check them in several sources, e.g.:
    - my lecture slides from the second Wednesday's lecture
    - Manning & Schutze: pp. 326-327
    - Jurafsky & Martin (1st edition): pp. 843-849
    - Gales & Young: pp. 205-206 (DOI 10.1561/2000000004)
    - Rabiner: "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition" pp. 262-263
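    While studying the sources above, it can help to trace the recursion in code. Below is a minimal NumPy sketch of the Viterbi algorithm in the log domain; the variable names and the toy HMM in the usage example are our own illustration, not taken from any of the readings:

    ```python
    import numpy as np

    def viterbi(log_init, log_trans, log_obs):
        """Most likely state path of an HMM (all inputs in the log domain).

        log_init:  (S,)   log initial state probabilities
        log_trans: (S, S) log transition probabilities, trans[i, j] = P(j | i)
        log_obs:   (T, S) log observation likelihoods per time step and state
        """
        T, S = log_obs.shape
        delta = log_init + log_obs[0]        # best log-prob of a path ending in each state
        psi = np.zeros((T, S), dtype=int)    # backpointers
        for t in range(1, T):
            scores = delta[:, None] + log_trans   # (S, S): from-state x to-state
            psi[t] = scores.argmax(axis=0)        # best predecessor for each state
            delta = scores.max(axis=0) + log_obs[t]
        # Backtrack from the best final state
        path = [int(delta.argmax())]
        for t in range(T - 1, 0, -1):
            path.append(int(psi[t][path[-1]]))
        return path[::-1], float(delta.max())
    ```

    The Forward algorithm has the same structure with `max`/`argmax` replaced by a log-sum-exp over predecessors, which is a good exercise to verify against the sources.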



    Pronunciation model adaptation

    Bisani, M., & Ney, H. (2008). Joint-sequence models for grapheme-to-phoneme conversion. Speech communication, 50(5), 434-451.

    Maas, A., Xie, Z., Jurafsky, D., & Ng, A. (2015). Lexicon-free conversational speech recognition with neural networks. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (pp. 345-354).

    Tools: Sequitur G2P
    Materials: WSJ_5k (from exercise 4)

    Automatic detection of alcohol intoxication

    Wang, W. Y., Biadsy, F., Rosenberg, A., & Hirschberg, J. (2013). Automatic detection of speaker state: Lexical, prosodic, and phonetic approaches to level-of-interest and intoxication classification. Computer Speech & Language, 27(1), 168-189.

    Schuller, B., Steidl, S., Batliner, A., Schiel, F., Krajewski, J., Weninger, F., & Eyben, F. (2014). Medium-term speaker states—A review on intoxication, sleepiness and the first challenge. Computer Speech & Language, 28(2), 346-374.

    Tools: OpenSMILE, Anaconda?
    Materials: Alcohol Language Corpus

    Comparing subword language models

    Hirsimäki, T., Creutz, M., Siivola, V., Kurimo, M., Virpioja, S., & Pylkkönen, J. (2006). Unlimited vocabulary speech recognition with morph language models applied to Finnish. Computer Speech & Language, 20(4), 515-541.

    Tools: SRILM, VariKN, Kaldi, Morfessor
    Materials: STT, Kielipankki, Wikipedia


    Peter Smit: Modern subword-based models for automatic speech recognition (2019). PhD thesis. Aalto University. https://aaltodoc.aalto.fi/handle/123456789/38073

    Sami Virpioja: Learning Constructions of Natural Language: Statistical Models and Evaluations (2012). PhD thesis. Aalto University. https://aaltodoc.aalto.fi/handle/123456789/7294


    Chatbots

    Zhang, Saizheng, et al. "Personalizing Dialogue Agents: I have a dog, do you have pets too?." Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2018.

    Dinan, Emily, et al. "The second conversational intelligence challenge (convai2)." arXiv preprint arXiv:1902.00098 (2019).

    Tools: ParlAI, Hugging Face Transformers
    Materials: PersonaChat, OpenSubtitles

    https://medium.com/huggingface/how-to-build-a-state-of-the-art-conversational-ai-with-transfer-learning-2d818ac26313
    https://parl.ai/

    https://convai.io/

    Connectionist temporal classification

    Awni Hannun, "Sequence Modeling With CTC", https://distill.pub/2017/ctc/

    Alex Graves, "Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks", https://www.cs.toronto.edu/~graves/icml_2006.pdf
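    A concrete piece of CTC covered in both readings is the collapsing rule that maps a frame-level label path to an output sequence: merge repeated labels, then remove blanks. A minimal sketch (the function name and the choice of 0 as the blank index are our own):

    ```python
    def ctc_collapse(labels, blank=0):
        """Collapse a frame-level CTC path: merge repeats, then drop blanks."""
        out = []
        prev = None
        for label in labels:
            if label != prev and label != blank:
                out.append(label)
            prev = label
        return out
    ```

    Note that a blank between two identical labels prevents them from merging, which is how CTC can emit repeated characters.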

    Attention-based encoder-decoder end-to-end ASR

    Chan et al. Listen, attend, and spell
    Chorowski et al. Attention-based models for speech recognition
    Chiu et al. State-of-the-Art Speech Recognition with Sequence-to-Sequence Models
    https://ieeexplore.ieee.org/abstract/document/8462105
    Lüscher et al. RWTH ASR Systems for LibriSpeech: Hybrid vs Attention
    https://www.isca-speech.org/archive/Interspeech_2019/pdfs/1780.pdf
    Watanabe et al. ESPnet: End-to-end speech processing toolkit


    Data Augmentation

    Tom Ko, Vijayaditya Peddinti, Daniel Povey, Sanjeev Khudanpur, "Audio Augmentation for Speech Recognition", INTERSPEECH 2015.

    Mengjie Qian, Ian McLoughlin, Wu Guo, Lirong Dai, "Mismatched training data enhancement for automatic recognition of children's speech using DNN-HMM", 10th International Symposium on Chinese Spoken Language Processing (ISCSLP) 2016. https://ieeexplore.ieee.org/document/7918386
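    Ko et al. build their augmentation mainly on speed perturbation: resampling the waveform by factors such as 0.9 and 1.1, which changes both duration and pitch. The paper itself applies sox; as a rough illustration of the idea only, here is a linear-interpolation sketch in NumPy (function name and resampler are our own simplification, not the paper's implementation):

    ```python
    import numpy as np

    def speed_perturb(signal, factor):
        """Resample a waveform so it plays `factor` times faster.

        factor > 1 shortens the signal, factor < 1 lengthens it;
        both duration and pitch change, as in speed perturbation.
        """
        n_out = int(round(len(signal) / factor))
        old_idx = np.arange(len(signal))
        new_idx = np.linspace(0, len(signal) - 1, n_out)
        return np.interp(new_idx, old_idx, signal)
    ```

    In the paper's setup, each training utterance is kept together with its 0.9x and 1.1x copies, tripling the training data.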

    Native Language Recognition

    Baseline paper: The INTERSPEECH 2016 Computational Paralinguistics Challenge: Deception, Sincerity & Native Language https://www.isca-speech.org/archive/Interspeech_2016/pdfs/0129.PDF
    Winner paper: Exploiting phone log-likelihood ratio features for the detection of the native language of non-native English speakers (https://www.isca-speech.org/archive/Interspeech_2016/pdfs/1491.PDF)
    Other papers available here: https://www.isca-speech.org/archive/Interspeech_2016/ (see the Special Session blocks "Interspeech 2016 Computational Paralinguistics Challenge (ComParE): Deception, Sincerity & Native Language")

    Deep Denoising Autoencoder for Speech Enhancement

    Speech Enhancement Based on Deep Denoising Autoencoder, https://www.isca-speech.org/archive/archive_papers/interspeech_2013/i13_0436.pdf
    Exploring multi-channel features for denoising-autoencoder-based speech enhancement (if you decide to use multichannel data)
    Joint training of front-end and back-end deep neural networks for robust speech recognition
    Tensorflow implementation: https://github.com/jonlu0602/DeepDenoisingAutoencoder

    Language model adaptation

    N-gram models adaptation techniques:
    TKK Master's thesis by Simo Broman pp. 9-41
    http://cis.legacy.ics.tkk.fi/sbroman/dippa.pdf
    Master's thesis by Andre Mansikkaniemi pp. 47-53
    http://lib.tkk.fi/Dipl/2010/urn100143.pdf
    Statistical language model adaptation: review and perspectives
    http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.91.4893&rep=rep1&type=pdf

    Neural language model adaptation techniques:
    Approaches for Neural-Network Language Model Adaptation
    https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/46439.pdf 
    Recurrent Neural Network Language Model Adaptation for Conversational Speech Recognition
    http://www.danielpovey.com/files/2018_interspeech_lm_adapt.pdf