Topic outline

  • Introduction

    This is an individual project course within speech and language processing. The extent can be chosen together with the supervisor anywhere between 1 and 10 ECTS. Typically projects are small research and development tasks, including documentation of findings, or literature reviews.

    This course is a good way for

    1. MSc students to gain practical experience with speech and language processing by working on a specific topic
    2. MSc students to scope out potential future master's thesis topics and supervisors
    3. doctoral students to earn ECTS by working on topics that are distinct from but related to their doctoral thesis.

    The teaching objective of the assignments is to practice independent research and project work in a format similar to your future work. This includes, among other things, project planning, searching for information, implementing and developing algorithms, choosing suitable experiments for testing and validation, and writing a research report.

    The choice of topic is free as long as it is about speech and language processing, but it is highly recommended that the topic is either

    1. something in which the student has a particular interest, such as a topic connected to a hobby, work, or an idea for a startup, or
    2. something useful for one of the research groups in speech and language technology. Below is a list of suggested topics from the research groups, each with a contact person.
    As a last resort, if you have trouble choosing a topic, contact one of the teachers.

    To get started

    Choose a topic!
    • If you have a topic of your own, choose a teacher whose interests align with your topic (list below) and contact them by email.
    • If you choose one of the topics below, send an email to the contact person.


    Schedule

    You can start when you have time. Typically projects last 1-2 periods.

    Supervising teachers

    • professor Paavo Alku (interests include analysis of speech production, speech in health technology (e.g. speech-based detection of diseases), signal processing and machine learning in medical analysis of speech)
    • associate professor Tom Bäckström (interests include speech enhancement, privacy, speech in embedded devices, machine learning, voice conversion, and speech coding)
    • assistant professor Lauri Juvela (interests include speech synthesis, machine learning, audio, speech and audio in embedded devices, and differentiable DSP)
    • professor Mikko Kurimo (interests include automatic speech recognition and machine learning)

    Suggested topics

    1. Input bandwidth estimation with differentiable DSP for machine learning with dynamic complexity. Real-world audio equipment and software reproduce very different ranges of the spectrum. For example, cheap microphones can attenuate higher frequencies, so that it is hard to know which parts of the spectrum are available. By estimating the usable frequency range, we can reduce the complexity of machine learning methods by processing only the usable part of the spectrum.
      Contact person: Esteban Gómez or Tom Bäckström

    2. Serverless listening test software with, for example, WebAssembly and the Web Audio API. Available software libraries for crowdsourced listening tests, like webMUSHRA, all require a server, whose implementation and maintenance are cumbersome. The task would be to build at least a proof-of-concept implementation of a listening test that runs in a browser without a server.
      Contact person: Tom Bäckström

    3. Audio watermarking for protection against speech deep-fakes. This is an exploratory study to determine the state of the art in audio watermarking and protection against speech deep-fakes through a literature study, as well as by experimenting with available algorithms. The idea is that watermarks could be made legally mandatory for all deep-fake-like technologies. The objective is to investigate the extent to which this idea is feasible and effective.
      Contact persons: Lauri Juvela and Tom Bäckström

    4. Speech-based biomarking of health with machine learning. In addition to its linguistic content, speech includes extralinguistic information about the speaker's state of health. Therefore, the speaker's state of health can be predicted from the speech signal in a non-invasive manner. Increasing research interest is devoted particularly to detecting Covid-19 or neurodegenerative diseases (such as Parkinson’s disease and Alzheimer’s disease) from speech signals using both classical ML methods (such as SVMs) and more recent deep learning methods. Specific topics (including literature reviews, small-scale experiments, etc.) are available in this health-related research area.
      Contact person: Paavo Alku

    5. Learnable filterbanks: Filterbanks such as Mel or Bark are commonly used as fixed frontend transformations in many audio processing tasks. The goal of this project would be to implement a neural network layer that can be initialized as commonly used filterbanks, but whose weights can be updated through training and hence tailored to a specific audio processing task.
      Contact person: Esteban Gómez or Tom Bäckström

    6. Automatic speech recognition (ASR) and language modeling for spontaneous speech. Most public speech data is either read-aloud texts or scripted broadcasting material. Similarly, most public text data is written material. However, most use cases for ASR are to recognize speech that is neither available as text nor planned ahead. They include conversations, interviews, meetings and interaction with computers, robots or automated services. The work is to study how to use the limited spontaneous speech resources to adapt the existing large speech and language models.
      Contact person: Mikko Kurimo

    7. Automatic speech recognition (ASR) and speaking assessment and feedback for foreign language learners. Most public speech data is spoken by native speakers of the language. However, there are many use cases for ASR where the speakers are non-native or foreign language (L2) learners. They include interviews, meetings, lectures or applications to practise or assess L2 skills. Furthermore, L2 speech can be automatically analysed, and the resulting feedback may be very useful for improving language skills. The work is to study how to use the limited L2 speech resources to adapt the existing large speech and language models to recognize, analyse and compute feedback for L2 learners.
      Contact person: Mikko Kurimo

    8. Design and implementation of red-teaming for LLM-based applications. In this project, you will design and implement a concept for systematic red-teaming of applications built on LLMs, and organize a hackathon focused on red-teaming a selection of applications. The target of the exercise is to design and test a red-teaming concept as a means to identify LLM-related risks and vulnerabilities as part of an enterprise AI governance process.
      Contact persons: Tom Bäckström and Meeri Haataja (Saidot.ai)

    9. Neural speech coding with dynamic bitrate. In the last 5 years, speech codecs based on deep neural networks have come to dominate the field. A central deficiency of such codecs is that the bitrate is a pre-determined constant, whereas in classical DSP-based codecs the bitrate can be chosen freely within some range. In principle, we have to retrain the network to adjust the bitrate. To solve this issue, in this project we study quantization methods that enable dynamic accuracy, applied to state-of-the-art speech codecs.
      Contact person: Tom Bäckström

    10. Measuring trust in conversations. The European Union is actively promoting the design of Trustworthy AI through its publication of ethics guidelines. While this is a worthwhile goal, a major drawback currently is the lack of reliable measures of trust, a problem that NASA is also actively investigating. How do we even know that an AI is trustworthy? We are looking for ways to measure trust in ongoing conversations with neural networks, for example with biosignals, speech, or other means.
      Reference Papers: https://journals.sagepub.com/doi/abs/10.1177/1071181322661147 and https://dl.acm.org/doi/10.1145/3434073.3444677
      Prerequisite: Understanding of deep learning
      Contact persons: Silas Rech and Tom Bäckström

    11. More interactive, natural conversation with voice assistants. ChatGPT has changed the chatbot landscape completely. While OpenAI and other companies have built their own LLMs and integrated them in their voice assistants, there is still a noticeable delay between speech input and response, which severely impairs interaction with a voice assistant. In this project, we explore system designs and latency-reduction methods that improve responsiveness and the naturalness of conversations with voice assistants.
      Reference Paper: https://www.sciencedirect.com/science/article/pii/S0747563221000492
      Prerequisite: Understanding of deep learning and speech processing
      Contact persons: Silas Rech and Tom Bäckström

    12. Implementing and testing real-time speech processing with audio plugins for echo cancellation and side-talk. Most headsets use a feature known as side-talk, where the microphone signal is fed back to the speakers, such that the user (1) hears their own voice in a natural way despite covering their ears and (2) receives feedback that the headset is turned on and working correctly. We would like to experiment with similar functionality on laptops and desktop computers, to determine whether this is useful feedback for the user there as well. The issue is that a loop between the microphone and loudspeaker creates very annoying feedback. Therefore, we also have to implement a real-time echo cancellation module in the system. We envision that this is possible primarily with audio plugins.
      Contact person: Tom Bäckström
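
    To make the idea in topic 1 more concrete, the following is a minimal sketch of estimating the usable frequency range of a single audio frame. It is only an illustration under stated assumptions: it uses a plain energy threshold relative to the spectral peak rather than the differentiable DSP approach the topic proposes, and the function names, the -40 dB default threshold and the naive DFT are hypothetical choices made for this example.

```python
import cmath
import math


def power_spectrum(frame):
    """Naive DFT power spectrum for bins 0..N/2 (adequate for short frames)."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                  for t, x in enumerate(frame))
        spectrum.append(abs(acc) ** 2)
    return spectrum


def estimate_bandwidth_hz(frame, sample_rate, threshold_db=-40.0):
    """Highest frequency (Hz) whose power is within threshold_db of the peak.

    The threshold rule is an assumption for illustration, not a method
    prescribed by the project description.
    """
    power = power_spectrum(frame)
    peak = max(power)
    if peak == 0.0:
        return 0.0  # silent frame: no usable band
    floor = peak * 10.0 ** (threshold_db / 10.0)
    highest_bin = max(k for k, p in enumerate(power) if p >= floor)
    return highest_bin * sample_rate / len(frame)
```

    A downstream model could then, for instance, process only the bins below the estimated frequency. In practice the estimate would need smoothing over many frames, since a single frame says little about the recording chain.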