[Elsnet-list] Five PhD studentships in Speech Technology at Edinburgh

Steve Renals s.renals at ed.ac.uk
Thu Nov 3 17:57:34 CET 2011

Five fully-funded PhD studentships in speech and language processing are available at the Centre for Speech Technology Research, University of Edinburgh. The expected starting date for these studentships is September 2012.

One PhD studentship is supported by the EPSRC Natural Speech Technology Project, three are supported by the JST CREST uDialogue Project, and one will be supported by industrial funding. 

The projects cover a wide variety of topics in the areas of speech synthesis, speech recognition, language modelling, and spoken dialogue processing. They all include exciting opportunities to work with our other project partners, in the UK or Japan.

Speech synthesis topics include reactive statistical parametric speech synthesis in which various conversational acoustic and verbal cues are controllable, new acoustic models for statistical parametric speech synthesis inspired by recent innovations such as subspace modelling and deep learning, prosody modelling beyond the sentence for audio book tasks, and expressive speech synthesis.

Speech recognition topics focus on multi-lingual speech recognition, in particular language modelling approaches that could share parameters across languages.

Spoken dialogue processing topics include implicit spoken dialogue systems that do not require the full attention of the user and that learn when to intervene in a conversation, structural learning of spoken dialogue contents, and crowd-sourcing for learning dialogue content.

Full descriptions of the topics can be found below. 

Suitable candidates will have a good first degree in a relevant discipline and a strong interest in speech processing, machine learning, statistics, cognitive science, linguistics, informatics, engineering, mathematics, or a related area. A relevant Master's degree is desirable but not essential.

Potential applicants are encouraged to contact Simon King, Steve Renals, or Junichi Yamagishi to discuss the topics. Contact details can be found at http://www.cstr.ed.ac.uk/people . For information on the formal application process, please see http://www.cstr.ed.ac.uk/opportunities/phd.html and http://www.ed.ac.uk/schools-departments/informatics/postgraduate/apply/overview .

All topics are flexible and we welcome applicants with their own original ideas.

The anticipated start date is September 2012 but this is also flexible; earlier start dates are possible.  

Applications submitted before 16 December 2011 are preferred.



The goal of this project is to statistically model not only how speech sounds but also how it is produced. This will be done by developing models with 'deep architectures'.

We have already developed a two-layer time-series statistical model of speech and have applied it to the joint modelling of spectral and articulatory features, including tongue movements captured using electromagnetic articulography. Using this model, the generated synthetic speech can be explicitly controlled via articulation.

In order to incorporate additional knowledge into the acoustic modelling of speech, and to further increase the controllability of speech synthesis, this project will develop a deeply layered model, inspired by human speech production and perception, with layers corresponding not only to articulatory features but also to other meaningful features such as vocal fold vibrations, captured via images of the glottis.

Other layers might incorporate rich linguistic knowledge, factors intrinsic to the speech or speaker such as dialect, or external factors such as the signal-to-noise ratio or perceptual masking effects, thus creating synthesisers that can be controlled in response to the listening situation. Such an approach raises a number of scientific questions, including how to acquire and parameterise features for deep architectures, how to train the models and the structures between the layers, and how to represent and apply prior knowledge.

This project will include an industrial internship.

Contact: Dr Junichi Yamagishi



The project will develop acoustic models of speech and methods for natural language generation that reflect the various causes and cues in conversational speech and allow greater control for speech synthesis.

Currently, statistical parametric speech synthesis is trained on speech data that mainly comprises read news sentences. However, speech synthesis in dialogue systems requires a more conversational style. Synthetic speech created from read-text models sounds quite unlike genuine conversational speech in terms of acoustic properties, linguistic construction, and the conversational markers that are produced in response to the conversation partner. Conversational speech is not only more casually articulated but also contains many interesting effects such as hesitations, prolongations, and filled pauses. These are thought to assist the listener and lead to a more effective conversational flow. To create such reactive conversational synthetic speech, this project will consider both the acoustic and language models, incorporating new factors that are currently missing from read-text speech synthesisers.

This project is supported by the Japan Science and Technology Agency (JST) and includes an extended visit to the Nagoya Institute of Technology in Japan, our partner in the 'uDialogue' project, which is described below.

Contact: Dr Junichi Yamagishi or Prof Simon King



Until recently, the dominant language models for speech recognition were based on n-grams, in which probability models are built over a vocabulary of words, resulting in very high dimensions (frequently 1 million or more). In recent years there has been growing interest in models which use distributed representations of words, for example latent semantic analysis language models and neural network language models. The latter, in particular, have proven to be very attractive. In this project we plan to explore models in which the distributed representation is automatically learned, enabling words to be embedded in what may be considered a semantic space. We are particularly interested in investigating approaches based on deep neural networks, on hierarchical Bayes, and on ideas from factorised language models such as Model M.
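To make the n-gram baseline mentioned above concrete, here is a minimal sketch of a bigram language model with add-one smoothing; the toy corpus and all names are illustrative assumptions, not part of the project. This is the kind of count-based model over a word vocabulary that distributed-representation approaches aim to improve on:

```python
from collections import Counter

# Toy corpus; real vocabularies reach a million words or more.
corpus = "the cat sat on the mat the dog sat on the rug".split()

# Count bigrams and unigrams over the corpus.
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)
vocab = set(corpus)

def bigram_prob(w1, w2):
    """P(w2 | w1) with add-one (Laplace) smoothing over the vocabulary."""
    return (bigrams[(w1, w2)] + 1) / (unigrams[w1] + len(vocab))

# 'sat' follows 'cat' in the data, so it is more probable after 'cat'
# than an unseen continuation such as 'rug'.
print(bigram_prob("cat", "sat") > bigram_prob("cat", "rug"))  # True
```

The smoothing guarantees a proper distribution (the probabilities over the vocabulary sum to one) even for unseen word pairs, which is exactly the sparsity problem that motivates distributed representations.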

We are interested in applying these ideas to language modelling for speech recognition in a multilingual context.  How can we bootstrap a language model in a target language given an existing language model in a source language?  Can we use semantic or other distributed representations that are sharable across languages?

This project is supported by the Japan Science and Technology Agency (JST) and includes an extended visit to the Nagoya Institute of Technology in Japan, our partner in the 'uDialogue' project, which is described below.

Contact: Prof Steve Renals



Current dialogue systems do not interact very naturally - for example, they often demand 100% of the user's attention and attempt to interpret and respond to every utterance.  In the long term we would like to develop spoken language systems which learn how to interact in multiparty conversations more naturally, and systems which can learn from their own experience and from other dialogues and dialogue systems.  There are a number of potential PhD topics in this direction.

(a)  Human-computer dialogues can be improved if the computer is able to interpret - and appropriately respond to - the social signals of the talkers, the social context of the conversation, and the content of the conversation.  This requires developing approaches to automatically extract and recognise the social signals (such as positivity/negativity, frustration, and engagement) in a conversation, and to infer the social context of the conversation from these signals, the words that are spoken, and any external metadata (such as location or time of day).

(b) Automatic learning of the structure of dialogues, and using such structure to enable new dialogue scenarios to be constructed from existing dialogues.  This project would aim to develop hierarchical representations of dialogues based on automatically recognised dialogue acts, on patterns of speaker turns and dialogue act usage, and on observed social signals.  Possible applications of such structuring would include: (i) the development of dialogue systems which learn when they are being addressed, or when would be an appropriate moment to make an utterance; and (ii)  the development of new dialogue systems via the automatic reuse of dialogue components based on structural similarity.

This project is supported by the Japan Science and Technology Agency (JST) and includes an extended visit to the Nagoya Institute of Technology in Japan, our partner in the 'uDialogue' project, which is described below.

Contact: Prof Steve Renals or Dr Junichi Yamagishi



Within the last five years, rapid progress has been made in many areas of speech synthesis, and we now have systems which are often as intelligible as natural speech and capable of acceptable naturalness in a very limited range of applications based on "read text". However, prosodically interesting, expressive and engaging speech synthesis remains a major challenge. This project will build on our recent work in creating speech synthesis that is controllable in terms of spectral characteristics, but will focus on the prosodic aspects of the synthetic speech. One goal is expressive synthesis for creating audio books from text. A number of potential topics and directions of research are possible in this area. Two of these are given below, but we welcome applicants with other suggestions.

(a) Wide-context textual features for speech synthesis. Most current systems ignore text features beyond the current sentence, but in a discourse there are rich features waiting to be exploited. Whilst a full semantic analysis of discourse remains very challenging, shallow processing techniques, possibly including unsupervised approaches, may discover sufficient features to significantly improve the expressivity and appropriateness of synthetic speech for specific tasks such as audiobook reading.
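As one hedged sketch of a shallow, beyond-the-sentence feature (the feature definition, stopword list, and example text below are our own illustrative assumptions, not taken from the project description), a sentence could be scored by the fraction of content words it introduces for the first time in the discourse; first mentions are plausible candidates for prosodic emphasis:

```python
# Illustrative stopword list; a real system would use a fuller one.
STOPWORDS = {"the", "a", "an", "of", "and", "was", "in", "it", "to"}

def novelty_scores(sentences):
    """For each sentence, the fraction of its content words not yet
    seen earlier in the discourse (1.0 = all words are first mentions)."""
    seen = set()
    scores = []
    for sent in sentences:
        words = [w.lower().strip(".,") for w in sent.split()]
        content = [w for w in words if w not in STOPWORDS]
        new = [w for w in content if w not in seen]
        scores.append(len(new) / len(content) if content else 0.0)
        seen.update(content)
    return scores

text = [
    "The old lighthouse stood on the cliff.",
    "The lighthouse keeper lit the lamp.",
    "The lamp burned all night.",
]
print(novelty_scores(text))  # novelty falls as the discourse progresses
```

Such a feature needs no semantic analysis at all, yet already distinguishes given from new information across sentence boundaries, which is one way shallow processing might inform prosody.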

(b) Expressivity control. Whilst more expressive synthetic speech can be simulated through the use of recordings of acted speech in the required style, this offers no explicit control over the output. The goal of this topic would be to develop a deeper and more structured model of 'expressivity' which enables external, parametric control over individual aspects of the output speech, including its prosody.

This project is supported by the EPSRC Programme Grant 'Natural Speech Technology', described below, and includes the opportunity to collaborate with and visit our project partners at the Universities of Sheffield and Cambridge and the possibility of testing these new approaches to expressive speech synthesis in home-care and assistive communication aid applications.

Contact: Prof Simon King


About the uDialogue project

uDialogue is a joint project with the Nagoya Institute of Technology in Japan, funded by the Japan Science and Technology Agency (JST).  The overall goal of uDialogue is the development of spoken dialogue systems based on user-generated content, and the project includes research on speech synthesis, speech recognition, and spoken dialogue. Each PhD studentship is of four years' duration and includes a 6-month internship at the Nagoya Institute of Technology.



About the Natural Speech Technology project

Natural Speech Technology (NST) is an EPSRC Programme Grant that aims to significantly advance the state of the art in speech technology by making it more natural, approaching human levels of reliability, adaptability and conversational richness. NST is a collaboration between the Centre for Speech Technology Research (CSTR) at the University of Edinburgh, the Speech Group at the University of Cambridge, and the Speech and Hearing Research Group (SpandH) at the University of Sheffield.



The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.
