[Elsnet-list] Ph.D. position in audiovisual speech, INRIA Nancy, France
Slim.Ouni at loria.fr
Mon Apr 8 12:25:45 CEST 2013
Funded Ph.D. position in computer science, INRIA Nancy, Nancy, France.
Application deadline: 04/05/2013
Starting date: Sept. 2013
Contact: Slim Ouni - Slim.Ouni at loria.fr
More details and to apply:
(to apply, click on "Apply online" button at the end of the web page)
Emotion modeling during expressive audiovisual speech
*** Scientific Context:
In naturally produced speech, the acoustic signal results from the deformation of the vocal tract (the jaw, lips and tongue). Speech communication is essentially multimodal: the acoustic signal carries the auditory modality, the movements of the tongue, jaw and lips form the articulatory modality, and the facial deformation represents the visual modality. This audiovisual communication does not just express the phonetic content of the spoken text; it also conveys a mood or an emotion (joy, sadness, disgust, etc.).
Within the framework of audiovisual speech synthesis, commonly understood as the animation of a 3D virtual talking head synchronously with acoustics, emotion synthesis is treated additively. However, expressive audiovisual speech synthesis combines the facial deformations necessary to produce the articulatory gestures of speech with those required to produce facial expressions [BB08]. Thus, expressivity cannot simply be added to the result of audiovisual synthesis, especially in the case of facial expressions that affect speech. It is a complex process, usually blended with speech articulation, and often expressed subtly rather than through exaggerated emotions.
The goal of this thesis is to study expressivity from the articulatory and visual points of view. The articulatory and facial gestures will be characterized for the different sounds of speech (called phonemes) in different expressive contexts. The aim is to determine how facial expressions interact with speech gestures (lips, tongue, face and acoustics), and how this interaction is embedded within phoneme articulation and its acoustic consequences. The intensity of an expression during the articulation of a given phoneme (or sequence of phonemes) needs to be quantified. One important objective of this work is to develop an expressive control model that describes the interaction of facial expressions with audiovisual speech.
To achieve these goals, two corpora will be acquired using electromagnetic articulography (EMA) and motion capture techniques, synchronously with acoustics. EMA uses electromagnetic sensors glued on the tongue, teeth, lips and possibly the face; each sensor provides a 3D position and two orientation angles. These sensors are used within an articulograph that handles between 12 and 24 sensors. Motion capture retrieves the 3D positions of the lips from the recorded movement of reflective markers tracked by cameras. The corpora will cover sentences pronounced in several emotional contexts. They provide the articulatory trajectories of the tongue and the lips in addition to the acoustic signal. The acquired data will be processed and analyzed, and the control model will be developed based on the results of this analysis.
*** Candidate Profile:
Required qualifications:
Master in computer science.
Good background in modelling, data analysis and machine learning.