[Elsnet-list] INTERSPEECH 2014 - Special Sessions

Organization @ Interspeech 2014 organization at interspeech2014.org
Mon Feb 10 05:20:11 CET 2014

--- September 14-18, 2014
--- http://www.INTERSPEECH2014.org

INTERSPEECH is the world's largest and most comprehensive conference on 
issues surrounding
the science and technology of spoken language processing, both in humans 
and in machines.
The theme of INTERSPEECH 2014 is

--- Celebrating the Diversity of Spoken Languages ---

INTERSPEECH 2014 includes a number of special sessions covering 
interdisciplinary topics
and/or important new emerging areas of interest related to the main 
conference topics.
Special sessions proposed for the forthcoming edition are:

• A Re-evaluation of Robustness
• Deep Neural Networks for Speech Generation and Synthesis
• Exploring the Rich Information of Speech Across Multiple Languages
• INTERSPEECH 2014 Computational Paralinguistics ChallengE (ComParE)
• Multichannel Processing for Distant Speech Recognition
• Open Domain Situated Conversational Interaction
• Phase Importance in Speech Processing Applications
• Speaker Comparison for Forensic and Investigative Applications
• Text-dependent for Short-duration Speaker Verification
• Tutorial Dialogues and Spoken Dialogue Systems
• Visual Speech Decoding

A description of each special session is given below.
For paper submission, please follow the main conference procedure and 
choose the Special Session track when selecting your paper area.

Paper submission procedure is described at:

For more information, feel free to contact the Special Session Chair,
Dr. Tomi H. Kinnunen, at email tkinnu [at]cs.uef.fi

Special Session Description

A Re-evaluation of Robustness

The goal of the session is to facilitate a re-evaluation of robust speech
recognition in the light of recent developments. It’s a re-evaluation at 
two levels:
• A re-evaluation of perspective, prompted by the performance breakthroughs 
of Deep Neural Networks, which lead to a fresh questioning of the role and
contribution of robust feature extraction.
• A literal re-evaluation on common databases, to be able to present and 
compare the performance of different algorithms and system approaches to robustness.
Paper submissions are invited on the theme of noise robust speech 
recognition. Authors are required to submit results on the Aurora 4 database to facilitate 
cross-comparison
of the performance of different techniques.

Recent developments raise interesting research questions, and the 
session aims to help
progress by bringing focus to, and exploration of, these issues. For example:
1. What role is there for signal processing to create feature 
representations to use as
inputs to Deep Learning or can deep learning do all the work?
2. What feature representations can be automatically learnt in a deep 
learning architecture?
3. What other techniques can give great improvement in robustness?
4. What techniques don’t work and why?

The session organizers wish to encourage submissions that bring insight 
and understanding to
the issues highlighted above. Authors are requested not only to present 
absolute performance
of the whole system but also to highlight the contribution made by 
various components in a
complex system.

Papers that are accepted for the session are encouraged to also evaluate 
their techniques on new test
data sets (available in July) and submit their results at the end of August.

Session organization
The session will be structured as a combination of
1. Invited talks
2. Oral paper presentations
3. Poster presentations
4. Summary of contributions and results on newly released test sets
5. Discussion

David Pearce, Audience, dpearce [at]audience.com
Hans-Guenter Hirsch, Niederrhein University of Applied Sciences, 
hans-guenter.hirsch [at]hs-niederrhein.de
Reinhold Haeb-Umbach, University of Paderborn, haeb [at]nt.uni-paderborn.de
Michael Seltzer, Microsoft, mseltzer [at]microsoft.com
Keikichi Hirose, The University of Tokyo, hirose [at]gavo.t.u-tokyo.ac.jp
Steve Renals, University of Edinburgh, s.renals [at]ed.ac.uk
Sim Khe Chai, National University of Singapore, simkc [at]comp.nus.edu.sg
Niko Moritz, Fraunhofer IDMT, Oldenburg, niko.moritz [at]idmt.fraunhofer.de
K K Chin, Google, kkchin [at]google.com

Deep Neural Networks for Speech Generation and Synthesis

This special session aims to bring together researchers who work 
actively on deep neural
networks for speech research, particularly in generation and synthesis, 
to promote and
better understand state-of-the-art DNN research in statistical 
learning and to compare
results with the parametric HMM-GMM model, which is 
well established for speech synthesis,
generation, and conversion. A DNN, with its neuron-like structure, can 
simulate the human speech
production system in a layered, hierarchical, nonlinear and 
self-organized network.
It can transform linguistic text information into intermediate semantic, 
phonetic and prosodic
content and finally generate speech waveforms. Many possible neural 
network architectures or
topologies exist, e.g. feed-forward NNs with multiple hidden layers, 
stacked RBMs or CRBMs, and
Recurrent Neural Nets (RNNs), which have been used for speech/image 
recognition and other applications.
We would like to use this special session as a forum to present updated 
results at the research frontiers,
in algorithm development and in application scenarios. Particular focus 
areas will be
parametric TTS synthesis, voice conversion, speech compression, 
de-noising and speech enhancement.

Yao Qian, Microsoft Research Asia, yaoqian [at]microsoft.com
Frank K. Soong, Microsoft Research Asia, frankkps [at]microsoft.com

Exploring the Rich Information of Speech Across Multiple Languages

Spoken language is the most direct means of communication between human 
beings. However,
speech communication often demonstrates its language-specific 
characteristics because of,
for instance, the linguistic difference (e.g., tonal vs. non-tonal, 
monosyllabic vs. multisyllabic)
across languages. Our knowledge of the diversity of speech 
across languages,
including speech perception and linguistic and non-linguistic (e.g., 
emotion) information, is still limited.
Such knowledge is of great significance for designing 
language-specific applications of
speech techniques (e.g., automatic speech recognition, assistive hearing 
devices) in the future.
This special session will provide an opportunity for researchers from 
various communities
(including speech science, medicine, linguistics and signal processing) 
to stimulate further discussion
and new research in the broad cross-language area, and present their 
latest research on understanding
the language-specific features of speech science and their applications 
in the speech communication of
machines and human beings. This special session encourages contributions 
from all fields of speech science,
e.g., production and perception, but with a focus on presenting 
language-specific characteristics
and discussing their implications, to improve our knowledge of the 
diversity of speech across
multiple languages. Topics of interest include, but are not limited to:
1. characteristics of acoustic, linguistic and language information in 
speech communication across
multiple languages;
2. diversity of linguistic and non-linguistic (e.g., emotion) 
information among multiple spoken languages;
3. language-specific speech intelligibility enhancement and automatic 
speech recognition techniques; and
4. comparative cross-language assessment of speech perception in 
challenging environments.

Junfeng Li, Institute of Acoustics, Chinese Academy of Sciences, 
junfeng.li.1979 [at]gmail.com
Fei Chen, The University of Hong Kong, feichen1 [at]hku.hk

INTERSPEECH 2014 Computational Paralinguistics ChallengE (ComParE)

The INTERSPEECH 2014 Computational Paralinguistics ChallengE (ComParE) 
is an open Challenge
dealing with speaker characteristics as manifested in their speech 
signal's acoustic properties.
This year, it introduces new tasks: the Cognitive Load Sub-Challenge, 
the Physical Load
Sub-Challenge, and a Multitask Sub-Challenge. For these Challenge tasks,
corpora with a high diversity of
speakers and different languages (Australian English and German) 
are provided by the organizers.
All corpora provide fully realistic data in challenging acoustic 
conditions and feature rich
annotation such as speaker meta-data. They are given with distinct 
definitions of test,
development, and training partitions, incorporating speaker independence 
as needed in most
real-life settings. Benchmark results of the most popular approaches are 
provided as in the years before.
Transcription of the train and development sets will be known. All 
Sub-Challenges allow contributors
to find their own features with their own machine learning algorithm. 
However, a standard feature set
will be provided per corpus that may be used. Participants will have to 
stick to the definition of
training, development, and test sets. They may report on results 
obtained on the development set,
but have only five trials to upload their results on the test sets, 
whose labels are unknown to them.
Each participation must be accompanied by a paper presenting the results, 
which undergoes peer review
and has to be accepted for the conference in order to participate in the 
Challenge.
The results of the Challenge will be presented in a Special Session at 
INTERSPEECH 2014 in Singapore.
Further, contributions using the Challenge data or related to the 
Challenge but not competing within
the Challenge are also welcome.

More information is given also on the Challenge homepage:

Björn Schuller, Imperial College London / Technische Universität 
München, schuller [at]IEEE.org
Stefan Steidl, Friedrich-Alexander-University, stefan.steidl [at]fau.de
Anton Batliner, Technische Universität München / 
batliner [at]cs.fau.de
Jarek Krajewski, Bergische Universität Wuppertal, krajewsk 
Julien Epps, The University of New South Wales / National ICT Australia, 
j.epps [at]unsw.edu.au

Multichannel Processing for Distant Speech Recognition

Distant speech recognition in real-world environments is still a 
challenging problem: reverberation
and dynamic background noise are major sources of acoustic 
mismatch that heavily degrade ASR
performance, which, by contrast, can be very good in close-talking 
microphone setups.
In this context, a particularly interesting topic is the adoption of 
distributed microphones for
the development of voice-enabled automated home environments based on 
distant-speech interaction:
microphones are installed in different rooms and the resulting 
multichannel audio recordings capture
multiple audio events, including voice commands or spontaneous speech, 
generated in various locations
and characterized by a variable amount of reverberation as well as 
possible background noise.

The focus of the proposed special session will be on multichannel 
processing for automatic speech recognition (ASR)
in such a setting. Unlike other robust ASR tasks, where static 
adaptation or training with noisy data significantly
improves performance, the distributed microphone scenario requires 
full exploitation of multichannel
information to reduce the highly variable dynamic mismatch. To 
facilitate better evaluation of the proposed
algorithms the organizers will provide a set of multichannel recordings 
in a domestic environment.
The recordings will include spoken commands mixed with other acoustic 
events occurring in different
rooms of a real apartment.
The data is being created within the framework of the EC project DIRHA (Distant 
speech Interaction for Robust
Home Applications),
which addresses the challenges of speech interaction for home automation.

The organizers will release the evaluation package (datasets and 
scripts) on February 17;
the participants are asked to submit a regular paper reporting speech 
recognition results
on the evaluation set and comparing their performance with the provided 
reference baseline.

Further details are available at: http://dirha.fbk.eu/INTERSPEECH2014

Marco Matassoni, Fondazione Bruno Kessler, matasso [at]fbk.eu
Ramon Fernandez Astudillo, Instituto de Engenharia de Sistemas e 
Computadores, ramon.astudillo [at]inesc-id.pt
Athanasios Katsamanis, National Technical University of Athens, nkatsam 

Open Domain Situated Conversational Interaction

Robust conversational systems have the potential to revolutionize our 
interactions with computers.
Building on decades of academic and industrial research, we now talk to 
our computers, phones,
and entertainment systems on a daily basis. However, current technology 
typically limits conversational
interactions to a few narrow domains/topics (e.g., weather, traffic, 
restaurants). Users increasingly want
the ability to converse with their devices over broad web-scale content. 
Finding something on your PC or
the web should be as simple as having a conversation.
A promising approach to address this problem is situated conversational 
interaction. The approach leverages
the situation and/or context of the conversation to improve system 
accuracy and effectiveness.
Sources of context include visual content being displayed to the user, 
geo-location, prior interactions,
multi-modal interactions (e.g., gesture, eye gaze), and the conversation 
itself. For example, while a user
is reading a news article on their tablet PC, they initiate a 
conversation to dig deeper on a particular topic.
Or a user is reading a map and wants to learn more about the history of 
events at mile marker 121.
Or a gamer wants to interact with a game’s characters to find the next 
clue in their quest.
All of these interactions are situated – rich context is available to 
the system as a source of priors/constraints
on what the user is likely to say.
This special session will provide a forum to discuss research progress 
in open domain situated
conversational interactions.
Topics of the session will include:
• Situated context in spoken dialog systems
• Visual/dialog/personal/geo situated context
• Inferred context through interpretation and reasoning
• Open domain spoken dialog systems
• Open domain spoken/natural language understanding and generation
• Open domain semantic interpretation
• Open domain dialog management (large-scale belief state/policy)
• Conversational Interactions
• Multi-modal inputs in situated open domains (speech/text + gesture, 
touch, eye gaze)
• Multi-human situated interactions

Larry Heck, Microsoft Research, larry [at]ieee.org
Dilek Hakkani-Tür, Microsoft Research, dilek [at]ieee.org
Gokhan Tur, Microsoft Research, gokhan [at]ieee.org
Steve Young, Cambridge University, sjy [at]eng.cam.ac.uk

Phase Importance in Speech Processing Applications

In the past decades, the amplitude of the speech spectrum has been considered 
the most important feature in
different speech processing applications, and the phase of the speech signal 
has received less
attention. Recently, several findings have demonstrated the importance of phase 
to the speech and audio processing communities.
The importance of phase estimation along with amplitude estimation in 
speech enhancement,
complementary phase-based features in speech and speaker recognition, and 
phase-aware acoustic
modeling of the environment are the most prominent
reported works, scattered across different communities of speech and audio 
processing. These examples suggest
that incorporating phase information can push the limits of the 
state-of-the-art phase-independent solutions
long employed in different aspects of audio and speech signal 
processing. This Special Session aims
to explore recent advances and methodologies that exploit 
knowledge of signal phase information in different
aspects of speech processing. Without a dedicated effort to bring together 
researchers from different communities,
rapid progress in investigating the usefulness of phase in speech 
processing applications
is difficult to achieve. Therefore, as a first step in this direction, 
we aim to promote "phase-aware
speech and audio signal processing" and to form a community of researchers 
to organize the next steps.
Our initiative is to unify these efforts to better understand the pros 
and cons of using phase and the
feasibility of phase estimation/enhancement in different areas of 
speech processing, including speech
enhancement, speech separation, speech quality estimation, speech and 
speaker recognition,
voice transformation, and speech analysis and synthesis. The goal is to 
promote
phase-based signal processing, study its importance, and 
share interesting findings from different
speech processing applications.
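
As a rough illustration of why phase matters (this sketch is not taken from
any of the works referred to above), the snippet below splits the spectrum of
one synthetic frame into amplitude and phase and resynthesizes it twice: once
with the original phase and once with the phase set to zero. The frame, its
length, and all function names are assumptions made for the example; a plain
DFT from the Python standard library stands in for a real analysis front end.

```python
import cmath
import math


def dft(frame):
    """Discrete Fourier transform of a real-valued frame (direct O(n^2) form)."""
    n = len(frame)
    return [
        sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(frame))
        for k in range(n)
    ]


def idft(spectrum):
    """Inverse DFT; returns the real part of the reconstructed frame."""
    n = len(spectrum)
    return [
        sum(c * cmath.exp(2j * math.pi * k * i / n) for k, c in enumerate(spectrum)).real / n
        for i in range(n)
    ]


def split_polar(spectrum):
    """Separate a complex spectrum into amplitude and phase lists."""
    return [abs(c) for c in spectrum], [cmath.phase(c) for c in spectrum]


def from_polar(amplitudes, phases):
    """Rebuild a complex spectrum from amplitude and phase lists."""
    return [cmath.rect(a, p) for a, p in zip(amplitudes, phases)]


def max_abs_error(a, b):
    return max(abs(x - y) for x, y in zip(a, b))


# Hypothetical test frame: a sinusoid with a non-zero starting phase.
frame = [math.sin(2 * math.pi * 3 * i / 32 + 0.7) for i in range(32)]

amplitudes, phases = split_polar(dft(frame))
with_phase = idft(from_polar(amplitudes, phases))
zero_phase = idft(from_polar(amplitudes, [0.0] * len(amplitudes)))

err_with_phase = max_abs_error(frame, with_phase)   # numerically negligible
err_zero_phase = max_abs_error(frame, zero_phase)   # clearly non-zero
```

Both reconstructions share exactly the same amplitude spectrum; only the one
that keeps the phase recovers the original frame.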

Pejman Mowlaee, Graz University of Technology, pejman.mowlaee [at]tugraz.at
Rahim Saeidi, University of Eastern Finland, rahim.saeidi [at]uef.fi
Yannis Stylianou, Toshiba Labs Cambridge UK / University of Crete, 
yannis [at]csd.uoc.gr

Speaker Comparison for Forensic and Investigative Applications

In speaker comparison, speech/voice samples are compared by humans 
and/or machines
for use in investigation or in court to address questions that are of 
interest to the legal system.
Speaker comparison is a high-stakes application that can change people’s 
lives and it demands the best
that science has to offer; however, methods, processes, and practices 
vary widely.
These variations are not necessarily for the better and, though 
recognized, are not generally appreciated
and acted upon. Methods, processes, and practices grounded in science 
are critical for the proper application
(and non-application) of speaker comparison to a variety of 
international investigative and forensic applications.
This special session will contribute to scientific progress through 1) 
understanding speaker comparison
for investigative and forensic application (e.g., describe what is 
currently being done and critically
analyze performance and lessons learned); 2) improving speaker 
comparison for investigative and forensic
applications (e.g., propose new approaches/techniques, understand the 
limitations, and identify challenges
and opportunities); 3) improving communications between communities of 
researchers, legal scholars,
and practitioners internationally (e.g., directly address some central 
legal, policy, and societal questions
such as allowing speaker comparisons in court, requirements for expert 
witnesses, and requirements for specific
automatic or human-based methods to be considered scientific); 4) using 
best practices (e.g., reduction of bias
and presentation of evidence); 5) developing a roadmap for progress in 
this session and future sessions; and 6)
producing a documented contribution to the field. Some of these 
objectives will need multiple sessions
to fully achieve and some are complicated due to differing legal systems 
and cultures.
This special session builds on previous successful special sessions and 
tutorials in forensic applications
of speaker comparison at INTERSPEECH beginning in 2003. Wide 
international participation is planned,
including researchers from the ISCA SIGs for the Association Francophone 
de la Communication Parlée (AFCP)
and the Speaker and Language Characterization (SpLC).

Joseph P. Campbell, PhD, MIT Lincoln Laboratory, jpc [at]ll.mit.edu
Jean-François Bonastre, l'Université d'Avignon, jean-francois.bonastre 

Text-dependent for Short-duration Speaker Verification

In recent years, speaker verification engines have reached maturity and 
have been deployed in
commercial applications. The ergonomics of such applications are especially 
demanding and impose
a drastic limit on speech duration during authentication. 
A well-known tactic to address
the resulting lack of data is 
text dependency. However, recent breakthroughs
in accuracy and robustness achieved in 
text-independent speaker verification
do not carry over to text-dependent applications, because the large development 
data sets required by the recent
approaches are not available in the text-dependent context. The purpose 
of this special session is
to gather the research efforts from both academia and industry toward a 
common goal of establishing
a new baseline and exploring new directions for text-dependent speaker 
verification.
The focus of the session is on robustness with respect to duration and 
modeling of lexical information.
To support the development and evaluation of text-dependent speaker 
verification technologies,
the Institute for Infocomm Research (I2R) has recently released the 
RSR2015 database,
including 150 hours of data recorded from 300 speakers. Papers 
submitted to the special
session are encouraged, but not required, to provide results based on the 
RSR2015 database
in order to enable comparison of algorithms and methods. For this 
purpose, the organizers strongly
encourage the participants to report performance on the protocol 
delivered with the database
in terms of EER and minimum cost (in the sense of NIST 2008 Speaker 
Recognition evaluation).
To get the database, please contact the organizers.
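
For readers unfamiliar with the two metrics named above, the following is a
minimal sketch of how EER and a minimum detection cost can be computed from
lists of target and non-target trial scores. It is not the scoring tool
delivered with the database. The function names and toy scores are
hypothetical, and the default cost parameters (C_miss = 10, C_fa = 1,
P_target = 0.01) are an assumption about the NIST 2008 operating point that
should be checked against the official evaluation plan. The cost is reported
unnormalized.

```python
def error_rates(target_scores, nontarget_scores):
    """Return (threshold, miss_rate, false_alarm_rate) for each candidate threshold.

    A trial is accepted when its score is >= the threshold.
    """
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    thresholds.append(float("inf"))  # operating point that rejects every trial
    points = []
    for t in thresholds:
        p_miss = sum(s < t for s in target_scores) / len(target_scores)
        p_fa = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        points.append((t, p_miss, p_fa))
    return points


def equal_error_rate(target_scores, nontarget_scores):
    """Approximate EER: the point where miss and false-alarm rates are closest."""
    _, p_miss, p_fa = min(
        error_rates(target_scores, nontarget_scores),
        key=lambda point: abs(point[1] - point[2]),
    )
    return (p_miss + p_fa) / 2


def min_detection_cost(target_scores, nontarget_scores,
                       c_miss=10.0, c_fa=1.0, p_target=0.01):
    """Minimum over thresholds of C_miss*P_target*P_miss + C_fa*(1-P_target)*P_fa.

    The default parameters are assumed, not taken from the database protocol.
    """
    return min(
        c_miss * p_target * p_miss + c_fa * (1.0 - p_target) * p_fa
        for _, p_miss, p_fa in error_rates(target_scores, nontarget_scores)
    )


# Toy scores, for illustration only.
targets = [2.0, 1.5, 0.5, 3.0]
nontargets = [0.0, 1.0, -1.0, 0.7]
eer = equal_error_rate(targets, nontargets)
min_cost = min_detection_cost(targets, nontargets)
```

With only a handful of trials the EER is a coarse approximation; on a full
protocol the two error curves are usually interpolated around their crossing
point.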

Further details are available at: 

Anthony LARCHER (alarcher [at]i2r.a-star.edu.sg) Institute for Infocomm Research
Hagai ARONOWITZ (hagaia [at]il.ibm.com) IBM Research – Haifa
Kong Aik LEE (kalee [at]i2r.a-star.edu.sg) Institute for Infocomm Research
Patrick KENNY (patrick.kenny [at]crim.ca) CRIM – Montréal

Tutorial Dialogues and Spoken Dialogue Systems

The growing interest in educational applications that use spoken 
interaction and dialogue technology has boosted
research and development of interactive tutorial systems. In 
recent years, advances have been achieved
in both the spoken dialogue community and the education research community, with 
sophisticated speech and multi-modal
technology that allows functionally suitable and reasonably robust 
applications to be built.

The special session combines spoken dialogue research, interaction 
modeling, and educational applications,
and brings together the two INTERSPEECH SIG communities: SLaTE and 
SIGdial. The session focuses
on methods, problems and challenges that are shared by both communities, 
such as sophistication
of speech processing and dialogue management for educational 
interaction, integration of the models
with theories of emotion, rapport, and mutual understanding, as well as 
application of the techniques
to novel learning environments, robot interaction, etc. The session aims 
to survey issues related
to the processing of spoken language in various learning situations, 
modeling of the teacher-student
interaction in MOOC-like environments, as well as evaluating tutorial 
dialogue systems from
the point of view of natural interaction, technological robustness, and 
learning outcome.
The session encourages interdisciplinary research and submissions 
related to the special focus
of the conference, "Celebrating the Diversity of Spoken Languages".

For further information, see http://junionsjlee.wix.com/INTERSPEECH

Maxine Eskenazi, max+ [at]cs.cmu.edu
Kristiina Jokinen, kristiina.jokinen [at]helsinki.fi
Diane Litman, litman [at]cs.pitt.edu
Martin Russell, M.J.RUSSELL [at]bham.ac.uk

Visual Speech Decoding

Speech perception is a bi-modal process that takes into account both 
acoustic (what we hear)
and visual (what we see) speech information. It has been widely 
acknowledged that visual cues play
a critical role in automatic speech recognition (ASR), especially when 
audio is corrupted by,
for example, background noise or voices from untargeted speakers, or 
is even inaccessible.
Decoding visual speech is highly important if ASR technologies are to 
be widely deployed
to realize truly natural human-computer interactions. Despite the 
advances in acoustic ASR,
visual speech decoding remains a challenging problem.
The special session aims to attract more effort to tackle this important 
problem. In particular,
we would like to encourage researchers to focus on some critical 
questions in the area.
As a starting point, we propose the following four questions:
1. How to deal with the speaker dependency in visual speech data?
2. How to cope with the head-pose variation?
3. How to encode temporal information in visual features?
4. How to automatically adapt the fusion rule when the quality of the 
two individual (audio and visual)
modalities varies?

Researchers and participants are encouraged to raise more questions 
related to visual speech decoding.
We expect the session to draw a wide range of attention from both the 
speech recognition and machine vision
communities to the problem of visual speech decoding.

Ziheng Zhou, University of Oulu, ziheng.zhou [at]ee.oulu.fi
Matti Pietikäinen, University of Oulu, matti.pietikainen [at]ee.oulu.fi
Guoying Zhao, University of Oulu, gyzhao [at]ee.oulu.fi
