[Elsnet-list] Call for participation DEFT'08

Martine Hurault-Plantet Martine.Hurault-Plantet at limsi.fr
Wed Jan 2 15:37:23 CET 2008


******************************************************************
DEFT'08 Call for participation

Evaluation workshop in text mining:
Text classification by topic and by genre.

http://deft08.limsi.fr/
Registration: http://deft08.limsi.fr/inscription.php

******************************************************************
Important dates:

Registration: from December 21, 2007
Training corpora: January 14, 2008
Test: three days during the last two weeks of March 2008
Workshop: June 9-13, during the TALN'08 conference, Avignon, France
(TALN: Traitement Automatique du Langage Naturel)

******************************************************************
The Text Mining Challenge (DEFT, for "Défi Fouille de Texte",
http://deft.limsi.fr/) has run evaluation campaigns in text mining on
French-language texts for the last three years. The 2008 edition addresses
genre and topic variation in automatic classification. Corpora will be in
French.

Automatic classification has many applications within text mining, and
various fields have been explored, from email routing to strategic or
scientific monitoring. In recent years, a new issue has emerged: the
classification of texts by genre. Beyond recognizing a document's topic,
finding its genre helps guide how the document may be used. But how can we
recognize both the topic and the genre of a given document? Is a difference
in genre relevant when recognizing a document's topical category, and,
conversely, is a difference in topic relevant during genre recognition?

To evaluate recognition software from this perspective, we shall consider
simultaneously, for the same predefined set of categories, two corpora in
French of different genres. One is a corpus of press articles from Le Monde
(a daily newspaper), and the other is a corpus of encyclopedic articles from
the French version of Wikipedia, the free online encyclopedia. By 'genre' we
mean a set of texts that have some properties in common, involving their
domain of activity, their writing practices and their medium. A newspaper
article deals with current events, while an encyclopedic article transmits
knowledge, but the two share some general topical categories (called
'sections' in the case of the newspaper). The aim is to test, on these
corpora, first, the robustness of a topic classifier subjected to genre
variation, and second, the possible improvement of topical classification
through recognition of text genre.

Task description
****************
We provide two French corpora for training:

- one with articles from Le Monde (a daily newspaper) and articles from the 
French version of Wikipedia, within a set 'A' of topical categories, with a 
double tagging, both by genre and by topic,

- one with articles from Le Monde and articles from the French version of 
Wikipedia, within a set 'B' of topical categories, different from 'A', and 
whose tagging is only topical.

We will provide two French corpora for the test, with no tagging at all, each
used for a different task:

- task 1: genre and topic recognition of each document from a corpus with 
articles from Le Monde and articles from the French version of Wikipedia, 
within the set 'A' of topical categories,

- task 2: topic recognition of each document from a corpus with articles from 
Le Monde and articles from the French version of Wikipedia, within the 
set 'B' of topical categories.
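
As an illustration only, the corpus and task layout described above can be
sketched as a small data structure. Everything in this sketch is an
assumption made for illustration: the class and function names, the genre
tag strings, and the topic value are hypothetical, and the actual corpus
format and category lists are not specified in this call.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Hypothetical genre tags: the call names the two sources (Le Monde and
# Wikipedia) but does not say which strings are used as tags.
GENRES = ("LeMonde", "Wikipedia")


@dataclass
class Document:
    text: str
    topic: Optional[str] = None  # None when the document carries no topic tag
    genre: Optional[str] = None  # None when the document carries no genre tag


def labels_to_predict(task: int) -> Tuple[str, ...]:
    """Return the labels a system must output for each test document."""
    if task == 1:
        return ("genre", "topic")  # task 1: genre and topic, set 'A'
    if task == 2:
        return ("topic",)          # task 2: topic only, set 'B'
    raise ValueError(f"unknown task: {task}")


def is_valid_training_doc(doc: Document, category_set: str) -> bool:
    """Check that a training document carries the tagging of its set."""
    if category_set not in ("A", "B"):
        raise ValueError(f"unknown category set: {category_set}")
    if doc.topic is None:
        return False               # both training sets are tagged by topic
    if category_set == "A":
        return doc.genre in GENRES  # set 'A': tagged by genre and by topic
    return doc.genre is None        # set 'B': tagging is only topical


# "sport" is a placeholder topic, not one of the actual DEFT'08 categories.
example = Document(text="...", topic="sport", genre="Wikipedia")
```

Under these assumptions, `is_valid_training_doc(example, "A")` holds, while
an untagged test document (`Document(text="...")`) is valid for neither
training set.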

Registration
************
Teams taking part in DEFT'08 should register by filling in the online form
(http://deft08.limsi.fr/inscription.php) and signing the agreements on
restrictions on the use of the corpora.

Committees
**********
Organizing committee:
Martine Hurault-Plantet (LIMSI), Cyril Grouin (LIMSI), Sylvain Loiseau 
(LIMSI), Jean-Baptiste Berthelin (LIMSI), Sarra El Ayari (LIMSI)

Program committee:
Patrick Paroubek (LIMSI),
Catherine Berrut (CLIPS), 
Fabrice Clérot (France Telecom),
Guillaume Cleuziou (LIFO), 
Béatrice Daille (LINA),
Marc El-Bèze (LIA),
Patrick Gallinari (LIP6), 
Eric Gaussier (Xerox Research),
Thierry Hamon (LIPN), 
Fidélia Ibekwe-SanJuan (ELICO),
Pascal Poncelet (LGI2P), 
Christophe Roche (LISTIC), 
Mathieu Roche (LIRMM), 
Pascale Sébillot (IRISA),
Yannick Toussaint (LORIA), 
François Yvon (LIMSI).

