[Elsnet-list] Pascal Challenge on Evaluating Machine Learning for
Information Extraction from Documents
f.ciravegna at dcs.shef.ac.uk
Mon Jun 14 13:16:23 CEST 2004
****apologies for multiple postings****
First Announcement and Call for Participation in the
Pascal Challenge on Evaluating Machine Learning for Information
Extraction from Documents
The Dot.Kom European project and the Pascal Network of Excellence invite
you to participate in the Challenge on Evaluation of Machine Learning
for Information Extraction from Documents. The goal of the challenge is to
assess the current state of Machine Learning (ML) algorithms
for Information Extraction (IE), to identify future challenges and to
foster additional research in the field. Given a corpus of annotated
documents, participants will be expected to perform a number of
tasks, each examining a different aspect of the learning process.
Full description of the challenge can be found at
A standardised corpus of 1100 workshop Calls for Papers (CFPs) will be
provided. 600 of these documents will be annotated with 12 tags that
mark pertinent information (names, locations, dates, etc.). Of the
annotated documents, 400 will be provided to the participants as a
training set; the remaining 200 will form the unseen test set used in
the final evaluation. All documents will be pre-processed to include
tokenisation, part-of-speech and named-entity information.
Full scenario: The only mandatory task for participants is learning to
annotate implicit information: given the 400 training documents, learn
the textual patterns necessary to extract the annotated information.
For pre-competitive tests, each participant will provide the results of a
four-fold cross-validation experiment using the same document partitions.
A final test will be performed on the 200 unseen documents.
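
The four-fold cross-validation over the 400 training documents can be sketched as follows. This is only an illustration under stated assumptions: the document identifiers, the fixed seed and the function name are hypothetical, and the official partitions will be the ones distributed by the organisers.

```python
import random

def four_fold_partitions(doc_ids, n_folds=4, seed=0):
    """Yield (train, test) pairs over n_folds disjoint folds of doc_ids.

    A fixed seed gives every participant the same partitions
    (assumption: the real partitions are supplied by the organisers).
    """
    ids = sorted(doc_ids)
    random.Random(seed).shuffle(ids)
    # Fold k takes every n_folds-th document starting at offset k.
    folds = [ids[k::n_folds] for k in range(n_folds)]
    for k in range(n_folds):
        test = folds[k]
        train = [d for j, fold in enumerate(folds) if j != k for d in fold]
        yield train, test

# Hypothetical identifiers for the 400 annotated training documents.
doc_ids = [f"cfp_{i:04d}" for i in range(400)]
splits = list(four_fold_partitions(doc_ids))
```

Each of the four splits trains on 300 documents and tests on the remaining 100, and every training document is used for testing exactly once.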
Active learning: Learning to select documents: the 400 training
documents will be divided into fixed subsets of increasing size (e.g.
10, 20, 30, 50, 75, 100, 150, and 200). Training on these subsets
will show the effect of limited resources on the learning process.
Secondly, given each subset, the participants can select the documents to
add in order to reach the next size (i.e. 10 to 20, 20 to 30, etc.), thus
showing the ability to select the most suitable set of documents to
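
The two parts of the active learning task (fixed subsets of increasing size, then participant-selected increments) might be sketched as below. All names here are hypothetical, the subsets are assumed to be nested, and the `score` function stands in for whatever selection criterion a participant's system uses (for example, model uncertainty); none of this comes from the official task definition.

```python
import random

SIZES = [10, 20, 30, 50, 75, 100, 150, 200]  # example sizes from the announcement

def nested_subsets(doc_ids, sizes=SIZES, seed=0):
    """Return {size: documents}, where each smaller subset is a prefix of the larger ones."""
    ids = sorted(doc_ids)
    random.Random(seed).shuffle(ids)
    return {n: ids[:n] for n in sizes}

def grow_subset(current, pool, target_size, score):
    """Extend `current` to `target_size` with the highest-scoring unused documents.

    `score` is a hypothetical per-document usefulness estimate.
    """
    chosen = set(current)
    candidates = sorted((d for d in pool if d not in chosen), key=score, reverse=True)
    needed = max(0, target_size - len(current))
    return list(current) + candidates[:needed]

# Hypothetical identifiers for the 400 training documents.
doc_ids = [f"cfp_{i:04d}" for i in range(400)]
subsets = nested_subsets(doc_ids)

# Placeholder score: prefer documents with a higher numeric suffix.
grown = grow_subset(subsets[10], doc_ids, 20, score=lambda d: int(d[-4:]))
```

Training on `subsets[n]` for each size gives the fixed learning curve, while `grow_subset` illustrates the step from one size to the next.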
Enriched Scenario: the same procedure as task 1, except the participants
will be able to use the unannotated part of the corpus (500 documents).
This will show how the use of unsupervised or semi-supervised methods
can improve the results of supervised approaches. An interesting variant
of this task could concern the use of unlimited resources, e.g. the Web.
Participants from different fields such as machine learning, text
mining, natural language processing, etc. are welcome. Participation in
the challenge is free. After registration, participants will receive the
corpus of documents to train on and precise instructions on the
tasks to be performed. At an established date, participants will be
required to submit their systems' answers via a Web portal. An automatic
scorer will compute the accuracy of extraction. Each participant will
have to produce a paper describing the system and the results obtained.
Results of the challenge will be discussed in a dedicated workshop.
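
The announcement does not say how the automatic scorer measures accuracy. As a rough sketch only, one common choice in information extraction is exact-match precision, recall and F-measure over extracted annotations; the tuple layout, tag names and function name below are assumptions for illustration, not the official scorer.

```python
def score_extractions(gold, predicted):
    """Exact-match precision, recall and F1 over annotation tuples.

    Each annotation is assumed to be (doc_id, tag, start, end);
    this layout is hypothetical.
    """
    gold, predicted = set(gold), set(predicted)
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical annotations with made-up tag names.
gold = [
    ("cfp_0001", "workshop_name", 0, 24),
    ("cfp_0001", "workshop_date", 40, 52),
    ("cfp_0001", "workshop_location", 60, 69),
]
predicted = [
    ("cfp_0001", "workshop_name", 0, 24),
    ("cfp_0001", "workshop_date", 41, 52),  # boundary off by one, so no match
]
precision, recall, f1 = score_extractions(gold, predicted)
```

Under exact matching, a predicted span with a wrong boundary counts as both a false positive and a missed gold annotation.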
- 30th June 2004: Registration starts: formal definition of the tasks,
annotated corpus and evaluation server will be made available to
- 15th October 2004: Formal evaluation
- November 2004: Presentation of evaluation at Pascal workshop
Fabio Ciravegna, University of Sheffield, UK (coordinator)
Mary Elaine Califf, Illinois State University, USA
Dayne Freitag, Fair Isaac Technologies, USA
Nicholas Kushmerick, University College Dublin, Ireland
Alberto Lavelli, ITC-Irst, Italy
Local Organizer: Neil Ireson, University of Sheffield.
For further details about the challenge, visit
For general enquiries about the challenge and its motivations, contact
Fabio Ciravegna (F.Ciravegna at dcs.shef.ac.uk). For details about
participation, registration and technical queries, please contact Neil
Ireson (N.Ireson at dcs.shef.ac.uk).
Professor Fabio Ciravegna,
Department of Computer Science, University of Sheffield,
Regent Court, 211 Portobello Street, S1 4DP, Sheffield, UK