[Elsnet-list] Pascal Challenge on Evaluating Machine Learning for Information Extraction from Documents

Fabio Ciravegna f.ciravegna at dcs.shef.ac.uk
Mon Jun 14 13:16:23 CEST 2004

****apologies for multiple postings****

First Announcement and Call for Participation in the

Pascal Challenge on Evaluating Machine Learning for Information
Extraction from Documents

The Dot.Kom European project and the Pascal Network of Excellence invite 
you to participate in the Challenge on Evaluation of Machine Learning 
for Information Extraction from Documents. The goal of the challenge is to 
assess the current state of Machine Learning (ML) algorithms 
for Information Extraction (IE), to identify future challenges and to 
foster additional research in the field. Given a corpus of annotated 
documents, the participants will be expected to perform a number of 
tasks, each examining a different aspect of the learning process.
A full description of the challenge can be found at 


A standardised corpus of 1100 Workshop Calls for Papers (CFPs) will be 
provided. 600 of these documents will be annotated with 12 tags that 
relate to pertinent information (names, locations, dates, etc.). Of the 
annotated documents, 400 will be provided to the participants as a 
training set; the remaining 200 will form the unseen test set used in 
the final evaluation. All the documents will be pre-processed to include 
tokenisation, part-of-speech and named-entity information.


Full scenario: The only mandatory task for participants is learning to 
annotate implicit information: given the 400 training documents, learn 
the textual patterns necessary to extract the annotated information. 
Each participant provides results of a four-fold cross-validation 
experiment using the same document partitions for pre-competitive tests. 
A final test will be performed on the 200 unseen documents.
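
The four-fold cross-validation described above can be sketched as follows. 
This is only an illustration: the announcement says that all participants 
use the same document partitions, so the real folds are presumably fixed 
and distributed by the organisers. The document ids, the function names and 
the seed-based shuffle below are assumptions made for the example.

```python
import random

def fixed_folds(doc_ids, n_folds=4, seed=0):
    """Split document ids into n_folds disjoint folds, reproducibly."""
    ids = sorted(doc_ids)             # sort so the result ignores input order
    random.Random(seed).shuffle(ids)  # fixed seed: same partition on every run
    return [ids[i::n_folds] for i in range(n_folds)]

def cross_validation_splits(folds):
    """Yield (train_ids, held_out_ids), holding out one fold at a time."""
    for i, held_out in enumerate(folds):
        train = [d for j, fold in enumerate(folds) if j != i for d in fold]
        yield train, held_out

# Hypothetical ids standing in for the 400 annotated training documents.
doc_ids = [f"cfp_{n:04d}" for n in range(400)]
folds = fixed_folds(doc_ids)
splits = list(cross_validation_splits(folds))
```

With 400 documents and four folds, each run trains on 300 documents and 
tests on the remaining 100.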

Active learning: Learning to select documents: the 400 training 
documents will be divided into fixed subsets of increasing size (e.g. 
10, 20, 30, 50, 75, 100, 150, and 200). The use of the subsets for 
training will show the effect of limited resources on the learning process. 
Secondly, given each subset, the participants can select the documents to 
add in order to reach the next size (i.e. 10 to 20, 20 to 30, etc.), thus 
showing the ability to select the most suitable set of documents to 
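
A minimal sketch of the two parts of the active learning task, under stated 
assumptions: that the fixed subsets are nested (the announcement does not 
say so), and that a participant ranks candidate documents with some scoring 
function of their own. The names nested_subsets, grow_subset and score are 
hypothetical, as are the document ids.

```python
import random

# Example subset sizes quoted in the announcement.
SIZES = [10, 20, 30, 50, 75, 100, 150, 200]

def nested_subsets(doc_ids, sizes=SIZES, seed=0):
    """Fixed subsets of increasing size; each one contains the previous one."""
    ids = sorted(doc_ids)
    random.Random(seed).shuffle(ids)
    return {n: ids[:n] for n in sizes}

def grow_subset(current, pool, target_size, score):
    """Add the highest-scoring unused documents until target_size is reached.

    score is a hypothetical participant-supplied function, for example the
    uncertainty of the current model on a document.
    """
    chosen = set(current)
    candidates = sorted((d for d in pool if d not in chosen),
                        key=score, reverse=True)
    needed = max(0, target_size - len(current))
    return list(current) + candidates[:needed]

doc_ids = [f"cfp_{n:04d}" for n in range(400)]  # hypothetical ids
subsets = nested_subsets(doc_ids)
# Toy score: prefer documents with a higher id number.
selected = grow_subset(subsets[10], doc_ids, 20, score=lambda d: int(d[-4:]))
```

Training on each fixed subset in turn gives a learning curve for limited 
data; comparing it with the curve obtained from participant-selected 
subsets shows how much the selection strategy helps.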

Enriched Scenario: the same procedure as task 1, except the participants 
will be able to use the unannotated part of the corpus (500 documents). 
This will show how the use of unsupervised or semi-supervised methods 
can improve the results of supervised approaches. An interesting variant 
of this task could concern the use of unlimited resources, e.g. the Web.


Participants from different fields such as machine learning, text 
mining, natural language processing, etc. are welcome. Participation in 
the challenge is free. After registration, participants will receive the 
corpus of documents to train on and the precise instructions on the 
tasks to be performed. At an established date, participants will be 
required to submit their systems’ answers via a Web portal. An automatic 
scorer will compute the accuracy of extraction. Each participant will also 
have to write a paper describing the system and the results obtained. 
Results of the challenge will be discussed in a dedicated workshop.
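
The announcement does not say which measure the automatic scorer uses. 
As an illustration only, the sketch below computes precision, recall and 
F1 over exact-match annotations, a common choice for information 
extraction. The tuple layout (document id, tag, start token, end token) 
and the tag names are assumptions made for the example, not the 
challenge's actual format.

```python
def score_extraction(gold, predicted):
    """Return (precision, recall, f1) for exact-match annotations."""
    gold, predicted = set(gold), set(predicted)
    correct = len(gold & predicted)  # annotations matching in every field
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical annotations: (document id, tag, start token, end token).
gold = {
    ("cfp_0001", "workshop_name", 12, 15),
    ("cfp_0001", "workshop_date", 40, 43),
    ("cfp_0001", "workshop_location", 50, 52),
}
predicted = {
    ("cfp_0001", "workshop_name", 12, 15),
    ("cfp_0001", "workshop_date", 41, 43),  # wrong start, counts as an error
}
precision, recall, f1 = score_extraction(gold, predicted)
```

Here one of the two predictions is correct and one of the three gold 
annotations is found, so precision is 1/2 and recall is 1/3.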


- 30th June 2004: Registration starts: formal definition of the tasks, 
annotated corpus and evaluation server will be made available to 
- 15th October 2004: Formal evaluation
- November 2004: Presentation of evaluation at Pascal workshop


Fabio Ciravegna, University of Sheffield, UK (coordinator)
Mary Elaine Califf, Illinois State University, USA
Dayne Freitag, Fair Isaac Technologies, USA
Nicholas Kushmerick, University College Dublin, Ireland
Alberto Lavelli, ITC-Irst, Italy

Local Organizer: Neil Ireson, University of Sheffield.

Further Information

For further details about the challenge, visit 

For general enquiries about the challenge and its motivations, contact 
Fabio Ciravegna (F.Ciravegna at dcs.shef.ac.uk). For details about 
participation, registration and technical queries, please contact Neil 
Ireson (N.Ireson at dcs.shef.ac.uk).

Professor Fabio Ciravegna,
Department of Computer Science, University of Sheffield,
Regent Court, 211 Portobello Street, S1 4DP, Sheffield, UK
Tel:+44(0)114-22.21940, Fax:+44(0)114-22.21810
www: http://www.dcs.shef.ac.uk/~fabio/
