[Elsnet-list] Freely available JRC-Acquis parallel corpus (22 languages) tripled in size

Ralf Steinberger ralf.steinberger at jrc.it
Mon Apr 30 18:46:50 CEST 2007


The freely available JRC-Acquis parallel corpus (22 languages) tripled in
size

 

We are pleased to announce a new release of the freely available
multilingual parallel corpus JRC-Acquis. The corpus has nearly tripled in
size (totalling over 1 billion words), and Bulgarian texts have now been
added (thanks to the Romanian Academy of Sciences), so that the parallel
texts are now available in 22 languages.

 

SIZE AND FORMAT

 

- 22 languages (all official EU languages except Irish)

- Average corpus size per language: 28.9 million words, plus 19 million
words in annexes, etc.

- 23,000 texts per language (fewer in Bulgarian, Maltese and Romanian)

- XML format according to TEI P4, UTF-8-encoded

- Modular: download the languages you need.
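
As a rough illustration of working with UTF-8-encoded TEI P4 files, here is a minimal Python sketch. The sample document, its id value and the use of numbered <p> elements are assumptions made for illustration; the actual corpus files may be structured differently, so check a downloaded file before relying on this.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for one corpus document. Only the TEI P4 root
# element name (TEI.2) and the UTF-8 encoding come from the format
# description; the id and the numbered <p> elements are assumptions.
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<TEI.2 id="example-doc" lang="en">
  <text>
    <body>
      <p n="1">First paragraph.</p>
      <p n="2">Second paragraph.</p>
    </body>
  </text>
</TEI.2>"""


def read_paragraphs(xml_bytes):
    """Return (number, text) tuples for every <p> element in the document."""
    root = ET.fromstring(xml_bytes)
    return [(p.get("n"), (p.text or "").strip()) for p in root.iter("p")]


# Parse from bytes so the parser honours the declared UTF-8 encoding.
paragraphs = read_paragraphs(SAMPLE.encode("utf-8"))
for number, text in paragraphs:
    print(number, text)
```

For a real file, the same function could be fed the bytes read from disk in binary mode.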

 

LANGUAGES

 

Bulgarian, Czech, Danish, Dutch, English, Estonian, German, Greek, Finnish,
French, Hungarian, Italian, Latvian, Lithuanian, Maltese, Polish,
Portuguese, Romanian, Slovak, Slovene, Spanish, Swedish.

 

TEXT TYPES

 

- Documents on contents, principles and political objectives of the EU
Treaties

- EU legislation

- Declarations

- Resolutions

- Acts

- International agreements.

 

PARAGRAPH ALIGNMENT

 

Paragraph alignment for all 231 language pairs will soon be available for
version 3.0 of the corpus. The following text applies to version 2.2, still
available on the same website:

 

- Paragraph-aligned for all 210 language pairs

- Paragraphs are sentence parts, sentences, or groups of sentences

- 2 alternative alignments: using Vanilla and HunAlign

- Ca. 270,000 alignments per language pair.
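
The two pair counts above follow from counting unordered language pairs: n languages give n(n-1)/2 pairs. A minimal Python sketch, using the language list from this announcement and assuming that version 2.2 covered 21 languages (which the figure of 210 pairs implies but the announcement does not state):

```python
from itertools import combinations
from math import comb

# The 22 languages listed in this announcement.
LANGUAGES = [
    "Bulgarian", "Czech", "Danish", "Dutch", "English", "Estonian",
    "German", "Greek", "Finnish", "French", "Hungarian", "Italian",
    "Latvian", "Lithuanian", "Maltese", "Polish", "Portuguese",
    "Romanian", "Slovak", "Slovene", "Spanish", "Swedish",
]

# Every unordered pair of distinct languages, e.g. ("Bulgarian", "Czech").
pairs = list(combinations(LANGUAGES, 2))

print(len(LANGUAGES), "languages ->", len(pairs), "pairs")

# Assumption: version 2.2 had 21 languages, giving 21 * 20 / 2 pairs.
print(21, "languages ->", comb(21, 2), "pairs")
```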

 

MANUAL SUBJECT DOMAIN CLASSIFICATION

 

- Manually classified according to EUROVOC subject domains

- Selected from 6,000 hierarchically organised, wide-coverage classes.

 

USE / DOWNLOAD

 

- Download from http://langtech.jrc.it/JRC-Acquis.html

- Free to use for research purposes.

 

FOR MORE DETAILS

 

Ralf Steinberger, Bruno Pouliquen, Anna Widiger, Camelia Ignat, Tomaž
Erjavec, Dan Tufiş, Dániel Varga (2006). 'The JRC-Acquis: A multilingual
aligned parallel corpus with 20+ languages'. Proceedings of the 5th
International Conference on Language Resources and Evaluation (LREC'2006).
Genoa, Italy, 24-26 May 2006. Available at
http://langtech.jrc.it/#Publications.


 

The JRC's Language Technology group
specialises in the development of highly multilingual text analysis tools
and in cross-lingual applications. An example is our multilingual (19
languages) news analysis application NewsExplorer, publicly accessible at
http://press.jrc.it/NewsExplorer. 

Related JRC developments (both covering
22+ languages):

- NewsBrief (http://press.jrc.it):
breaking news detection and display of the very latest thematic news from
around the world;

- Medical Information System MedISys
(http://medusa.jrc.it): displays the latest health-related news from around
the world according to themes and diseases.

Ralf Steinberger
European Commission - Joint Research Centre (JRC)
IPSC - SeS - EMM - Language Technology 

http://langtech.jrc.it,
http://press.jrc.it/NewsExplorer

 
