[Elsnet-list] DGT-ACQUIS: New freely available large-scale aligned parallel corpus in 23 languages
ralf.steinberger at jrc.ec.europa.eu
Wed Nov 28 10:26:38 CET 2012
Following the release of the JRC-Acquis in 2006, the DGT-Translation Memory in several releases since 2007 and the ECDC-Translation Memory in 2012, we are now releasing the new parallel corpus DGT-Acquis. DGT-Acquis has been produced by the European Commission’s Directorate General for Translation (DGT) and it is being distributed by the Joint Research Centre (JRC).
DGT-Acquis is a parallel collection of manually translated full-text documents in all 23 official EU languages that has been paragraph-aligned for all 253 language pairs. It was produced from the Official Journal (OJ) of the European Union (more specifically the L, LM, C, CA and CE Series).
Languages: All 253 language pairs involving the following 23 languages:
Bulgarian, Czech, Danish, Dutch, English, Estonian, German,
Greek, Finnish, French, Irish, Hungarian, Italian, Latvian,
Lithuanian, Maltese, Polish, Portuguese, Romanian, Slovak,
Slovene, Spanish and Swedish.
Creator: European Commission - Directorate General for Translation (DGT, <http://ec.europa.eu/dgs/translation/index_en.htm>)
Size: 3.54 million files; 5 GB in plain text format
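The figure of 253 language pairs follows directly from the 23 languages listed above: it is the number of unordered pairs, 23 x 22 / 2. A minimal Python sketch that enumerates them is given here for illustration; the names LANGUAGES and language_pairs are assumptions made for this example and are not part of the DGT-Acquis distribution.

```python
from itertools import combinations

# The 23 official EU languages named in the announcement.
LANGUAGES = [
    "Bulgarian", "Czech", "Danish", "Dutch", "English", "Estonian",
    "German", "Greek", "Finnish", "French", "Irish", "Hungarian",
    "Italian", "Latvian", "Lithuanian", "Maltese", "Polish",
    "Portuguese", "Romanian", "Slovak", "Slovene", "Spanish", "Swedish",
]


def language_pairs(languages):
    """Return every unordered pair of distinct languages (hypothetical helper)."""
    return list(combinations(sorted(languages), 2))


pairs = language_pairs(LANGUAGES)
print(len(LANGUAGES), "languages give", len(pairs), "language pairs")
```

Each pair is counted once regardless of translation direction, so English-French and French-English are the same pair.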
WHAT IS DGT-ACQUIS
DGT-Acquis consists of a collection of Official Journal issues published in up to 23 languages between 2004 and 2011. The full-text documents have been paragraph-aligned automatically for all language pairs. The data is being distributed in several formats: (1) the original XML data and its corresponding TIFF files; (2) file level data in Formex4 format; (3) file level data in plain text format; and (4) the same data aligned at paragraph level. Users can thus make use of the aligned data or they can re-process the data using their own tools and methods.
WHAT IS THE DIFFERENCE BETWEEN DGT-ACQUIS AND THE OTHER RESOURCES DISTRIBUTED BY THE JRC
While the translation memories DGT-TM and ECDC-TM are collections of individual translation units (or sentences) taken out of their full-text context, both JRC-Acquis and DGT-Acquis consist of full-text documents aligned at sentence or paragraph level. The data can therefore be used for applications that need to analyse entire texts, e.g. discourse structure analysis, detection of domain information, experiments on automatic summarisation, translation studies, etc.
Regarding the contents of the documents, JRC-Acquis and DGT-Acquis partially overlap for the period 2004 to 2006 while the documents for all other time periods should be unique. Comparing the resources used to produce DGT-Acquis and DGT-TM, DGT-TM is based exclusively on the L-Series of the Official Journal, while DGT-Acquis also contains the LM, C, CA and CE collections.
The processing steps (data preparation and alignment) used to produce the various data sets were entirely different. The formats differ, and the processing quality of each resource is expected to differ as well. For details on the resources and on the overlap between them, see the detailed descriptions of the resources at http://ipsc.jrc.ec.europa.eu/index.php?id=61.
MOTIVATION FOR THIS RELEASE
The public data release is in line with the general effort of the European Commission to support multilingualism, language diversity and the re-use of Commission information. It follows the release of the JRC-Acquis parallel corpus in 2006 (over 1 billion words in 22 languages), of the DGT-TM Translation Memory since 2007, of the multilingual named entity resource JRC-Names in 2011, of the multilingual multi-label classification tool (and accompanying text data) JRC EuroVoc Indexer (JEX) (22 languages), and of further smaller multilingual resources. See http://ipsc.jrc.ec.europa.eu/index.php?id=61 for more information on these resources.
WHAT DGT-ACQUIS CAN BE USED FOR
DGT-Acquis is a large parallel corpus in electronic form. It can be used by specialists in computational linguistics to train statistical machine translation software, to generate multilingual dictionaries, to train and test multilingual information extraction software, to train and test summarisation or discourse analysis software, to train and test cross-lingual clustering and classification, and more. Parallel corpora are also particularly useful for annotation projection across languages <http://publications.jrc.ec.europa.eu/repository/handle/111111111/1/simple-search?query=%28%28author%3ASteinberger%29+AND+%28title%3AAnnotation+title%3AParallel%29%29&from_advanced=true&conjunction3=AND&field4=type&conjunction2=AND&field3=ANY&field2=title&conjunction1=AND&query4=&field1=author&query1=Steinberger&query2=Annotation+Parallel&query3=&num_search_field=4> , which saves annotation effort and thus facilitates the development of highly multilingual text processing software.
MORE INFORMATION ON DGT-ACQUIS
At http://langtech.jrc.ec.europa.eu/JRC_Publications.html , you will find detailed publications on the JRC’s multilingual language technology activity. No detailed publication on DGT-Acquis itself is available yet. Until further notice, please cite it by pointing to the web page http://langtech.jrc.ec.europa.eu/DGT-Acquis.html.
The JRC and collaborating European Union services are currently finalising the release of further highly multilingual linguistic resources.
Ralf Steinberger <http://langtech.jrc.ec.europa.eu/RS.html>
European Commission - Joint Research Centre (JRC)
21027 Ispra (VA), Italy
URL – Applications: http://emm.newsbrief.eu/overview.html
URL – Resources: http://ipsc.jrc.ec.europa.eu/index.php?id=61