Attachments

Attachment 1. EU Language Technology Resources

EU Language Technology Resources are collections of translation memories and parallel texts that can serve as source data for machine translation, because they provide small pieces of text together with their translations. The resources listed below are examples of EU translation memories. Translation memories and parallel texts are useful for several reasons:
  • They can be used to train statistical machine translation (SMT) systems
  • They support the training and testing of multilingual data extraction software
  • Translations can be checked for consistency automatically
  • They provide ready-made data packages for many different languages
The EU resources are based on parallel text material (parallel corpora), which can also be used for machine translation based on neural networks. A parallel corpus is a large, structured and controlled set of texts in one language together with their translations into another.
When using EU Language Technology Resources, the user should check which languages and data each resource package contains, and in which format. The packages cover different languages and serve different machine translation purposes, so a single package may not contain all the data needed.
The following list of selected EU Language Technology Resources illustrates what the resources contain:

JRC-Acquis

This collection of documents and their manually produced translations can be used for many purposes, including the training of statistical machine translation systems and the training and testing of text mining applications.
Languages: Bulgarian, Czech, Danish, Dutch, English, Estonian, German, Greek, Finnish, French, Hungarian, Italian, Latvian, Lithuanian, Maltese, Polish, Portuguese, Romanian, Slovak, Slovenian, Spanish, Swedish.
Missing Nordic and Baltic languages: Norwegian, Icelandic

DGT-Acquis

This collection of aligned full-text documents, and their manually produced translations, can be used for many purposes, including the training of statistical machine translation systems, the training and testing of text mining applications, and more.
Languages: Bulgarian, Czech, Danish, Dutch, English, Estonian, German, Greek, Finnish, French, Irish, Hungarian, Italian, Latvian, Lithuanian, Maltese, Polish, Portuguese, Romanian, Slovak, Slovenian, Spanish, Swedish.
Missing Nordic and Baltic languages: Norwegian, Icelandic

DCEP-Digital Corpus of the European Parliament

The corpus includes a variety of text types, including press releases, motions, minutes of plenary sessions, rules of procedure, reports and written questions to the Parliament. This collection of sentence-aligned, full-text documents and their manually produced translations can be used for many purposes, including the training of statistical machine translation systems, the training and testing of text mining applications, and more.
Languages: Bulgarian, Czech, Danish, Dutch, English, Estonian, German, Greek, Finnish, French, Irish, Hungarian, Italian, Latvian, Lithuanian, Maltese, Polish, Portuguese, Romanian, Slovak, Slovenian, Spanish, Swedish, Turkish.
Missing Nordic and Baltic languages: Norwegian, Icelandic

DGT-Translation Memory (DGT-TM)

Translation memories are collections of small pieces of text and their manually produced translations. Translation memories are typically used to support human translators, but they can also be used to train statistical machine translation systems. DGT-TM consists of between 4 and 7 million units per language. It is distributed in the widely used TMX format.
Languages: Bulgarian, Croatian, Czech, Danish, Dutch, English, Estonian, German, Greek, Finnish, French, Irish, Hungarian, Italian, Latvian, Lithuanian, Maltese, Polish, Portuguese, Romanian, Slovak, Slovenian, Spanish, Swedish.
Missing Nordic and Baltic languages: Norwegian, Icelandic
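
Because the translation memories are distributed in TMX, a short sketch may help show how translation units could be read into sentence pairs for training. The example below is a minimal illustration in Python, not an official tool: the function name read_tmx_pairs and the sample data are invented for this example. It assumes the usual TMX layout of tu, tuv and seg elements and accepts the language code either as xml:lang or as lang, since the attribute differs between TMX versions; the actual files should be checked before relying on it.

```python
import io
import xml.etree.ElementTree as ET

# ElementTree expands the xml: prefix to this namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Invented sample data, only for illustration (not taken from DGT-TM).
SAMPLE_TMX = """<?xml version="1.0" encoding="UTF-8"?>
<tmx version="1.4">
  <header srclang="en"/>
  <body>
    <tu>
      <tuv xml:lang="EN-GB"><seg>Article 1</seg></tuv>
      <tuv xml:lang="FI-FI"><seg>1 artikla</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="EN-GB"><seg>Article 2</seg></tuv>
      <tuv xml:lang="DE-DE"><seg>Artikel 2</seg></tuv>
    </tu>
  </body>
</tmx>"""


def read_tmx_pairs(source, src_lang, tgt_lang):
    """Yield (source_text, target_text) for units that contain both languages.

    `source` is a file name or a binary file-like object. Language codes are
    compared on their primary subtag, so "EN-GB" matches "en".
    """
    src_lang, tgt_lang = src_lang.lower(), tgt_lang.lower()
    for _, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "tu":
            continue
        segments = {}
        for tuv in elem.findall("tuv"):
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            seg = tuv.find("seg")
            if seg is not None:
                segments[lang.split("-")[0]] = "".join(seg.itertext()).strip()
        if src_lang in segments and tgt_lang in segments:
            yield segments[src_lang], segments[tgt_lang]
        elem.clear()  # release the unit's children to keep memory use low


if __name__ == "__main__":
    data = io.BytesIO(SAMPLE_TMX.encode("utf-8"))
    for source_text, target_text in read_tmx_pairs(data, "en", "fi"):
        print(source_text, "|", target_text)
```

Units that lack one of the two requested languages are skipped, which matters in practice because not every unit in a multilingual memory is available in every language.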

EAC-Translation Memory (EAC-TM)

The parallel corpus was provided by the European Commission’s Directorate General for Education and Culture (EAC), and the data have been processed further by the JRC. The EAC-TM is smaller than the other parallel corpora available here, but it has the advantage that it focuses on a very different domain. EAC-TM consists of a total of over 32,000 units. It is distributed in the widely used TMX format.
Languages: Bulgarian, Czech, Danish, Dutch, English, Estonian, German, Greek, Finnish, French, Croatian, Hungarian, Icelandic, Italian, Latvian, Lithuanian, Maltese, Norwegian, Polish, Portuguese, Romanian, Slovak, Slovenian, Spanish, Swedish and Turkish.

ECDC-Translation Memory (ECDC-TM)

The majority of the documents talk about health-related topics (anthrax, botulism, cholera, dengue fever, hepatitis, etc.), but some of the web pages also describe the organisation ECDC (e.g., its organisation, job opportunities) and its activities (e.g., epidemic intelligence, surveillance). ECDC-TM consists of up to 2500 translation units per language. It is distributed in the widely used TMX format.
Languages: Bulgarian, Czech, Danish, Dutch, English, Estonian, German, Greek, Finnish, French, Hungarian, Icelandic, Irish, Italian, Latvian, Lithuanian, Maltese, Norwegian, Polish, Portuguese, Romanian, Slovak, Slovenian, Spanish, Swedish.
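
Since a single package may not cover the languages needed, the language lists above can be turned into a small lookup. The following Python sketch is only an illustration: the names BASE, CATALOG and resources_covering are invented here, and the language sets are copied from the lists in this attachment, so they should be verified against the current resource descriptions before use.

```python
# Languages that appear in every resource listed in this attachment.
BASE = set(
    "Bulgarian Czech Danish Dutch English Estonian German Greek Finnish "
    "French Hungarian Italian Latvian Lithuanian Maltese Polish Portuguese "
    "Romanian Slovak Slovenian Spanish Swedish".split()
)

# Resource name -> languages, as listed in this attachment.
CATALOG = {
    "JRC-Acquis": BASE,
    "DGT-Acquis": BASE | {"Irish"},
    "DCEP": BASE | {"Irish", "Turkish"},
    "DGT-TM": BASE | {"Croatian", "Irish"},
    "EAC-TM": BASE | {"Croatian", "Icelandic", "Norwegian", "Turkish"},
    "ECDC-TM": BASE | {"Icelandic", "Irish", "Norwegian"},
}


def resources_covering(*languages):
    """Return the resources whose language list contains all given languages."""
    wanted = set(languages)
    return [name for name, langs in CATALOG.items() if wanted <= langs]


if __name__ == "__main__":
    print(resources_covering("Finnish", "Norwegian"))
    print(resources_covering("Finnish", "Swedish"))
```

For a pair such as Finnish and Norwegian, only EAC-TM and ECDC-TM qualify according to these lists, and both are small compared to the other resources.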