Mihael Arčan

Mihael Arčan is a PhD student at Insight@NUI Galway. He works in the Unit for Natural Language Processing (UNLP) under the supervision of Dr. Paul Buitelaar, where his research focuses on term translation with statistical machine translation (SMT).

Mihael studied German at the University of Ljubljana, where he completed his diploma thesis on “Named Entity Recognition for German and Slovene” under the supervision of Stojan Bračič and Špela Vintar.

In 2009, he earned his master’s degree in Computational Linguistics at the University of the Ruhr in Bochum, Germany. His master’s thesis, supervised by Prof. Dr. Ralf Klabunde, dealt with the extraction of semantic relations from the Slovenian national corpus.

After his studies in Germany, he worked for Lionbridge as a developer of language technologies for Slovene, and for the Slovenian project “Communication in Slovene”.

At Insight@NUI Galway (and previously at DERI) he worked on the MONNET (Multilingual Ontologies for Networked Knowledge) project, a European project with the goal of providing standards and technology to facilitate multilingual access to Semantic Web knowledge resources. He is currently part of the EUROSENTIMENT project, where he focuses on the challenges posed by the multilinguality of the project's resources.

Statistical Machine Translation and Terminology

Professional translators deal on a daily basis with texts from different domains (information technology (IT), legal, agriculture, etc.), each of which requires domain-specific lexical knowledge.
Nowadays, statistical machine translation (SMT) systems translate very frequent expressions well, but often fail on domain-specific terms. This is mostly due to a lack of domain-specific parallel data from which the SMT systems can learn. Generic systems such as Google Translate and Bing Translator are the most common solution and are often used to translate manuals or other highly specialised texts, which results in translations of only modest quality.
On the other hand, online terminological resources (e.g. the ‘Interactive Terminology for Europe’, IATE) are a valuable and fundamental support for translators, although consulting them continuously can be time-consuming. For all these reasons, integrating terminological knowledge into the SMT system is a crucial step towards increasing translator productivity and reducing the initial overhead translators face when working in different domains.
The talk will give an overview of how an SMT system generates a translation from a source language into a target language. The main focus of the presentation will be the embedding of terminology into open-source phrase-based SMT (PB-SMT) systems such as Moses. Finally, the talk will conclude with a discussion of the future of SMT.