Creator of Tatar online translator: ''We are ready to create not only an analogue of Google or Yandex''
Interview with Aydar Khusainov, senior research fellow at the Institute of Applied Semiotics, on tackling the machine translation problem
The Institute of Applied Semiotics of the Tatarstan Academy of Sciences has created a test version of a neural network-based Russian-Tatar translator. What sets it apart is that it translates without dictionaries. In an interview with Realnoe Vremya, Aydar Khusainov, a senior research fellow at the institute, explained how the translator differs from Yandex's product and what problems the team had to face when translating from Tatar.
Translator for sentences
Could you explain to ordinary users what a neural dictionary means?
First of all, we should understand the difference between a dictionary and a translator. Roughly speaking, a dictionary takes a word, phrase or vocabulary entry and gives translation variants and interpretations, shows what part of speech it is, and so on. Machine translation is mainly designed to translate sentences and often doesn't translate separate words. We have grown accustomed to Yandex and Google translating separate words. This happens because dictionaries are connected to the machine translator. In this project, we are creating a system that translates sentences from Russian to Tatar and from Tatar to Russian.
What does neural network-based mean?
It's one of the methods and technologies used, for instance, to analyse texts and speech. The majority of AI-related tasks are now performed with neural networks. Without going into theory, the gist is that corpora, or databases of so-called training data, are prepared. In our project, we prepare data that allow the system to learn how texts are translated from one language to another. In other words, in our case, it's a large base of parallel Russian-Tatar texts.
What does it look like? It's pairs of sentences: one in Russian and its translation in Tatar. A large number of such pairs is needed: millions, at least two million pairs of sentences. The neural network is set up so that it is given the data, studies them according to a certain algorithm and certain rules, and tries to learn how to translate what it hasn't seen yet, that is, a new sentence. The larger and more diverse the training base is, the better the text a user enters will be translated.
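The idea of a parallel corpus and of translating "what it hasn't seen yet" can be sketched in a few lines. This is a minimal illustration, not the institute's pipeline: the `SentencePair` type, the `split_corpus` helper, the 25% held-out share and the four example pairs are all assumptions made for the sketch. A real corpus would hold millions of pairs.

```python
import random
from typing import List, Tuple

# One training example: (Russian sentence, Tatar sentence).
SentencePair = Tuple[str, str]


def split_corpus(
    pairs: List[SentencePair],
    held_out_fraction: float = 0.1,
    seed: int = 0,
) -> Tuple[List[SentencePair], List[SentencePair]]:
    """Shuffle the pairs and set a fraction aside as unseen sentences.

    The model would be trained on the first list only; the second list
    stands in for new sentences it has never seen.
    """
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    n_held_out = max(1, int(len(shuffled) * held_out_fraction))
    return shuffled[n_held_out:], shuffled[:n_held_out]


# Tiny illustrative corpus (hypothetical examples, not project data).
corpus: List[SentencePair] = [
    ("Привет", "Сәлам"),
    ("Спасибо", "Рәхмәт"),
    ("Хорошо", "Яхшы"),
    ("Я читаю книгу", "Мин китап укыйм"),
]

train, held_out = split_corpus(corpus, held_out_fraction=0.25)
print(len(train), "pairs for training,", len(held_out), "held out")
```

Keeping some pairs out of training is the usual way to check whether a system has learned to translate rather than memorised its data.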
''Not many products support the Tatar language''
How did the idea come about, in general? Why did you decide to create such a translator?
The idea of creating machine translators isn't new, but successful neural translators have appeared only in recent years. Initially, machine translation was done with a rule-based approach: people tried to write rules that translated with the help of dictionaries and knowledge about the structure of languages. This approach is also used in our institute, where it is applied successfully to other language pairs. It works very well when languages are cognate and similar. So we are creating a translator for Turkic languages: Tatar, Kazakh, Kyrgyz, Uzbek and others. Their sentence structure and word formation are similar, and a good product can be built with such rules. Russian and Tatar, however, are so different that nobody has managed to produce good translation with rules. The neural approach, the recent successes in machine learning and the results accumulated by our institute have allowed us to achieve a higher level of translation.
How does your development differ from other translators?
Not many products support the Tatar language. If we talk about machine translators, at the moment only Yandex can translate from and into Tatar; Google Translate doesn't have Tatar. We can probably say that Google, Yandex and other big IT companies use almost the same developments, published in leading journals and presented at international conferences. We keep up to date with them. But our development is distinctive because, from the start, we didn't try to make a universal translator for a large number of languages. We knew we needed to work with the Tatar language. Our institute already had groundwork in different models, syntactic and morphological, and we have analysers. We planned from the beginning to take the specifics of the Tatar language into account. In addition, within the state programme on the conservation, learning and development of languages in the Republic of Tatarstan, we have a series of related projects on speech technologies, namely recognition and synthesis of Tatar speech, so that a user can dictate in Tatar and the computer can read text aloud in Tatar. We are including these products in the translator as well. What does this give us? A user can not only type and see a translation but also dictate text and hear the translation in Tatar and Russian. At the moment even Yandex doesn't have such a function for Tatar.
It is not only the general population of Tatarstan that is interested in such translation services. There is also a need to translate various official documents, and the news must be published in both languages. We are ready to create not only a universal translator for the population, an analogue of Google or Yandex, but also separate models to translate laws, official and business materials and so on.
''There are many searches for Russian-Tatar translation on the Internet''
What will this translator look like? Or what does it already look like? I mean, I will type or say something, and there will be a translation?
A machine translator in itself is a programme. We created a website for the public that is already running in test mode. A user visits the site, which has two main text fields. One is for entering text in Russian or Tatar. The second, accordingly, shows the translation. In addition, there are special buttons to dictate text, have the translation read aloud and so on. It's a site with a very simple interface.
When did you start working on this project? What stage is it in now?
We've been working on the neural translator for several years under the above-mentioned state programme, which sets out the plans we have to fulfil. The programme ends in late 2020, and we committed to building a publicly accessible translator that would work at a certain level of quality on a set number of topics. At the moment, we are not only meeting these plans but running ahead of them. We decided to test the publicly accessible translator this year already, not in late 2020. All the basic elements have been created: full versions of speech analysis and synthesis, translation and the website itself, and there are server-side elements too. However, each of these elements will continue to be improved.
When do you plan to present the final version?
There is no such thing as a final version of a translator, because the task of machine translation has not yet been solved for any language in the world, even for such popular pairs as English-German and English-Chinese. The task now is to gradually improve the quality of this translation in all directions, so that it becomes a good aid for both professional translators and the general public.
When will users be able to use the translator, give or take?
For the most part, this doesn't depend on us, because we have already tested it this year. In general, we have all the tools and are ready to provide them to users. The problem is with servers. Research, science and software development are financed by the state programme, while the purchase of servers and ongoing support fall outside it. So we hope to solve this problem next year with support from the state and private businesses. Right now we can't provide access because the current servers are leased and aren't designed for a large number of users. We can't ignore the fact that Russian-Tatar translation is of interest to users and that there are many searches for it on the Internet.
Enough money for servers?
You said the project was financed by the state programme. How much money was allocated?
The exact numbers are available, as all reports are published on the Internet. I can say that during the first year of this part of the programme, the machine translator specifically, the money was spent on joint work with ABBYY. What was it about? It was dedicated to introducing the SmartCAT project in Tatarstan. It's a tool that also uses machine translators internally. It was created to let a professional translator get through daily document translation much faster. If a phrase has been translated once, next time the system offers that translation itself, and the translator only needs to make some amendments. The system was introduced in some institutions, departments and ministries of the republic but requires more active integration and use. The financing for 2019-2020 is intended for improving the machine translator only, and about 2.4m rubles in total have been allocated for the next two years.
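The behaviour described here, where a phrase translated once is offered again the next time, is commonly called a translation memory. The following is a minimal sketch of that idea under stated assumptions, not SmartCAT's actual implementation: the `TranslationMemory` class, its method names and the exact-match lookup are hypothetical, and production tools typically also offer fuzzy (approximate) matches.

```python
from typing import Dict, Optional


class TranslationMemory:
    """Toy exact-match translation memory (illustrative only)."""

    def __init__(self) -> None:
        self._segments: Dict[str, str] = {}

    @staticmethod
    def _normalize(segment: str) -> str:
        # Ignore case and extra whitespace when comparing segments.
        return " ".join(segment.lower().split())

    def store(self, source: str, target: str) -> None:
        """Remember a translation confirmed by a human translator."""
        self._segments[self._normalize(source)] = target

    def suggest(self, source: str) -> Optional[str]:
        """Return the stored translation, or None if the segment is new."""
        return self._segments.get(self._normalize(source))


memory = TranslationMemory()
memory.store("Спасибо", "Рәхмәт")

print(memory.suggest("  спасибо "))  # stored translation is offered again
print(memory.suggest("Привет"))      # None: not translated before
```

When `suggest` returns `None`, a tool of this kind would fall back to machine translation and let the translator correct the result before storing it.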
Is it relatively little?
It's not relatively little, it's too little. The machine translators that users see are usually built by large IT and research centres, which spend tens and hundreds of millions of rubles on research. Even if we don't pay a salary to anybody for the two years, the financing still won't be enough to buy the necessary servers.
Who else participates in the creation of the translator?
To create a tool for professional translators, we actively cooperate with ABBYY. Otherwise, the work is done by our institute's staff: programmers, linguists and machine learning specialists. We also bring in a group of translators because, as I said at the beginning, we need a base of Tatar-Russian translations to train the system. To create this large base, we gathered all the news and all the documents that are available in both Russian and Tatar. These data aren't enough, so people translate texts from Russian to Tatar to replenish the base.
As for the website, we should mention the specialists who handle its design, code and web programming. Our institute has 13 employees. We bring in other specialists when we can, when we have the money.
Where and how can this translator be used in the future?
The target audience is quite wide. Firstly, it's people who are learning Tatar: schoolchildren, students and people who simply want to improve their Tatar language skills. Secondly, specialised use is expected, for instance, when preparing documentation and news.
What difficulties did you face when creating a translator that translates from the Tatar language?
The difficulties can be divided into two parts. First, machine translation itself is an unsolved task that is still actively developing from a scientific point of view. Second, there are difficulties linked with the specifics of the Tatar language. Every Tatar stem can take a large number of affixes, that is, word formation is very rich. We used methods that not only get around this problem but also turn this feature into an advantage, so that the translator works better with methods adapted to the Tatar language than with standard ones. Another point is that a pair such as Russian-English has a large volume of training material. Nothing comparable has ever been created for the Tatar-Russian pair, which is why our work in this direction is especially important. The total volume of Tatar text on the Internet, in books and in magazines is currently too small to compare Tatar with the big world languages. Even if we use the latest technologies, the quality still depends on the data we have. And accumulating these data, through collection and translation, is very hard and long work.
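The interview doesn't name the methods the institute used, so the following only illustrates why rich affixation matters and how splitting words into a stem plus affixes can help. It is a toy sketch under loud assumptions: the `segment` function, the three-item `AFFIXES` list and the greedy matching are hypothetical, and real Tatar morphology (vowel harmony, affix variants) is far richer.

```python
from typing import Iterable, List, Optional

# Hypothetical, tiny affix inventory used only for this illustration.
AFFIXES = ["лар", "ым", "да"]


def _split_affixes(rest: str, affixes: List[str]) -> Optional[List[str]]:
    """Greedily peel known affixes off `rest`; None if something is left over."""
    pieces: List[str] = []
    while rest:
        match = next((a for a in affixes if rest.startswith(a)), None)
        if match is None:
            return None
        pieces.append(match)
        rest = rest[len(match):]
    return pieces


def segment(word: str, stems: Iterable[str],
            affixes: List[str] = AFFIXES) -> List[str]:
    """Split a word into a known stem and a chain of known affixes.

    Falls back to returning the whole word when no analysis is found.
    """
    for stem in sorted(stems, key=len, reverse=True):
        if word.startswith(stem):
            tail = _split_affixes(word[len(stem):], affixes)
            if tail is not None:
                return [stem] + tail
    return [word]


stems = {"китап"}  # "book"
for form in ["китап", "китаплар", "китапларым", "китапларымда"]:
    print(form, "->", segment(form, stems))
```

The four surface forms above share one stem and three affixes, so a system that sees pieces instead of whole words needs far fewer distinct units to cover the same text. This is the general motivation for subword-based approaches with limited training data.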