 

Statistics-based MT

 

Statistics-based MT treats machine translation as a process of information transmission and explains it with a channel model. On this view, translating a source-language sentence into a target-language sentence is a probability problem: any target-language sentence is a possible translation of any source-language sentence, and the candidates differ only in probability. The task of machine translation is to find the translation with the maximum probability. Concretely, translation is regarded as a decoding process from the source text to the target text by means of model conversion. Statistical machine translation can therefore be divided into three problems: the model problem, the training problem, and the decoding problem. The model problem is to establish a probability model for machine translation, that is, to define how the probability of translating a source-language sentence into a target-language sentence is calculated. The training problem is to estimate all the parameters of this model from a corpus. The decoding problem is, given the model and its parameters, to find the maximum-probability translation for any input source-language sentence.
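The channel-model view above is usually written as the following decision rule (using the conventional symbols f for the source sentence and e for a candidate target sentence, which the text itself does not name); applying Bayes' rule and dropping the constant denominator gives:

```latex
\hat{e} \;=\; \operatorname*{arg\,max}_{e}\, P(e \mid f)
       \;=\; \operatorname*{arg\,max}_{e}\, \frac{P(e)\,P(f \mid e)}{P(f)}
       \;=\; \operatorname*{arg\,max}_{e}\, P(e)\,P(f \mid e)
```

Here defining $P(e)$ (the language model) and $P(f \mid e)$ (the translation model) is the model problem, estimating their parameters from a corpus is the training problem, and computing the argmax is the decoding problem.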

 

In fact, using statistics to solve the machine translation problem was not a new idea of the 1990s. W. Weaver had already proposed this approach in his 1949 memorandum on machine translation, but because of criticism from N. Chomsky and others, it was soon abandoned. The main objection was that language is infinite, so an empirical, statistics-based description cannot meet the actual requirements of language.

 

In addition, given the speed of computers at the time, putting statistical methods to practical use was out of the question. Today, both the speed and the capacity of computers have improved greatly: work that once required an old mainframe can be completed on a small workstation or a personal computer. Moreover, the successful application of statistical methods in speech recognition, text recognition, and lexicography shows that such methods remain very effective in automatic language processing.

 

The mathematical models of statistical machine translation were proposed by researchers at the International Business Machines Corporation (IBM). Five word-to-word statistical models, known as IBM Model 1 through IBM Model 5, were introduced in the well-known article "The Mathematics of Statistical Machine Translation: Parameter Estimation". These five models derive from the source-channel model and estimate their parameters by maximum likelihood. Because of the limited computing conditions of the time (1993), training on large-scale data could not be carried out. Subsequently, the Hidden Markov Model-based alignment model proposed by Stephan Vogel also attracted attention and came to be used in place of IBM Model 2. At this stage of research, the statistical models considered only linear relationships between words, without considering sentence structure, so the results may be poor when the word order of the two languages differs greatly. Taking syntactic or semantic structure into account in the language model and translation model should yield better results.
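The word-to-word models mentioned above are trained with expectation-maximization. As a minimal sketch (not IBM's implementation), the following estimates IBM Model 1 translation probabilities t(e|f) by EM on a tiny made-up German-English corpus; the sentence pairs and iteration count are illustrative assumptions:

```python
from collections import defaultdict

# Toy parallel corpus (hypothetical sentence pairs, not a real training set).
corpus = [
    (["das", "haus"], ["the", "house"]),
    (["das", "buch"], ["the", "book"]),
    (["ein", "buch"], ["a", "book"]),
]

def train_ibm_model1(corpus, iterations=10):
    """Estimate word-translation probabilities t(e|f) with EM (IBM Model 1)."""
    f_vocab = {f for fs, _ in corpus for f in fs}
    e_vocab = {e for _, es in corpus for e in es}
    # Start from a uniform distribution over target words.
    t = {(e, f): 1.0 / len(e_vocab) for e in e_vocab for f in f_vocab}
    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        for fs, es in corpus:
            for e in es:
                # E-step: distribute each e over its possible source words.
                norm = sum(t[(e, f)] for f in fs)
                for f in fs:
                    c = t[(e, f)] / norm
                    count[(e, f)] += c
                    total[f] += c
        # M-step: renormalize the expected counts.
        for (e, f) in t:
            t[(e, f)] = count[(e, f)] / total[f]
    return t

t = train_ibm_model1(corpus)
# EM concentrates probability on co-occurring pairs: "haus" -> "house".
print(max(("the", "house", "a", "book"), key=lambda e: t[(e, "haus")]))
```

Even on three sentence pairs, EM learns that "the" is better explained by "das" (which occurs in two pairs), leaving "house" as the dominant translation of "haus".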

 

Six years after this article was published, at the Johns Hopkins University summer workshop on machine translation, a group of researchers implemented the GIZA software package. Franz Josef Och subsequently optimized the software and accelerated its training, especially for IBM Models 3 through 5; he also proposed a more complex Model 6. The package released by Och was named GIZA++, and to this day GIZA++ has remained a cornerstone of most statistical MT systems. Several parallel versions of GIZA++ exist for training on large-scale corpora.

 

The performance of word-based statistical MT was limited by its small modeling unit, so many researchers turned to phrase-based translation methods. Franz Josef Och proposed discriminative training with a maximum-entropy model, which greatly improved the performance of statistical machine translation; for the next few years this method was far ahead of the alternatives. A year later, Och modified the optimization criterion of the maximum-entropy method to optimize an objective evaluation metric directly, and thus Minimum Error Rate Training (MERT), still widely used today, came into being.
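In the discriminatively trained model described above, each candidate translation is scored by a weighted sum of feature functions, and MERT tunes the weights against an evaluation metric. A minimal sketch of the scoring step (the candidate sentences, feature names, values, and weights are all invented for illustration):

```python
# Log-linear model: score(e, f) = sum_i lambda_i * h_i(e, f); the decoder
# outputs the highest-scoring candidate. Values below are made up.
def score(features, weights):
    """Weighted sum of feature-function values for one candidate."""
    return sum(weights[name] * value for name, value in features.items())

# Hypothetical log-probability features h_i for two candidate translations.
candidates = {
    "the house is small": {"lm": -2.1, "tm": -1.3, "length": -0.4},
    "house the small is": {"lm": -6.5, "tm": -1.1, "length": -0.4},
}
# Feature weights (lambdas); in a real system MERT would tune these.
weights = {"lm": 1.0, "tm": 0.8, "length": 0.5}

best = max(candidates, key=lambda e: score(candidates[e], weights))
print(best)  # -> the house is small
```

The language-model feature penalizes the scrambled word order heavily, so the fluent candidate wins; MERT's contribution is choosing the weights so that this argmax agrees with an objective metric such as BLEU.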

 

Another important invention promoting the further development of statistical MT is automatic, objective evaluation, which scores translation results automatically and thus avoids tedious and expensive manual evaluation. The most important such metric is BLEU, and most researchers still use BLEU as the primary criterion for evaluating their results. Moses, a well-maintained open-source machine translation system developed by researchers at the University of Edinburgh, made the previously cumbersome and complex processing simple when it was released.
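BLEU combines modified n-gram precisions with a brevity penalty. The following is a simplified sentence-level sketch (real BLEU is computed over a whole corpus, usually with smoothing, and the example sentences are invented):

```python
import math
from collections import Counter

def bleu(candidate, reference, max_n=4):
    """Simplified sentence-level BLEU: geometric mean of modified n-gram
    precisions (n = 1..max_n) times a brevity penalty."""
    precisions = []
    for n in range(1, max_n + 1):
        cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
        ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum(min(c, ref[g]) for g, c in cand.items())
        precisions.append(overlap / max(sum(cand.values()), 1))
    if min(precisions) == 0:
        return 0.0
    log_avg = sum(math.log(p) for p in precisions) / max_n
    # Brevity penalty: punish candidates shorter than the reference.
    bp = min(1.0, math.exp(1 - len(reference) / len(candidate)))
    return bp * math.exp(log_avg)

cand = "the cat sat on the mat".split()
ref = "the cat sat on a mat".split()
print(round(bleu(cand, ref), 3))  # ~0.537 for this toy pair
```

A perfect match scores 1.0, and any candidate with no matching 4-gram scores 0 in this unsmoothed form, which is why corpus-level aggregation or smoothing is used in practice.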

 

Currently, Google's well-known online translation service is built on the statistics-based MT method. Its basic operating principle is to collect a large number of bilingual web pages as a corpus; the computer then automatically selects the most common correspondences between words and produces the translation. The technology Google uses is undeniably advanced, yet it still makes all kinds of "translation jokes". The reason is that statistics-based methods require a large-scale bilingual corpus: the accuracy of the translation-model and language-model parameters depends directly on the amount of corpus data, and translation quality depends mainly on the probability model and on corpus coverage. Although the statistics-based method does not depend on extensive linguistic knowledge, and relying directly on statistical results for ambiguity resolution and translation selection avoids several difficulties of language understanding, selecting and processing the corpus is a tremendous amount of work. For this reason, few current general-domain machine translation systems are purely statistics-based.

COPYRIGHT 2010 AITRANS, ALL RIGHTS RESERVED. 京ICP备9035536号

Hotline:86-010-82893875    E-mail:info@aitrans.net

Registration number: 京ICP备18027361号-2