Tuesday, October 25, 2005

NEC Develops Speech-to-Speech Translation Software

NEC Develops Speech-to-Speech Translation Software for Low Power Consumption Multi-Core Processors Optimal for Small Devices such as Mobile Phones

Tokyo, October 24, 2005 --- NEC Corporation today announced that it has succeeded in the development of Japanese-English/English-Japanese, automatic speech translation software for single-chip multi-core processors for small devices such as mobile phones, capable of operation at high speeds with low power consumption. NEC verified the high-speed automatic speech translation processing capability of this software on NEC Electronics' MP211 (note 1*) application processor for mobile phones, at an operating frequency of 200MHz, proving that operation of interpretation applications is technologically feasible on small devices like mobile phones.

Supporting a rich 50,000-word vocabulary, this software realizes automatic speech-to-speech interpretation of travel conversation through the development of a new parallel speech recognition method (note 2*) for single-chip processors with several CPU cores, and a compact, lexical-rule-based machine translation engine that unites dictionaries with grammar (note 3*) and is operable on small devices.

The features of this software include:

(1) A parallel, large-vocabulary, continuous speech recognition engine, built on a database covering a wide range of conversational sounds and words, that enables accurate recognition of spoken words.

(2) A lexical-rule-based machine translation engine, which achieves high-performance translation of spoken words using dictionaries and grammar compiled from a wide range of language knowledge data.

(3) An advanced wave-concatenative speech synthesis engine, which realizes high-performance reading based on a wide range of speech data.

(4) A total integration module that controls collaborative operation of the speech recognition engine, the machine translation engine, and the speech synthesis engine, realizing automatic translation on a single processor for mobile phones.

With the advancement of the information society and increased freedom of movement across borders, technology for automatic speech interpretation and translation that supports communication between different languages is progressing rapidly.

NEC's developments in this area include:
Automatic Japanese – English/English – Japanese translation software for notebook PCs in 1999

Commercial launch of "Tabitsu" (American English version), communication software supporting English travel conversation, in 2001

PDA-operational Japanese – English travel conversation, automatic speech translation software in 2002

The next natural development for NEC was to expand this technology to small, light-weight portable devices that can be used anytime, anywhere. However, in order to achieve this goal it was necessary to realize large CPU power, required for speech recognition, and machine translation technology for interpretation, which are both exceedingly difficult to achieve on low-power multi-core processors for small devices such as mobile phones. NEC has accomplished this development through the synthesis of its proprietary parallel speech recognition technology and its compact machine translation technology with its multi-core processor technology.

NEC will continue to advance research of its speech recognition and language processing technologies toward the realization of a society where communication is possible anytime, anywhere.
-------------------------------------------------
Notes:

Note 1: With the combination of three ARM926EJ-S CPU cores and NEC Electronics' digital signal processor, the single-chip MP211 processor offers high-performance parallel processing capabilities optimized for applications such as mobile phones that are sensitive to power consumption requirements.

Note 2: Through adoption of an acoustic look-ahead technique that can reduce word-search space, NEC's proprietary speech recognition method realizes acceleration of the entire interpretation process, dividing it into three steps comprised of recognition processing, reconstruction and maintenance of accuracy.

Note 3: This is a translation engine based on dictionary storing of each word's lexical rules that can easily expand both general translation rules and individual translation rules for fixed form expression, and which realizes excellent software downsizing and enhancement of translation quality.

Monday, October 17, 2005

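Trados 7.0: A First Look

I. First, software version compatibility

1. Trados 7.0

The 7.0 build I just downloaded is version 7.0.0.615 (see Figure 1) and is authorized with a license. I installed Trados 7.0 directly without uninstalling 6.5, and found that the 6.5 Trados Translator's Workbench (TWB) could no longer be used. The translation memories (TMs) I had created earlier, however, were carried over into Trados 7.0.

Figure 1

If TWB is run again, the version information reverts to that of the original 6.5 (see Figure 2).

Figure 2

Fortunately, Trados 7.0 ships with a license manager (see Figure 3) that can be used to re-license the product.

Figure 3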

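Version 7.0 can be licensed in three ways:

1. soft key,

2. dongle,

3. license.

The earlier 6.5 installation was cracked by way of the dongle. The version-information problem described above is probably caused by the program loading the 6.5 dongle information again when the Trados 6.5 TWB is rerun.

The same applied when upgrading Trados from 5.5/6.0 to 6.5: if 6.5 was installed first and an earlier version afterwards, the 6.5 dongle had to be run again to restore the version to 6.5.

Testing turned up one interesting quirk of the Trados products: 5.5 can coexist with 6.5, and 5.5 can coexist with 7.0, but 6.5 and 7.0 conflict and cannot be installed on the same system.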

2. TRADOS MultiTerm 7 Desktop

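The current MultiTerm 7 version is 7.0.1.320 (see Figure 4).

Figure 4

During installation, MultiTerm 7 warns that an older version of MultiTerm is present and that the existing MultiTerm IX must be uninstalled before setup can continue. It prompts the user to back up the existing termbases first and then uninstall.

The installation steps for MultiTerm 7 are essentially the same as for earlier MultiTerm versions, with one extra step to locate the license. Because Trados 7.0 had just been installed, MultiTerm 7 loaded the license automatically.

Before installation finishes, MultiTerm 7 also asks whether to restore the backed-up termbases; choosing "Yes" restores the termbases from MultiTerm IX into the new version. (For reasons unknown, the termbase haha restored could not be used. The error reads: "The Microsoft Jet database engine cannot find the input table or query 'mtIndexes'"; see Figure 5.)

Figure 5

As with Trados 7.0, MultiTerm 5.5 can coexist with MultiTerm IX, and MultiTerm 5.5 can coexist with MultiTerm 7.0, but MultiTerm IX and MultiTerm 7.0 cannot be installed on the same system.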

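II. Next, the features

Many features have been enhanced. I will not go through them one by one here; a look at the product's own list of new features will do. haha only wants to talk about the changes that matter most to us translators, or that impressed haha the most.

1. Trados TWB

1) TWB in Trados 7.0 can now use multiple termbases (see Figure 6).

Figure 6

This is real progress for Trados. For years a Trados project could load only one termbase, and TWB often crashed if you needed to switch termbases in the middle of a translation, which long made Trados a laughingstock next to DejavuX. With 7.0, Trados can finally hold its head up.

2) The maintenance function in Trados TWB can finally "continue searching" (see Figure 7).

Figure 7

Maintaining a TM used to be a chore: you had to remember which page and which translation unit (TU) you had reached so that you could start from there next time. Even then, in a large TM you could still fail to find your place, which was very inconvenient. Trados 7.0 adds a pointer, so after each maintenance session you can continue the search next time from where you stopped, which is far more convenient.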

2. Multiterm 7.0    

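Besides a better-looking interface, MultiTerm 7.0 has finally added Entry control buttons to the main window and the toolbar (see Figure 8).

Figure 8

Earlier MultiTerm versions had no Entry control buttons, so users could only add terms manually with F3 and F10. That was very inconvenient and hard on newcomers. MultiTerm 7.0 can finally be counted as a standalone application.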

3. TagEditor     

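TagEditor has always been an auxiliary tool in Trados, intended mainly for localization projects, that is, for translating tagged documents other than Microsoft Word files. In Trados 7.0, TagEditor starts to stand on its own: it can now translate Word .doc documents in WYSIWYG ("what you see is what you get") mode (see Figure 9).

Figure 9

Trados is building up TagEditor partly to reduce its dependence on Microsoft as far as it can. Of course, Trados has not burned its bridges: TWB can still be used together with Word.

Well, that was a lot of rambling. Trados 7.0 really has improved a great deal; haha has only tested it briefly and shared a few impressions. Trados 7.0 has many more good features waiting for everyone to discover!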

---------------------------------------------------

Original post by 哈哈 (haha), webmaster of 「翻譯中國」, reproduced with thanks.
http://www.fane.cn/forum_view.asp?forum_id=42&view_id=14672

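Spoken-Language Machine Translation Systems

• ATR-ITL spoken-language translation system: In recent years, research on automatic interpreting telephony has begun abroad. In Japan's Kansai region, the Advanced Telecommunications Research Institute International – Interpreting Telecommunications Research Laboratories (ATR-ITL) was established with the aim of applying speech recognition and speech synthesis to machine translation in order to achieve speech-to-speech machine translation. In 1989, ATR developed the SL-TRANS system.

• SpeechTrans and JANUS systems: developed at Carnegie Mellon University (CMU) in the United States.

• KITANO system: In the early 1990s the Japanese researcher Kitano, while at Kyoto University, ran speech translation experiments using massively parallel computing and an example-based approach, demonstrating that real-time spoken-language translation at millisecond scale is achievable.

• Verbmobil project: funded by the German Federal Ministry of Education, Science, Research and Technology (BMBF), with the aim of "securing Germany an internationally leading position in language technology and its economic applications in the next century through cooperation and concentration across as many branches of industry and science as possible."

• Verbmobil set out a development plan for 1993-2001. The first phase, from 1993 to 1996, brought in members from 32 companies and universities in Germany, the United States, and Japan, with 46.9 million marks of government funding and 3.1 million marks from industry. The goal of the first phase was a speaker-independent spoken-language translation system for appointment-scheduling dialogues.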





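• C-STAR project: The Consortium for Speech Translation Advanced Research (C-STAR) was founded in 1991. C-STAR is an international collaboration whose basic research goal is spoken-language translation; it has 20 members from 12 countries.

• The core members are seven institutions from seven countries: Carnegie Mellon University (CMU) in the United States; ATR-ITL in Japan; the University of Karlsruhe (UKA) in Germany; GETA-CLIPS, the machine translation research centre at the University of Grenoble in France; ITC-IRST, the institute for scientific and technological research in Italy; the advanced network services technology department of ETRI in Korea; and the National Laboratory of Pattern Recognition (NLPR) at the Institute of Automation, Chinese Academy of Sciences. Other members include Siemens (Germany) and the Hong Kong University of Science and Technology.

• C-STAR treats direct translation of spoken language across multiple languages as a scientific engineering effort. By building platforms and demonstrations it aims to accelerate the development of spoken-language translation technology, to become the cradle for moving that technology into industrial use, and to remove language barriers between people.

• As a core C-STAR member, NLPR at the Institute of Automation, Chinese Academy of Sciences, has built the platform for an experimental spoken-language translation system, has completed EasySchedule, a prototype Chinese-English spoken-language machine translation system for appointment scheduling, and is developing a Chinese-English system intended to be usable in practice at a basic level.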

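Sunday, October 02, 2005

A New Breakthrough in Chinese Machine Translation Technology

In the international evaluation of core machine translation technology that has just concluded, organized by the Consortium for Speech Translation Advanced Research (C-STAR III), the Chinese-to-English system submitted by the "Web Content Management and Information Services" team at the Institute of Automation, Chinese Academy of Sciences, achieved the best results: a BLEU score of 0.528 and a NIST score of 10.25.

Approaching the Level of Human Translation

C-STAR is the earliest international organization devoted to spoken-language translation. This Chinese-to-English evaluation drew twelve well-known research institutions, including IBM (United States), ATR (Japan), Aachen University (Germany), ITC (Italy), and NTT (Japan).

According to the report, human Chinese-to-English translations of sentences typically score between 0.5 and 0.6 on BLEU, so a machine score of 0.528 indicates that, within the application scenario that was evaluated, the institute's translation output is already close to the level of human translation.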
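To make the BLEU figures above more concrete, here is a minimal Python sketch of a sentence-level BLEU-style score: the geometric mean of clipped n-gram precisions multiplied by a brevity penalty. It is a simplification for illustration only, assuming a single reference translation, whitespace tokenization, and no smoothing. It is not the scoring tool used in the C-STAR evaluation, and the function names `ngrams` and `simple_bleu` are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(candidate, reference, max_n=4):
    """Simplified single-reference BLEU: clipped n-gram precision and brevity penalty."""
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        total = sum(cand_counts.values())
        # Clip each candidate n-gram count by its count in the reference.
        matched = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        if total == 0 or matched == 0:
            return 0.0  # no smoothing in this sketch
        log_precisions.append(math.log(matched / total))
    # Penalize candidates that are shorter than the reference.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(sum(log_precisions) / max_n)
```

A candidate identical to its reference scores 1.0, and a candidate sharing no words with it scores 0.0. Real evaluations compute the score over a whole test set, usually against several references, which is one reason sentence-level numbers from a sketch like this are not comparable to the published figures.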

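Translating scientific and technical material has long been an important task for research institutes, universities, intelligence departments, and large enterprises, and it matters more and more as international exchange grows. Large imported projects in particular often come with foreign-language documentation measured in tonnes. Translating it by hand alone is plainly difficult and does not scale, which makes machine translation necessary. The history of machine translation shows that, as information technology has developed and global networks have converged, the technology has steadily improved and translation software has become increasingly useful as an aid to translators.

There are reportedly more than a hundred machine translation products today. By how they translate, they fall roughly into three categories: dictionary-style translation, Chinese-localization translation, and professional translation.

A Fine Piece of Work in Machine Translation

Dictionary-style translation software is in effect a fast, cheap, and capable electronic dictionary: it quickly looks up the meaning of an English word or phrase and provides its pronunciation, which makes understanding words and phrases much easier. Chinese-localization software offers an "intelligent localization environment" that gives people who know little or no English a "complete solution to the language barrier", including internal-code conversion, dynamic localization, and an electronic dictionary. It serves users who want to localize English software and web pages, understand English text on screen, and get a first-pass translation of an article, and it is of real practical use for gathering information and grasping the gist of a text. Professional translation systems, by contrast, are aimed specifically at professional or industry users.

International experts estimate that machine translation will need at least another fifteen years of sustained research to reach the fluency of human translation. In other words, as long as we do not understand how the human brain performs fuzzy recognition and judgment of language, one hundred percent accuracy is out of reach for machine translation. Even so, the newly developed Chinese-to-English system from the Institute of Automation, Chinese Academy of Sciences, with results approaching the best level of human translation, is a fine contribution to the field.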

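Thursday, September 08, 2005

Xuan Zang's 10 Steps for Translating Buddhist Scriptures

The steps Xuan Zang used to translate Buddhist scriptures are still in use today. I found an English article about them online and am posting it below for reference: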

Xuan Zang, Possibly China's Greatest Translator

His 10-stage quality control process initiated more than 1300 years ago is far more thorough and exacting than any existing today.

Introduction

Every Chinese, young and old, within and outside China, knows the classical language rendering of the exploits of Xuan Zang, the pious Tang dynasty monk and his three storybook disciples: the indestructible Monkey King, the Great Sage; Brother Pig, the Eight Denials (of Buddhism); and Sand Monk, the third disciple. In real life, Xuan Zang was a truly remarkable Buddhist monk. He travelled on land, across mountains and deserts, through hostile and uncharted territories, to the birthplace of Buddha in the Indian sub-continent and thereafter returned to Chang’an (modern day Xi’an) with a set of Buddhist sutras. The voluminous sutras were written in the extremely difficult Sanskrit language. Together with his doyens of pupils, he completed the translation of some 75 volumes of the sutras into the equally difficult Chinese language.

The 10 Stages

Buddhist sutras, translated into Chinese earlier than the Tang dynasty, were difficult to read and comprehend because the responsible translators were all Buddhist monks of non-Han Chinese origin. It took China several hundred years to groom its own selection of Buddhist monks who could master both the Sanskrit language and the complex Buddhism doctrines, which were written in Sanskrit. And Xuan Zang was recognized as the foremost among them. He was appointed chief of the Tang dynasty imperial translation centre. It was he who designed and implemented a translation workflow that would guarantee the quality of the final product. The process detailed below is well worth adopting by any modern day translation team.

Stage 1

Master Translator and Buddhism expert to jointly study and interpret the original text written in Sanskrit. It could involve one or more persons.

Stage 2

Members of the translation team to attend a recital of the text in question by the Master Translator. The purpose is to verify the accuracy of the interpretation undertaken in Stage 1. A recital is necessary because the scriptures were originally written for recitation.

Stage 3

A team of junior translators produces the first Chinese draft from the Sanskrit text. The draft includes a transliteration of Sanskrit terms into Chinese equivalents.

Stage 4

Production of a complete Chinese version by a senior Han Chinese Buddhist monk trained to undertake scripture translation. This is the most important stage in the entire process involving a monk with an in-depth knowledge of Chinese culture and language.

Stage 5

Refinement of the complete Chinese version, construction and structure of sentences. This is a necessary stage because of the vast linguistic differences between the source and target language.

Stage 6

Reverse translation of the Chinese version into Sanskrit in order to verify the accuracy in the interpretation of the original text. Mistakes in interpretation are to be promptly rectified in the Chinese version.

Stage 7

Review of the verified Chinese version to identify errors in usage of characters, and refinement in linguistic expressions to improve readability.

Stage 8

Further polishing to improve the literary beauty of the language, adding linguistic colours to the otherwise monotonous writing.

Stage 9

Verification of the audio quality by reciting the translation aloud. The audio effect is important because scriptures are for preaching aloud to an audience.

Stage 10

Final check by the Master Translator.

Conclusion

Xuan Zang did a great job in the translation of the Buddhist sutras. He was not only an outstanding linguist but he was also wholeheartedly committed to the task as a devout Buddhist possessing an extraordinary understanding of its contents. Indeed, he dedicated thirteen years of his life to the task. The task was not simply limited to transforming one language into another. In order to effectively spread Buddhism, the sutras in Chinese would have to be spiritually understood by the Chinese devotees. In today's context, he had to take into account the marketplace. Was it intended for the gentry, on whom Buddhism depended for its financial support, or for the less literate populace, to widen market share, or for both?

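Wednesday, August 24, 2005

Machine Translation Test: Google Most Accurate

Search giant Google's ambition to make the Web more international recently got a boost from a U.S. government test of machine translation software, in which it beat rivals including software from academia and IBM.

In the Arabic-to-English and Chinese-to-English tests, Google received the highest scores from the U.S. National Institute of Standards and Technology (NIST). Each test involved translating 100 news articles from sources ranging from Agence France-Presse (AFP) to the Xinhua News Agency, dated from December 1, 2004 to January 24, 2005. The results were published earlier this month.

The quality of computerized translation has long been criticized, but with more computing power and larger data samples, scientists have found ways to improve the accuracy of machine translation.

For example, the startup Language Weaver has written software that can translate Al Jazeera broadcasts. Several universities, including the Language Technologies Institute at Carnegie Mellon University (CMU), conduct dedicated research in this area (though neither of the two took part in this year's test).

Google's machine translation is not perfect, but it leads its rivals by a wide margin. On a scale with a maximum of 1, Google scored 0.5137 for Arabic and 0.3531 for Chinese. Second place went to the University of Southern California's Information Sciences Institute, with 0.4657 for Arabic and 0.3073 for Chinese. IBM came third, with 0.4646 for Arabic and 0.2571 for Chinese.

Other participants included the University of Edinburgh in the United Kingdom and the Harbin Institute of Technology in China. NIST said most of the software in the test came from research and development laboratories.

Google's edge probably comes from the vast data sources it has gathered. In general, the performance of machine translation software varies with the amount of data fed into it, and through its search business Google has collected hundreds of millions of translated web pages.

Like Yahoo, Google is looking to developing countries for new customers. Google offers some machine translation tools on its own site and maintains multiple international editions.

CNET News: Michael Kanellos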
23/8/2005

Monday, August 15, 2005

CAT Bulletin (《電腦輔助翻譯通訊》), Inaugural Issue

創刊詞
The Inaugural Issue of CAT Bulletin
- Message from the Editor


《電腦輔助翻譯通訊》是香港中文大學翻譯系電腦輔助翻譯碩士課程與校內外各界人士溝通的刊物,一方面提供課程結構、科目內容、教師資歷、學術活動、公開講座、論著出版、研究成果等各方面的資訊,另一方面亦會透過各種形式,例如學術會議的論文、「翻譯技術研討會」的講詞、及修讀學生的專題作業,來介紹電腦輔助翻譯的最新發展,讓大家對電腦輔助翻譯的範疇及本系的碩士課程都有較深入的認識。

CAT Bulletin is published to facilitate the dissemination of information relating to the Master of Arts in Computer-aided Translation Programme of the Department of Translation, The Chinese University of Hong Kong, to targeted readers. It includes information on programme structure, course contents, staff profiles, academic activities, public seminars, staff publications, and research findings. It also provides information on new advances in computer-aided translation through the presentation of conference proceedings, speeches at the seminars on translation technology, and students’ translation projects. It is hoped that there will be a better understanding of computer-aided translation and our M.A. programme through the publication of CAT Bulletin.

電腦輔助翻譯是一個發展迅速的領域。電腦輔助翻譯系統亦在世界各地廣泛應用,成為翻譯行業的一項重要工具。從事翻譯工作的人士需要認識這個大趨勢,接受適當的訓練,提升翻譯技術的能力,才可以應付未來的挑戰。《電腦輔助翻譯通訊》將以季刊形式報道翻譯系在這方面所作的努力及課程發展的情況。

Computer-aided translation is a fast-growing area. CAT systems are also widely used in different parts of the world, and have become an essential part of the translation profession. Translators need to realize this global trend and receive proper training in computer-aided translation to enhance their skills in translation technology in order to meet the challenges in the future. CAT Bulletin, a quarterly, will report on the efforts we make in this direction and the development of our MACAT programme.

Tuesday, August 09, 2005

The Demand for and Development of Machine Translation (机器翻译的需求与发展)

Types of translation demand
几种翻译需求

When giving any general overview of the development and use of machine translation (MT) systems and translation tools, it is important to distinguish four basic types of translation demand. The first, and traditional one, is the demand for translations of a quality normally expected from human translators, i.e. translations of publishable quality – whether actually printed and sold, or whether distributed internally within a company or organisation. The second basic demand is for translations at a somewhat lower level of quality (and particularly in style), which are intended for users who want to find out the essential content of a particular document – and generally, as quickly as possible. The third type of demand is that for translation between participants in one-to-one communication (telephone or written correspondence) or of an unscripted presentation (e.g. diplomatic exchanges.) The fourth area of application is for translation within multilingual systems of information retrieval, information extraction, database access, etc.

当我们讨论机器翻译系统和翻译工具的发展时,首先需要区分四种基本的翻译需求:第一是传统型,它要求翻译结果和人(翻译家)翻得一样好,即翻译结果达到出版水平;第二种需求对翻译质量的要求稍低一些,尤其是对文体的要求较低,用户这时最感兴趣的是了解某篇文章的基本内容,因此希望翻译速度越快越好;第三种需求是对话双方一对一的交谈(打电话或者在Internet聊天室里聊天)或无需写在纸上的演讲(如外交场合的谈话);第四种需求是在信息检索、信息抽取、数据库访问等多语言系统里所需进行的翻译。

The first type of demand illustrates the use of MT for dissemination. It has been satisfied, to some extent, by machine translation systems ever since they were first developed in the 1960s. However, MT systems produce output which must invariably be revised or ‘post-edited’ by human translators if it is to reach the quality required. Sometimes such revision may be substantial, so that in effect the MT system is producing a ‘draft’ translation. As an alternative, the input text may be regularised (or ‘controlled’ in vocabulary and sentence structure) so that the MT system produces few errors which have to be corrected. Some MT systems have, however, been developed to deal with a very narrow range of text content and language style, and these may require little or no preparation or revision of texts.

第一种机器翻译需求是为了传播思想。自机器翻译系统出现之日起,这种需求可以说在某种程度上得到了满足。然而,要想达到用户需要的质量,机器翻译输出结果常常还需要由翻译家修改或进行"译后编辑"。在很多情况下,这些修改都是必需的,因此机器翻译系统实际上只是产生了一个"草稿型"译文。如果要减少后续的修改,就必须在翻译前对输入文件进行规整,对所用词语和句子结构进行"限制",使机器翻译系统不至于产生太多必须修改的错误。

The second type of demand – the use of MT for assimilation – has been met in the past as, in effect, a by-product of systems designed originally for the dissemination application. Since MT systems did not (and still cannot) produce high quality translations, some users have found that they can extract what they needed to know from the unedited output. They would rather have some translation, however poor, than no translation at all. With the coming of cheaper PC-based systems on the market, this type of use has grown rapidly and substantially.

第二种需求是为了了解信息而使用机器翻译系统,这一需求实际上已经作为第一种需求的副产品得到了实现。既然机器翻译系统尚不能直接产生高质量的译文,因此用户能从未经编辑的译文中找出或猜出他们需要的东西也是很有帮助的,毕竟翻译出一部分总比一点没有翻译要好。在这种情况下,尽管机器翻译的译文结果很糟糕,但随着PC价格越来越低廉,这类机器翻译系统的需求量也大大增加了。

With the third type – MT for interchange – the situation is changing quickly. The demand for translations of electronic texts on the Internet, such as Web pages, electronic mail and even electronic ‘chat’ lists, is developing rapidly. In this context, the possibility of human translation is out of the question. The need is for immediate translation in order to convey the basic content of messages, however poor the input. MT systems are finding a ‘natural’ role, since they can operate virtually or in fact in real-time and on-line and there has been little objection to the inevitable poor quality. Another context for MT in personal interchange is the focus of much research. This is the development of systems for spoken language translation, e.g. in telephone conversations and in business negotiations. The problems of integrating speech recognition and automatic translation are obviously formidable, but progress is nevertheless being made. In the future – still distant, perhaps – we may expect on-line MT systems for the translation of speech in highly restricted domains.

第三种需求是以交流信息为目的的机器翻译。由于信息更新速度很快,不可能由人来翻译,用户需要马上得出翻译结果以便传达信息的基本内容。例如基于Internet的在线翻译系统,它能实时进行翻译,但翻译质量难尽人意。有些机器翻译系统目前正在探索如何"自然"地扮演自己的角色。另一种用于人际交流的机器翻译系统是口语翻译系统,它可以用在电话交谈、商务会谈等场合。目前有很多专家正在研发这类系统,其难点在于语音合成和自动翻译。这一领域的研究尽管进展缓慢,但我们仍然可以希望将来在非常受限的领域里应用在线口语机器翻译系统。

The fourth type of MT application – as components of information access systems – is the integration of translation software into: (i) systems for the search and retrieval of full texts of documents from databases (generally electronic versions of journal articles in science, medicine and technology), or for the retrieval of bibliographic information; (ii) systems for extracting information (e.g. product details) from texts, in particular from newspaper reports; (iii) systems for summarising texts; and (iv) systems for interrogating non-textual databases. This field is the focus of a number of projects in Europe at the present time, which have the aim of widening access for all members of the European Union to sources of data and information whatever the source language.

第四种机器翻译需求是信息访问系统提出的。在这里,机器翻译软件被集成到一系列子系统中,这些子系统包括如下几类:

1、 数据库的全文搜索和检索系统,一般是科学、医学和技术期刊杂志的电子版,或文献信息检索系统;
2、 从文本,特别是新闻报道中提取信息;
3、 对文本进行综述的系统;
4、 查询非文本数据库系统。

目前,这方面有几个项目正在欧洲进行,目的是使所有欧盟成员国都能访问数据和信息源,无论用什么源语言。

Future needs and developments
未来需求及发展

Despite the recent growth of systems for personal computers and of Internet services, it is still true to say that there is nothing yet really suitable for the independent professional translator, i.e. for those not working for large companies or in translation organizations. It is known that some translators have tried to apply commercial PC-based software to their needs, but the amount of adaptation required and the generally poor output has made them unsatisfactory and uneconomic. More suitable for the independent translator would be a cost-effective translation workstation. However, current workstations on the market are still too expensive for the individual translator. Although there is promise of low-cost computer tools for this potentially large market – e.g. terminology and concordancing software, and perhaps alignment software – there is no doubt that this segment is not being covered as well as many other areas.

尽管近年来针对微机和Internet的机器翻译服务有上升趋势,但实事求是地说,还没有一个机器翻译系统特别适合于自由职业的翻译工作者,也就是那些既不隶属于一个大公司也不在一个翻译组织里工作的人。据调查,有些翻译工作者曾试图使用商用PC翻译软件,但需要进行"译后编辑"的工作量太大,机器翻译输出结果太差,无法满足他们的需求。尽管人们希望能针对这一潜在的大市场开发出低成本的翻译辅助工具,例如术语协调软件、对齐软件等,但目前还没有产品面市。

Another area at present poorly served is the need for reliable but low-cost translation of documents into unknown foreign languages where users do not want to engage expert bilingual translators. There is no problem with translation into recipients’ own languages – PC systems can give adequate ‘rough’ versions for users to get some idea of the basic message – but for translation into an unknown language there are still no solutions. There have been recently some cheap Japanese products which serve this specific ‘foreign language authoring’ demand in the case of writing business letters (based on standard phrases and document templates), but for other areas and for longer documents, where there is less ‘stereotyping’, there is nothing as yet. For translation into another language unknown (or poorly known) by the sender, what is really required is software which can be relied upon to provide good quality output (and most PC products are not good enough). A number of research groups are investigating interactive systems, where the sender composes an MT-friendly version of a letter or document in collaboration with the computer. With a sufficiently ‘normalised’ input text, the MT system can guarantee grammatically and stylistically correct output. As yet, however, this work (e.g. at GETA in France) is still at the laboratory stage (Boitet and Blanchon 1995).

目前面临挑战的另一个应用领域是将用户的输入译成用户所知甚少或未知的外国语,这时用户并不想充当双语翻译家的角色。机器翻译系统可以给出大致"粗略"的译文,至少可以告诉用户大致说的是什么。但对那些不知道目标语言的翻译,目前还没有什么解决办法。最近日本研制出一些廉价的产品,可以对特定的"外语授权(foreign language authoring)"提供服务。例如,写一封商务信函(基于标准短语和文件模板),但对其他领域或较长的文件,因为"规矩套路"很少,所以还不能编写。目前有几个研究小组正在研究交互式系统,发送者按照模板要求编写文档,如果输入文件足够"正规化",机器翻译系统就能保证语法和语言风格的正确输出。

The same is true for software combining MT with information access, information extraction, and summarisation software. There are no commercial systems yet on the market; developments are still at the research stages. The potential and the demand have been recognised: for example, in recent years, most research funds of the European Union have been focused not on MT or ‘pure’ natural language processing (as it was during the 1980s), but on projects for multilingual tools with direct applications in mind; many involve translation of some kind, usually within a restricted subject field and often in controlled conditions (Hutchins 1996; Schütz 1996). As just one example, the AVENTINUS project is developing a system for police forces in the area of drug control and law enforcement: information about drugs, criminals and suspects will be available on databases accessible in any of the European Union languages.

同样,将机器翻译技术与信息访问、信息提取和文摘软件结合在一起的尝试也处于研究阶段,目前市场上还没有商用产品,但开发商已经意识到其潜在的市场。例如,AVENTINUS项目是专门为警察部队在辑毒和执法方面开发的,用欧盟任何一种语言都可以访问中央数据库并查询关于毒品、犯罪和嫌疑犯的信息。目前,世界各国对这类跨语言应用的兴趣越来越大。最吸引人的应用是"跨语言信息检索",即允许用户用自己的语言搜索外语数据库。在这一系统中,大部分工作集中于如何建立和操作合适的翻译字典,以便将查询词串与数据库文档中的词和词组相匹配。相信在不久的将来会有这方面的商用软件出现。

The future application that is probably most desired by the general public is the translation of spoken language. But, from a commercial (and even research) perspective, the prospects for automatic speech translation are still distant (Krauwer et al. 1997). It was only in the 1980s that developments in speech recognition and synthesis made spoken language translation a feasible objective. In Japan a joint government and industry company ATR was established in 1986 near Osaka, and it is now one of the main centres for automatic speech translation. The aim is to develop a speaker-independent real-time telephone translation system for Japanese to English and vice versa, initially for hotel reservation and conference registration transactions. Other speech translation projects have been set up subsequently. The JANUS system is a research project at Carnegie-Mellon University and at Karlsruhe in Germany. The researchers are collaborating with ATR in a consortium (C-STAR), each developing speech recognition and synthesis modules for their own languages (English, German, Japanese). (One by-product of this research was mentioned earlier: the rapid-deployment project for custom-built systems in less-common languages.)

The fourth major effort in speech translation is the long-term VERBMOBIL project funded by the German Ministry for Research and Technology which began in May 1993. The aim is a portable aid for business negotiations as a supplement to users’ own knowledge of the languages (German, Japanese, English). Numerous German university groups are involved in fundamental research on dialogue linguistics, speech recognition and MT design; a prototype is nearing completion, and a demonstration product is targeted for early in the next century.

未来还有一种应用是公众迫切需要的,这就是口语翻译。但从商业角度或者研究角度看,全自动口语翻译还是一件十分遥远的事情。20世纪80年代,语音识别和语音合成技术取得的进展使人们感到口语翻译是可行的目标。日本ATR 公司建立了一个自动语音识别中心,目标是开发一个依赖于讲话者的实时日英、英日电话翻译系统。这一系统开始是面向旅馆预定房间和办理会议注册手续,后来增加了其他一些口语翻译系统。JANUS系统是卡耐基梅隆大学与德国Karlsruhe公司的合作研究项目。研究者与ATR合作形成一个合作体(C-STAR),每个研究者开发其母语(英语、德语、日语)的识别和生成模块。
口语翻译可能是目前机器翻译研究中最富有创新意义的领域,吸引了最多的资金和公众注意力。但观察家们并不相信这一领域在近期能取得迅速进展,因为书面语机器翻译花了数十年才达到现在的水平。口语翻译方面的另一项努力始于1993年5月由德国科学技术部出资支持的VERBMOBIL项目。该项目的目标是开发一个便携式商务谈判的辅助工具,好几所德国大学参与了这项对话语言学、言语识别和机器翻译设计的基础性研究工作。目前系统原型的开发已经接近尾声,很快将有演示产品出现。

Comparison of human and machine translation
人与机器翻译的比较

From this survey it should be apparent that the application of computers to the task of translating natural languages has not been and is unlikely to be a threat to the livelihood of professional translators. Those skills which the human translator can contribute will continue always to be in demand. There is no prospect, for example, that machine translation could ever attempt the translation of literary or legal texts. By contrast, for the rough translation of electronic texts on the Internet there is no rivalry for machine translation – human translators cannot compete in terms of speed, even if they were prepared to undertake poor quality translation of ephemeral material.

审视机器翻译的发展与现状,我们可以看到,使用计算机进行自然语言翻译并没有也不可能对职业翻译家的饭碗有什么威胁。翻译家的翻译技巧将继续得到重视。例如,机器翻译从来没有也不敢试图涉猎文学或法律文件的翻译。与之相对应的是,在Internet上粗略翻译电子邮件文本方面也没有什么方法能与机器翻译相比--人在速度方面比不过机器,即使翻译家愿意承担这类毫无保留价值的并常常是写得很差的文件的翻译工作,也难与机器翻译软件匹敌。

We may compare the relative merits of human and machine translation according to the categories of need and use outlined at the beginning of this paper. As far as the dissemination function (production of publishable translations) is concerned, human translation is more satisfactory and less costly overall whenever it is a question of translating one particular text in a unique subject domain (whether scientific, technical, medical, legal or literary). Machine translation demands the costly investment of dictionary maintenance and updating and the costly involvement of post-editing. This can be justifiable (i.e. cost-effective) only when large volumes of documentation within a particular domain are being translated. It is even more justifiable if translation is into more than one target language (when pre-editing and/or vocabulary and grammar control of original texts is possible), and when there is considerable repetition. For such tasks, the human translator would be overwhelmed by the scale of the task, by the boring repetitiveness and by the need to maintain terminological consistency. By contrast, the computer can handle large volumes and can automatically maintain consistency. In brief, machine translation is ideal for large scale and/or rapid translation of (boring) technical documentation, (highly repetitive) software localisation manuals, and real-time translation of weather reports. The human translator is (and will remain) unrivalled for non-repetitive linguistically sophisticated texts (e.g. in literature and law).

我们可以根据几种翻译需求来比较人与机器翻译的相对优缺点。对于"传播思想"的需求来说,凡是需要翻译某个特定领域(科学、技术、医学、法律或文学)的某段特殊文字,由人工翻译的质量更可靠且成本较低。而另一方面,机器翻译的字典维护和更新以及译后编辑需要较高的成本,因此只有当需要翻译某个领域的大量重复性文件时才是划算的。对这类翻译任务,翻译工作者会望而却步,因为工作量太大,且重复度太高,而且还要保持术语的一致性。简而言之,机器翻译适合于处理大量的、重复度高的技术资料、软件本地化手册、实时天气预报等资料,而人工翻译在语言非重复性的复杂文本方面有着无可替代的作用。

For the translation of texts for assimilation, where the quality of output can be poorer than that for texts to be published, it is clear that machine translation is an ideal solution. Human translators are not prepared (and resent being asked) to produce ‘rough’ translations of scientific and technical documents that may be read by only one person who wants to merely find out the general content and information and is unconcerned whether everything is intelligible or not, and who is certainly not deterred by stylistic awkwardness or grammatical errors. Of course, they might prefer to have output better than that presently provided by most MT systems, but if the only alternative option is no translation at all then machine translation is fully acceptable.

对于为了解信息而需要翻译的情形,显然使用机器翻译比较理想,因为此时对翻译质量要求不高。翻译家不打算而且也很反感被要"粗略"地翻译科学技术资料。当一个人只是想大致了解一下某篇文章的内容,并不想知道该文的一切细节,而且他也并不讨厌看到译文文体拙劣、语法错误百出时,机器翻译足可满足这种需求。 对于信息交流来说,在未来的一段时间里,人工翻译在翻译商务信函方面将继续起着主要作用,尤其是翻译那些内容比较敏感或与法律有关的文件。但对个人信件来说,机器翻译可能会用得越来越多。而对电子邮件、网络页面的信息提取以及基于计算机的信息服务来说,机器翻译可能是唯一可行的解决方案。 对于口语翻译而言,口语翻译家将继续占领市场,因为还没有迹象表明自动口语翻译会取代外交和商贸领域的口译家。尽管人们正在开展在高度受限领域的电话翻译研究,且未来也有希望实现,但对大量电话交谈来说,不可能出现什么系统来代替口译家。

Finally, MT systems are opening up new areas where human translation has never featured: the production of draft versions for authors writing in a foreign language, who need assistance in producing an original text; the on-line translation of television subtitles, the translation of information from databases; and no doubt, more such new applications will appear in the future. In these areas, as in others mentioned, there is no threat to the human translator because they were never included in the sphere of professional translation. There is no doubt that MT and human translation can and will co-exist in harmony and without conflict.

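MT systems are opening up areas that human translation has never covered: producing draft texts for authors who need to write in a foreign language, translating television subtitles on line, translating information from databases, and so on. More such applications will probably appear in the future, but they pose no threat to professional translators, because these are areas in which professional translators never worked. Machine translation and human translation will each keep to its own role and coexist in harmony.

English original: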
http://ourworld.compuserve.com/homepages/wjhutchins/Beijing.pdf

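Common Software Localization Terms

The software localization industry uses a great deal of jargon, which often confuses outsiders and newcomers. Localization is also part of the information industry, and as information technology develops rapidly new terms keep appearing, so even professionals with years of localization experience have to track and learn them.
This article lists the most common localization terms, some of which are also widely used in the IT industry at large. Each term is explained briefly and given with its English name.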

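Accelerator key (shortcut key). Common in Windows applications: a combination of keys pressed together to carry out a specific function. For example, Ctrl + P is the shortcut for printing.

Accented character. A character with a diacritical mark added above or below a Latin letter, for example ä. This issue does not arise for Chinese characters.

Alignment. The process of building a translation database by comparing source-language documents with their translations. Translation memory tools can carry out this process semi-automatically.

Bi-directional language. In Hebrew or Arabic, text is displayed from right to left, while embedded English words or trademark symbols are displayed from left to right. Chinese is displayed from left to right throughout.

Build. An internal version of the software compiled for testing during development. A large software project usually goes through several rounds of internal testing, so builds are compiled on a planned schedule.

Build environment. The set of files used to compile a software application.

Build sanity check. A quick check of basic functionality carried out by the build engineer on a freshly compiled build; once the build passes, testers begin formal, detailed testing.

Cascading style sheet (CSS). An external document that defines how markup files such as HTML are displayed.

Character set. A mapping from the characters of a writing system to a set of binary codes. For example, an ANSI character set encodes each character in 8 bits, while Unicode, as originally designed, used 16 bits per character.

CJKT (Simplified Chinese, Japanese, Korean, Traditional Chinese). Also written SC/JP/KO/TC or CHS/JPN/KOR/CHT.

Code page. A character set together with its character encoding scheme. Every character of a language is represented by a unique numeric index.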

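Collateral. The relatively small items in a software localization project, for example quick reference cards, disk labels, product packaging and marketing material.

Computer-aided translation (CAT). A general term for technology that uses computers to translate from one natural language into another, either automatically or in support of a human translator.

Concatenation. Joining pieces of text or strings to form a longer string.

Controlled language. A subset of a natural language, often used for writing technical documentation, that is more structured and easier to understand.

Cosmetic testing. Testing of an application's user interface.

Desktop publishing (DTP). Using computer software to lay out and style documents, graphics and images for printed output.

Double-byte character set (DBCS). A character encoding system that represents a character in two bytes. Chinese, Japanese and Korean are all represented with double-byte character sets.

Double-byte enablement. In internationalized software design, the ability to handle input, display, output and other processing of Asian double-byte characters.

Dynamic web site. A web site that combines databases, style sheets, scripting languages and HTML to display page content dynamically.

Extensible Markup Language (XML). A metalanguage for creating custom markup for a particular purpose. XML is a subset of SGML designed for the Web.

FIGS (French/Italian/German/Spanish). The major European languages to consider when localizing software.

Full match. A source segment that is identical to a sentence already stored in the translation database tool.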

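Functional testing. Running the software to test whether the product's functions meet the design requirements.

Fuzzy matching. The method translation memory tools use to measure how closely a text segment matches previously translated sentences.

Globalization (g11n). The business activities involved in entering the global market, including proper internationalized design of the software, localization and integration, and the whole process of marketing, sales and support worldwide.

Glossary. A bilingual list of key words and phrases in the source and target languages, used in a software translation or localization project.

Hard-coding. Embedding localizable strings directly in program source code. Hard-coded strings are difficult to localize.

Help compiler. A tool for compiling online help. It compiles source files such as HTML, together with images, into a searchable binary online help file.

Hot key. The underlined letter or digit in a menu command or dialog box option. Pressing Alt together with the underlined letter or digit activates the command or option.

Hypertext Markup Language (HTML). An application of SGML that defines a set of tags controlling how page content is displayed.

Input Method Editor (IME). A utility for entering local-language text by pressing several keys in sequence. Common IMEs for Chinese include Microsoft Pinyin, the Standard input method and Wubi (five-stroke).

Internationalization (i18n). A software engineering approach in which functions and code are designed, during program design and documentation development, to handle multiple languages and cultural conventions, so that creating a new language version does not require redesigning the source code.

International testing. Testing of the software's internationalization support and localizability.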

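Kanji. Individual Japanese characters derived from Chinese characters. Some are written exactly like current Chinese characters but are pronounced in Japanese.

Kick-off meeting. The meeting held before a new localization project formally starts, usually attended by the main project representatives of the original software developer and of the localization vendor. It covers the project plan, each party's responsibilities, deliverables, contact arrangements and other matters closely tied to the project.

Layered graphic. An image in which translatable text is kept on a separate text layer to make translation easier.

Leverage. Reusing previously translated content during translation or localization.

Linguistic testing. Testing the language-related content of a localized product.

Localization Industry Standards Association (LISA). Founded in Switzerland in 1990, it became the leading association for the localization and internationalization industry and now has more than 400 members. LISA's aim is to promote the development of the industry and to provide mechanisms and services that let companies exchange and share information on processes, tools, technologies and business models relating to localization, internationalization and related topics.

Localization (l10n). Adapting a product to the needs of a particular country, region or language market so that it meets the linguistic and cultural requirements of users in that market.

Localization kit. A set of files, tools and guidance documents supplied by the software developer to the localization vendor before a localization project starts.

Localization testing. Testing the language and user interface of localized software to assure the quality of the localization.

Localization vendor. An organization that provides localization services, including software translation, software engineering, testing and project management.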

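Machine translation (MT). Methods and technology that use glossaries, grammar, syntax and similar techniques to translate automatically from one human language into another.

Markup language. A set of marks and tags combined with text; an application (for example an HTML web browser) processes the tags and displays the content in the correct form.

Multi-byte character set. A character set in which each character is represented by one byte or by two bytes.

Multi-language vendor (MLV). A vendor able to localize software into many languages. Most MLVs concentrate on project management for multilingual projects and have branch offices and partners around the world.

National language support (NLS). Software features that let users set their locale and that identify the user's language, date and time formats and similar information. NLS also covers support for keyboard layouts and language-specific fonts.

Outsourcing. In software localization, handing certain localization tasks to a third party. The original software developer hands the localization project to a localization vendor, and many localization vendors pass translation work on to freelance translators.

Portable Document Format (PDF). A file format developed by Adobe and based on the PostScript standard. PDF files can be created by other software and are mainly used for publishing electronic documents.

Pseudo translation. An automatic or manual process that replaces the translatable strings in the software with longer localized characters, mainly to uncover potential problems in compiling and running the localized files.

Quality assurance (QA). The steps and processes that assure the quality of the final product.

Request for quotation (RFQ). A document the software developer sends to localization vendors that describes the project and asks for a price quotation.

Return on investment (ROI). A measure of a project's returns relative to the money invested in it.

Resizing. Adjusting the size and position of elements in a translated dialog box, such as buttons, list boxes and static controls, so that the translated text is displayed completely and looks good.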

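Resource-only DLL. A dynamic link library that contains only localizable resources such as menus, dialog boxes, icons and on-screen prompt strings.

Screen capture (screenshot). Capturing menus, dialog boxes and other parts of the software interface with image capture software.

Post Project Report (PPR). A project document completed by the main project members of the software developer and the localization vendor after a localization project ends, covering how the project went, the problems found and recommendations.

Proofreading. Checking the language and formatting of translated documents, usually done by experienced editors inside the localization company.

Review. Checking the language of translated documents, usually done by experienced external product experts engaged by the localization company or by the software developer's language specialists.

Simplified Chinese (SC). The Chinese characters used mainly in mainland China and Singapore, with simpler strokes than traditional characters.

Simultaneous shipment (simship). Releasing the source-language software and the localized software at the same time. To achieve this, localization has to proceed in step with software development.

Single language vendor (SLV). A localization vendor that offers services in only one target language.

Software consistency check. A quality assurance step in which the translator checks that the translated user interface strings are consistent with the translations used in the online help and documentation.

Standard Generalized Markup Language (SGML). An international standard for information interchange. Defined tags are used in a document to describe three layers of its format: structure, content and style.

Terminology management system (TMS). Software that stores and encodes terminology resources in dictionaries, for example STAR's TermStar and Trados MultiTerm.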

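Text expansion. The tendency of translated text to take up more bytes and more length than the source text. For example, German and French translations are usually about 30% longer than the corresponding English.

Traditional Chinese (TC). The characters with more strokes used mainly in Hong Kong and Taiwan. Compared with Simplified Chinese, the character encoding scheme, terminology and language style all differ considerably.

Software Chinese build. The process of creating a Chinese version of the software from the source-language software; both Simplified Chinese and Traditional Chinese versions can be created.

Translation memory (TM). Technology that lets users store translated phrases and sentences in a database.

Translation memory exchange (TMX). An XML-based open standard designed by a group of software tool developers to simplify the automatic transfer of translation memories between different translation memory tools.

Unicode. A character set that assigns codes to the characters of known writing systems (16 bits each in the original design); it has become the worldwide character encoding standard.

User interface (UI). All the elements of the software that interact with the user, including dialog boxes, menus and on-screen messages.

UTF-8. A Unicode encoding form that is backward compatible with ASCII and covers most of the world's languages. UTF-8 is short for 8-bit Unicode Transformation Format.

Windows Help (WinHelp). An online help system consisting of compiled .hlp files and .cnt contents files. Windows Help is created from a set of RTF files and bitmap files.

Localization application. A specialized localization tool that supports reuse of software user interface resources, for example Alchemy Catalyst and Passolo.

Translation memory application. A specialized localization tool that works with a translation memory database, for example Trados WorkBench. It is used mainly for translating online help and documentation.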
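The full match and fuzzy matching entries above can be illustrated with a small sketch. It assumes a character-level similarity ratio from Python's standard `difflib` module; commercial translation memory tools use their own scoring, and the memory contents, the `best_match` helper and the 0.75 threshold are invented for illustration.

```python
from difflib import SequenceMatcher

# Hypothetical translation memory: source segment -> stored translation.
TRANSLATION_MEMORY = {
    "Click OK to save the file.": "单击“确定”保存文件。",
    "Click Cancel to close the dialog.": "单击“取消”关闭对话框。",
}

def best_match(segment, memory, threshold=0.75):
    """Return (score, source, translation) for the closest stored segment,
    or None when no stored segment reaches the threshold."""
    best = None
    for source, translation in memory.items():
        # ratio() is 1.0 for identical strings (a full match) and falls
        # toward 0.0 as the strings diverge (a fuzzy match).
        score = SequenceMatcher(None, segment.lower(), source.lower()).ratio()
        if best is None or score > best[0]:
            best = (score, source, translation)
    if best is not None and best[0] >= threshold:
        return best
    return None
```

A score of 1.0 corresponds to a full match; anything between the threshold and 1.0 would be offered to the translator as a fuzzy match to edit rather than used as is.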

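Back translation. Translating a document that has already been translated into another language back into the source language, preferably by an independent translator.

Character. A symbol representing the smallest abstract component of a writing system or script, standing for a sound, syllable, concept or element; contrasted with glyph.

Corpus. A large collection of natural language text used to gather statistics about natural language. A corpus often carries extra information, such as tags marking the part of speech of each word.

Globalization (G11N). Addressing the business issues involved in releasing a product worldwide, for example carrying out localization consistently across the whole company once the product has been properly internationalized and designed.

Internationalization (I18N). Making a product general enough that it can be adapted to many languages and cultural conventions without redesign.

Localization (L10N). Adapting a product or piece of software to a particular language and culture so that it seems to have been produced in that market. Thorough localization takes account of the language, culture, customs and other characteristics of the target region; it usually requires changing the writing system used by the software and may require changes to keyboard usage, fonts and date, time and currency formats.

Morphology. The study of how words are formed, including inflection, derivation and compounding. The term also refers to the study of the formation methods and combination rules of a particular language, and of the inflectional patterns themselves.

Original equipment manufacturer (OEM). A manufacturer that makes equipment and sells it to a final equipment maker, which resells it under its own brand or name.

Parser. A computer program that, given a set of sentences as input, identifies the structure of each sentence according to a given grammar. The term is sometimes also used more generally where the sentences consist of information units of any kind.

Pseudo-localization. Converting a product's strings into "pseudo strings". The resulting pseudo-language can be used to test how different aspects of localization affect the product's functions and appearance.

Quality assurance (QA). All the planned and systematic actions needed to ensure that a product or service satisfies stated quality requirements.

Tidy functions. Tidy is a binding for the Tidy HTML clean-and-repair utility; it lets users clean and otherwise manipulate HTML documents and can also be used to traverse the document tree.

Truncation. In display, cutting off the part of a line of text that extends beyond the edge of the display window. In database searching, adding a symbol to the end of a word so that the computer looks for all variants of that word.

Unicode. The Unicode international character standard is a character encoding standard for representing text for computer processing. Unicode was originally designed to support only about 65,000 characters, but its current encoding forms can support more than one million.

Usability. How easily users can navigate the interface, find information and acquire knowledge.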
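The pseudo-localization and text expansion entries in this glossary can be made concrete with a short sketch. The function below is hypothetical and not taken from any localization tool: it accents vowels so that hard-coded or unconverted strings stand out, pads the string by roughly the 30% expansion figure quoted in the glossary, and brackets the result so that truncation is visible on screen.

```python
# Map plain vowels to accented ones so pseudo-localized strings are
# still readable but obviously not the original English.
ACCENTED = str.maketrans("aeiouAEIOU", "àéîöûÀÉÎÖÛ")

def pseudo_localize(text, expansion=0.3):
    """Accent vowels, pad to simulate text expansion, and add brackets."""
    padding = "~" * max(1, round(len(text) * expansion))
    return "[" + text.translate(ACCENTED) + padding + "]"

print(pseudo_localize("Open file"))
```

If a dialog shows the opening bracket but not the closing one, the control is too small for expanded text. A real tool would also need to leave placeholders and markup untouched, which this sketch does not attempt.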

Source: 中国本地化网
http://www.globalization.com.cn/

Teaching Machine Translation Evaluation by Assessed Project Work

Judith Belam
Foreign Language Centre, University of Exeter
Queen’s Building
The Queen’s Drive
Exeter EX4 4QH
UK
J.M.Belam@ex.ac.uk


Abstract

The paper describes the use of an assessed independent study project on machine translation evaluation as part of a final-year undergraduate course on machine assisted translation. The advantages and potential drawbacks of the use of such a component are outlined. The usefulness of studying MT evaluation is underlined, as it includes consideration of many aspects of MT. The suitability of teaching MT to language learners is considered, as are the transferable skills students can hope to gain after following the course.

Introduction

The course, entitled “Machine-Assisted Translation” (MAT) is for final-year undergraduates in Modern Languages. It runs over one semester and is worth 15 credits of a 120-credit study programme. Students will have achieved a good working competence in at least one foreign language and will have spent a year in the target language country. They may have used computer-assisted language learning materials but no knowledge of MAT tools is assumed before they start the course.

The MAT course is not primarily intended as a language teaching course. Students will continue their language learning concurrently on a course consisting of oral classes, essay writing and short literary texts for translation into and out of the target language. The MAT course on the other hand aims to provide a broadly based introduction to MAT including the basics of text processing, practical aspects like pre- and post-editing and dictionary creation, and the history of MAT.

The assessment of the course falls into two parts. 50% of the marks are given for performance in a 2-hour written examination and 50% for an independent study project. The independent study project consists of an evaluation of one or more MT systems. The guidelines given to the students are as follows:

The portfolio takes the form of a report on an evaluation of a machine translation system, which you will design and carry out. You will be studying the subject of machine translation evaluation formally in weeks 6 and 7, and you will then be ready to proceed with the preparation of your portfolio. It is expected that you will begin with a short introduction outlining the principles of machine translation evaluation; this will include types of evaluation and questions of why, by whom and for whom evaluations may be carried out. You will then describe the type of evaluation you yourself have decided to do: you may be focussing on output (accuracy, readability, coherence …) or on the system itself (ease of use of dictionaries, number of additional features available, handling of text within word processing programs …); or you may have decided to look at the way the program handles specific problems (subject domains, or specific grammatical structures); or you may address a practical problem like how much texts need to be pre- or post-edited in order to arrive at an acceptable translation. You will give your criteria, explaining why you have chosen them, and describe the way in which you have decided to test the system. Your evaluation may be comparative (Systran vs. Globalink) or may focus on one of the systems. Finally you will report on the process of your evaluation, give details of the tests you used and how you arrived at your conclusions.

The whole portfolio will not exceed 3,000 words; this does not include any examples you give. Machine-translated texts and their originals are not included in the word count and should be attached in a separate appendix.

As you will see from the programme there are several sessions devoted to supervised work on this assignment and it is expected that you will discuss your methods and progress with the tutor in the early stages.

Students therefore design and carry out their own mini-evaluation. Examples of some subjects which were chosen were:
· terminology and dictionary tools in Systran and Globalink
· the translation of phrasal verb constructions
· comparative evaluation of Systran and Globalink’s translations of children’s non-fiction
· comparative evaluation of the translation of jokes

Supported self-study

There is an inherent contradiction in the concept of a self-study project which is to be assessed and will contribute to the final mark gained by the student. For it to form part of the students’ learning the tutor must provide guidance and feedback. On the other hand, for it to form a fair part of the assessment, it must represent the student’s own unaided work and independent competence. We have tried to satisfy these conflicting requirements by providing extensive support while the projects are in preparation, but leaving the student to produce the final version on their own. In practice students are well aware of the unwritten rules governing this type of project and rarely seek to gain advantage by asking for too much help in the later stages.

Why teach MT evaluation?

Full-scale MT evaluation is a specialist field clearly beyond the scope of a short university course. Students are not expected to do extended research into the field itself, but have two lectures outlining the basic principles and are given some additional reading (Hutchins and Somers 1992, Trujillo 1999, Somers in press). Within the limited scope of their projects, and working entirely in the abstract, it is unlikely that they will arrive at practically useful conclusions about any particular system. However, the value of including a study of evaluation in the course is 3-fold.

(i) It is more than likely that language graduates will be expected to consider the use of MT systems at some stage in their career. Having studied MT evaluation in this way, they will be equipped to provide a sensible and realistic answer to obvious questions like: “Is MT any good?” “Can it save us money?” “Which system should we buy?” – questions which are easily asked but not so easily answered.

(ii) Evaluation of MT output requires students to take into account many other areas covered during the course. They are encouraged to consider not just the raw output but related questions like the amount of pre- or post-editing that is necessary or the impact of using the dictionary tools. In this way the project constitutes an important stage of their study of MT, when they must bring together all they have learned and consider how it is put to use. The project is marked according to the following criteria:

1. Knowledge of the principles of evaluation
2. Awareness of how project chosen fits into the context of these principles
3. Realism of project
4. Appropriate development of evaluation method, e.g. creation of scale
5. Choice of source materials appropriate for MT
6. Awareness of characteristics of texts/materials chosen
7. Rigorous performance and analysis of evaluation
8. Critical assessment of project and suggestions for improvement

(iii) Finally, it obliges students to examine their preconceptions and their first impressions of MT. These may vary widely (Gaspari 2001). Some students, comparing instinctively with the careful manual translation which their concurrent language courses require, tend to be very dismissive of raw MT output. Others, who may already use it for translating material online, tend on the other hand to overestimate its capabilities. At least one student was rash enough to confide that she was in the habit of using Babelfish for a first draft for her literary translation work! Work on MT evaluation must be set in some sort of context, and this forces all students to realise that the question “How good is this translation?” is just about as useful as asking “How long is a piece of string?” They must define in what circumstances the translation is likely to be made, for whom and why.

The success of the independent project component

The exercise has proved to be a very valuable one. In my experience an assessed self-study component in general is an extremely effective way of motivating students. Some reasons for this are rather negative: they cannot leave the work until the last minute, or leave out vital areas in the hope that the exam will not test them on a particular aspect. However there are also more positive advantages.

Requiring students to choose their own topics has obvious benefits. They tend to become more interested in a subject which they have chosen themselves, and in which they have invested a lot of time and effort. They are more engaged with the project from the outset; they remain more committed; they work more independently. They are less likely to become bored, or crib from textbooks or from each other. A further benefit is that they will inevitably come up with ideas the tutors would not have thought of, thus broadening the whole scope of the course.

In some cases the process of choosing the topic turned out to be a useful learning experience on its own. Some students started on one area, only to discover unexpected problems or difficulties which necessitated a change of direction, and often this change taught them as much as the subsequent work on the project. Some, for example, arrived at a better understanding of the difficulty of arriving at an absolute standard for evaluation when they discovered that it was easier to compare two systems than to concentrate on just one. Another candidate began looking at children’s non-fiction, to realise that the translation of entire books by MT would be an unlikely use for the system. He refined the scope of his investigation so that his source texts consisted of short passages suitable for introducing displays designed for children in exhibitions or museums. Within this more practical context he was able to go on to produce a more reliable evaluation, and he arrived at a more realistic appreciation of the contexts in which MT is likely to be used.

Some difficulties, however, also arose with the choice of topic. Some weaker students had enormous difficulty choosing a topic and required very detailed guidance, which raises questions about the fairness of the exercise, given that it counts towards the final mark. It is hard to take into account objectively the amount of help given to an individual. Secondly, some chose topics which were much more difficult than others. The candidate who chose to look at the translation of humour plainly had an enormous amount of background research to do on the subject of what constituted a joke, how far the humour lay in the language and how far in the subject matter, cultural background, personality of the reader, etc. Compared with this, the student who decided to compare the usefulness of the dictionary tools in Systran and Globalink had a relatively easy ride. Once again this proved difficult to take fairly into account when assigning a final mark. Finally, and this is a difficulty related to teaching MT evaluation in general, some students had trouble limiting the scope of their enquiry. Sometimes this was simply due to an inadequate appreciation of the whole context in which the evaluation should be considered. One very general project, for example, simply took a series of different texts, including extracts from a children’s fairy tale, a videogame instruction manual, a commercial order form, and a medieval epic poem. Needless to say 3000 words were insufficient to cover all the issues raised. Sometimes the problem arose because students had not understood the level of detail required in their analysis. Another candidate attempted to analyse passages with specialised terminology taken from twelve different subject domains and translated into both French and German. My advice to reduce this enormous task resulted only in a reduction to seven topics and the resulting project contained more description than evaluation and analysis, as the student was overwhelmed with the sheer volume of material to be examined.

Is the study of MT evaluation suitable for language learners?

This course does not set out to be a language course as such. It assumes a good level of knowledge of the foreign language in order to learn about MT. However, the students are all language learners and it is therefore important to consider the suitability of the subject. Two main areas need to be addressed:

(i) It is important to consider whether it is appropriate to expect language learners to give an opinion on the relative quality of a given translation. In all their other language work the greatest care is taken not to expose them to incorrect models, and it is the role of the teacher to assess quality and to reject anything which is inadequate. Work on MT evaluation, on the other hand, not only assumes that the learner is competent to give an opinion, but also expects him to give a relative judgment, and to accept as adequate for a given circumstance a version which would otherwise be thrown out as “wrong”.

I believe that while this difficulty must be recognised, it should not be overstressed. Firstly, the whole concept of “fit-for-purpose” translation, discussed in more detail below, is so important that it must override the principle of not using incorrect models in teaching. Secondly, students are more than capable of making these distinctions for themselves. After half an hour spent trying out the systems they will be well aware of the difference between raw output and high-quality human translation, and so long as this difference is constantly borne in mind the “damage” done by exposure to errors will be limited.

(ii) It is perhaps more important to ask whether teaching MT evaluation is actually helping students’ language learning. In theory, the course is not specifically designed to improve students’ language skills. Furthermore, students’ competence in the language does not form part of the assessment criteria, especially as the course is designed to be non-language-specific.

In practice their competence does of course affect the quality of the work submitted as part of the portfolio. Students frequently expressed the concern (in some cases with a certain amount of justification) that they were not able to judge the appropriateness of a particular version because they did not feel they had a good enough appreciation of the language. In this way the course can in practice contribute towards improvement of their knowledge of the foreign language. A consciousness of any gaps or uncertainties can only be useful, and did appear to lead in many cases to the kind of intensive dictionary and grammar work which classical translation courses aim to stimulate.

Transferable skills

When the course was first set up considerable attention was paid to the skills students could expect to acquire which would be of use to them in areas of activity outside the field of MT (Belam, 2001). The independent study project has indeed furthered the acquisition of some of these skills, although students are sometimes resistant to being expected to improve their competence in areas they do not see as specifically related to the course (“I didn’t come here to improve my English” was a complaint heard more than once, even though a thorough competence in the mother tongue is an essential prerequisite to being a good translator). In particular students gain a broader understanding of an area which is usually formally avoided on conventional language courses, that of the value of imperfect communication. I have described how, when preparing their projects, they are encouraged to consider the type of texts they are using, and in particular the circumstances in which they will be translated. The quality of the translation is then assessed not in the abstract but in the particular conditions which have been specified, and a relatively poor translation can sometimes be seen as perfectly adequate for a particular purpose. One candidate, for example, took online newspaper articles as his source material and showed a good understanding of the quality of translation which would be required, explaining: “In the case of a newspaper report which just reports facts and not feelings, the loss of the writer’s style or register is not that important. The sense and meaning of the article should still be portrayed when the text has been translated”. This awareness of the “fit-for-purpose” translation is a valuable element of the course.

There have also been some unexpected additional gains which have come out of the course. The projects were demanding and complex to design and set up, and several students underestimated the time it would take to assemble, analyse and present their findings. This appreciation of the importance of the “writing-up stage” will be important to students taking up any form of research or report writing. Several of them were introduced to the use of metrics and statistical methods, and some of the issues associated with interpretation and presentation of the figures they produced. They were all brought to consider the importance of the computer/user interface, and the value of understanding some of the workings of the system in order to get the best out of it.

The most important unexpected effect of doing the course, however, was what one could call an increased linguistic awareness. At least one student said to me partway through the course that he found that the course was having a distracting effect on his other study, as he had got into the habit of noticing linguistic features of what he was reading which would make a text suitable or unsuitable for machine translation. “It’s a whole new way of looking at language” were his words. He was almost complaining at the inconvenience, but at the same time he realised that he was in the process of gaining a whole new perspective on language.

Conclusion

The inclusion of an independent self-study project on evaluation in the MAT course has proved to be a very valuable aid to students’ learning. It is a demanding exercise which furthers understanding not only of evaluation techniques but of MAT as a whole.

References

Belam, Judith (2001). Transferable Skills in an MT course. MT Summit VIII Workshop on Teaching Machine Translation, Santiago de Compostela, pages 31-34.

Gaspari, Federico (2001). Teaching Machine Translation to Trainee Translators: a Survey of the Knowledge and Opinions. MT Summit VIII Workshop on Teaching Machine Translation, Santiago de Compostela, pages 35-44.

Hutchins, W.J. and Somers, H.L. (1992) An Introduction to Machine Translation. London, Academic Press.

Somers, H.L. (in press) Computers and Translation: A Handbook, to be published by John Benjamins; copy made available in draft form by kind permission of the author for consultation by our students. On evaluation, specifically the chapter by John White.

Trujillo, Arturo (1999) Translation Engines: Techniques for Machine Translation. London, Springer.

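Machine Translation

Wei Li (李维), Vancouver, Canada

Machine translation, also called automatic translation, is the translation of language by an electronic computer following a specified algorithm. It is one of the main research areas of computational linguistics.

Machine translation is usually supported by a machine dictionary and a base of linguistic rules, and its object is natural language. It is a natural language processing application. Its counterpart in system software automatically translates programs written in a computer language into executable machine code; in computer science this is called a compiler or interpreter. Compiler theory and technology are already quite mature, and they have much in common with the machine translation of natural language.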

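Compared with computer languages, natural language has two distinctive features.

First, homography and polysemy are pervasive in natural language. At the lexical level, words with several senses and words that share a form across parts of speech are everywhere, and the more common a word is, the more senses and uses it has. At the syntactic level, structural ambiguity is also widespread: one and the same structure may express several meanings and relations. Resolving homography and polysemy is therefore the first task of machine translation.

Second, natural language is a unity of opposites between regularity and idiosyncrasy. There is hardly a grammar rule in natural language without exceptions. However, if the rules of a language are organized into a hierarchy running from the specific to the abstract, with the levels of specific and general rules distinguished and the links between them established, the conditions for resolving this contradiction are in place. How the algorithm of a machine translation system handles the relation between the specific and the general therefore largely determines the system's prospects.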

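Machine translation usually comprises five stages: source language input; source language analysis; transfer from source to target language; target language generation; and target language output.

For written text, input and output are purely technical stages. Speech translation must also give the computer the ability to listen and speak, which is the subject of speech recognition and speech synthesis.

The result of source analysis is expressed in some intermediate representation. Transfer consists of lexical transfer and structural transfer, and reflects the contrastive differences between source and target language. Generation is the inverse of analysis. Only transfer, then, has to deal with both languages at once; source analysis and target generation can be independent of each other. This design is called the transfer approach and is the mainstream of current machine translation systems. Transfer can of course also be folded into analysis or generation, giving automatic translation by the so-called direct approach.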
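The analysis, transfer and generation stages can be sketched on a toy scale. The example below is only an illustration under strong assumptions: it handles a single English noun phrase of the form determiner, adjective, noun, uses a three-word hypothetical lexicon, and translates into French because that pair shows a simple structural difference. None of the names come from the article.

```python
# Toy bilingual lexicon (hypothetical entries, chosen to avoid agreement issues).
LEXICON = {"the": "la", "red": "rouge", "car": "voiture"}

def analyze(sentence):
    """Source analysis: map a 'det adj noun' phrase to an intermediate form."""
    det, adj, noun = sentence.lower().split()
    return {"type": "NP", "det": det, "adj": adj, "noun": noun}

def transfer(tree):
    """Transfer: lexical transfer (word lookup) plus one structural change,
    placing the adjective after the noun in the target structure."""
    return {
        "type": "NP",
        "order": ["det", "noun", "adj"],
        "det": LEXICON[tree["det"]],
        "adj": LEXICON[tree["adj"]],
        "noun": LEXICON[tree["noun"]],
    }

def generate(tree):
    """Target generation: linearize the target structure into a string."""
    return " ".join(tree[slot] for slot in tree["order"])

def translate(sentence):
    return generate(transfer(analyze(sentence)))

print(translate("the red car"))
```

Only `transfer` refers to both languages; `analyze` knows nothing about French and `generate` knows nothing about English, which is the independence the transfer approach relies on.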

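The direct and transfer approaches each have strengths and weaknesses. A direct system is compact, its translation process is fairly intuitive, and its rules can readily be written with reference to existing bilingual dictionaries, contrastive grammars and the translation experience accumulated over generations. Its main weakness is that, because analysis and generation are not independent, neither can go very deep; the direct approach is also unsuited to automatic translation among many languages. The transfer approach has drawbacks too. Analysis can go deeper, but there is an extra stage and a good deal of extra interface information, and if these are handled badly translation quality suffers. Moreover, between languages of different families, high-quality translation requires a transfer module (mainly lexical transfer) so large that it is out of proportion to the analysis and generation modules, which comes close to a return to the direct approach. For automatic translation between two very different languages, it seems, the direct approach is still quite effective.

How far analysis proceeds before transfer takes place is determined by the system's design goals, the material it processes and the depth of the research behind it. As the figure above shows, the deeper the analysis, the less transfer is needed, until in the end there is none. It is interesting to examine the two extremes: (1) translation with transfer only; (2) translation with no transfer.

Translation with transfer only is one-to-one translation, which needs neither analysis nor generation. Translation is merely a mechanical process of database lookup and matching, with no understanding involved. It should be pointed out that for pure idioms and set expressions this method is not only effective but often necessary.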

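The other extreme is automatic translation that rests on full understanding and needs no transfer, which in essence simulates the human translation process. Only then is source analysis true natural language understanding, and only then does machine translation truly belong to artificial intelligence. Two hard problems arise here, however: knowledge processing, and the so-called metalanguage problem.

Observation of human translation shows that people translate on the basis of understanding, relying on rich knowledge. This knowledge includes both linguistic knowledge and world knowledge (common sense, specialist knowledge and so on). How to organize such all-embracing, encyclopedic knowledge so that machines can process and use it is a fundamental problem for artificial intelligence.

On the other hand, the fact that humans can exchange ideas through language, and that languages can be translated into one another, means there must be some common basis; otherwise all communication and translation would be inconceivable. Concepts, or more precisely conceptual primitives (the elements from which concepts are built), are the same for all humanity. The logical relations and structures that hold between concepts are likewise common to all. If this common ground could be worked out and defined as a metalanguage, with source analysis taking the metalanguage as its final representation and target generation taking it as its starting point, no transfer would be needed at all. Source analysis and target generation would then be completely independent, and each language would need only one analysis and generation system targeting the metalanguage in order to be translated automatically into any other language. The study of metalanguage is a hard problem in cognitive science that awaits the joint efforts of linguists, logicians, psychologists, mathematicians and philosophers. Significantly, machine translation researchers have designed various approximations to a metalanguage as interlinguas for automatic translation among several languages, and have gained some results and experience.

In short, although the ultimate way forward for machine translation lies in breakthroughs in the theory and technology of artificial intelligence, overemphasizing the AI nature of machine translation before conditions are ripe, and single-mindedly pursuing knowledge-based, understanding-based automatic translation, often does little good for the development of applied machine translation systems.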

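Apart from these two extremes, machine translation systems are broadly divided into three generations according to the level at which transfer takes place. The first generation is word-for-word linear translation, whose core is a bilingual dictionary plus simple morphological processing (stripping and adding endings).

First-generation systems cannot rearrange word order or recognize structural ambiguity, let alone disambiguate polysemous words.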
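A first-generation system of this kind can be sketched in a few lines. The sketch below is illustrative only: the English-to-French dictionary and the suffix list are hypothetical, and it implements just the suffix-stripping half of the morphological processing (no endings are added back on the target side).

```python
# Hypothetical bilingual dictionary keyed by English base forms.
DICTIONARY = {"cat": "chat", "eat": "manger", "fish": "poisson", "the": "le"}
SUFFIXES = ("ing", "ed", "s")

def strip_suffix(word):
    """Remove a known suffix if what remains is a dictionary entry."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and word[: -len(suffix)] in DICTIONARY:
            return word[: -len(suffix)]
    return word

def word_for_word(sentence):
    """Translate each word independently, in the original order."""
    out = []
    for word in sentence.lower().split():
        stem = strip_suffix(word)
        # Unknown words are passed through unchanged.
        out.append(DICTIONARY.get(stem, word))
    return " ".join(out)

print(word_for_word("the cats eat fish"))
```

The output keeps the source word order and ignores number and agreement, which is exactly the limitation attributed to first-generation systems.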

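Second-generation systems emphasize syntactic analysis, and so can derive the surface structure of a sentence and the syntactic relations among its elements (the result of analysis usually takes the form of a structure tree with information on its nodes). They can therefore transform syntactic structure and adjust word order according to the contrastive differences between source and target language, a leap from linear translation to translation with structural hierarchy. Without semantics, however, they can recognize syntactic ambiguity but cannot choose the appropriate reading, and the problem of word sense disambiguation remains largely unsolved.

Third-generation systems are based mainly on semantic analysis and focus on revealing the deep structure of sentences and the logical relations among their elements; they can resolve most structural ambiguity and word sense ambiguity.

At present most machine translation systems belong to the second generation, or fall between the second and third. Third-generation systems built purely around semantic analysis have been tried only in small-scale experiments (Wilks, 1971), though with notable results. From an engineering and practical standpoint, combining syntactic and semantic analysis in the development of large commercial MT systems best matches the current state of research and practical needs.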

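In terms of method, separating linguistic rules from the algorithm was a major advance in automatic translation technology; the algorithm thereby becomes the system's controller and the interpreter of its rules. Early machine translation systems had no dedicated rule base; the rules were written into the program, which had three serious drawbacks: first, every change to a rule meant changing the program; second, the algorithm could not be made more abstract, which limited the depth and efficiency of language processing; third, it hindered the division of labour between linguists and computer specialists.

It is worth stressing that separating rules from the algorithm only makes adding, deleting and changing rules convenient in form. Real convenience depends on the structure of the rule system, specifically on how independent the rules are of one another. If rules depend on each other, so that touching one affects the whole, there is no freedom to modify them. Once such a tangled rule system passes a certain size it can no longer be improved: fixing one rule breaks another, things get worse with each change, and the system may finally have to be scrapped. After separating rules from the algorithm, therefore, it is necessary to insist on separating rules from rules.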
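The separation of rules from algorithm can be sketched as follows. This is a minimal illustration, not a reconstruction of any system the article mentions: the rule format, the tag names and the `apply_rules` interpreter are all assumptions. Rules are plain data, each one self-contained, and the interpreter knows nothing about any particular language.

```python
# Rules are data: (source tag pattern, target order given as indices).
# Each rule stands alone, so one can be added or removed without editing
# the interpreter or any other rule.
RULES = [
    (("DET", "ADJ", "NOUN"), (0, 2, 1)),  # adjective moves after the noun
    (("ADV", "VERB"), (1, 0)),            # adverb moves after the verb
]

def apply_rules(tagged, rules):
    """Generic interpreter: scan left to right and apply the first rule
    whose tag pattern matches at the current position."""
    out, i = [], 0
    while i < len(tagged):
        for pattern, order in rules:
            window = tagged[i : i + len(pattern)]
            if tuple(tag for _, tag in window) == pattern:
                out.extend(window[j] for j in order)
                i += len(pattern)
                break
        else:
            # No rule matched here: copy the token through unchanged.
            out.append(tagged[i])
            i += 1
    return out

sentence = [("the", "DET"), ("red", "ADJ"), ("car", "NOUN"),
            ("quickly", "ADV"), ("stops", "VERB")]
print([word for word, _ in apply_rules(sentence, RULES)])
```

Because the two rules match disjoint tag patterns, changing one cannot alter the behaviour of the other, which is the kind of rule independence the passage argues for.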

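With the arrival of the information society, the low efficiency of human translation falls far short of society's needs, and computer help with translation is urgently needed. A number of machine translation systems are already on the market or in use around the world, and more are under active development. High-technology English-Chinese machine translation products have also come to market. In mainland China, following the sensational debut of 译星, two more English-Chinese systems have been released in recent years, one developed by the Institute of Linguistics of the Chinese Academy of Social Sciences together with 北京高立电脑公司, the other a Chinese Academy of Sciences project under the 863 Program, and competition is growing fiercer. Over more than 40 years of development, machine translation has steadily deepened its understanding of language and developed many effective language processing techniques. Its prospects are encouraging.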

By Wei Li LIO@SFU.CA (In GB code)

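This article is an entry the author was invited to write for a science and technology reference work, lightly revised here. It aims to be thorough yet accessible, reflecting the current state of the field while remaining easy for the general reader to follow.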

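Macau's Changes


I have always liked Macau: its cultural atmosphere, its air of antiquity and its small southern European charm. The people of Macau are hospitable and unpretentious, and the pace of life is slow. Walking Macau's streets is leisurely and carefree, like a stroll in the countryside, quite unlike the suffocating tightness of Hong Kong.

On a recent return visit, however, I found that Macau had changed. In little more than a year, the leisurely, spacious feeling Macau used to give me had gone, because there are too many tourists and the streets are full of people.

Catching a ride on 新馬路 (San Ma Lo) was the worst experience. Every bus was full and the street was lined with tourists hailing taxis, but taxis were pitifully few and often would not stop. With two small children in tow, I waited anxiously for half an hour before getting one. I asked the driver whether the spot was a restricted zone where taxis could not stop. He said it was, but that stopping did no harm because the police did not enforce it; taxis did not stop because drivers were "picking their fares", not because of the restriction.

The driver said that with demand so strong, a Macau taxi licence now changes hands for four million patacas, and that on holidays he can take in more than a thousand patacas a day, a very good income.

The ill effects of heavy demand for cars can be seen everywhere in Macau. Some main roads are frequently jammed, and people have to spend much more time on travel.

Another bad experience was tourists fouling the sights. Outside the Macau Museum, a mainland tourist with a Fujian accent suddenly sneezed mucus and saliva from nose and mouth at once, nearly hitting my daughter, who shrieked in anger. I thought we had escaped, but the tourist promptly let fly a second time, and my shoe unluckily caught a little, leaving me ill at ease all day.

On the way back to Hong Kong, someone told us taxis would be easier to find outside the Lisboa. There turned out to be not a single taxi at the Lisboa, and at least a hundred people queuing. Afraid of missing the ferry, we decided to pay five times the fare for a pedicab, and that short ride of ten-odd minutes was the part my daughter found most exciting and fun.

Originally published: 2005-07-26, 世界日報 (World Journal), Arts and Culture section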

Eurodicautom, the EU Multilingual Term Bank


Eurodicautom is the European Commission's multilingual term bank. Eurodicautom covers a broad spectrum of human knowledge, but is particularly rich in technical and specialised terminology (agriculture, telecommunications, transport, legislation, finance) related to EU policy. New data are added constantly by a team of Commission terminologists, supported by technical staff, using a special program (Edictor) to process terminology obtained from Commission terminologists, translators, linguists from other European and international institutions, research centres, publishers, private experts, etc.

Entries are classified into 48 subject fields (ranging from medicine to public administration). A typical entry contains the term itself and its synonyms, together with definitions, explanatory notes, references, etc. At present the term bank contains about five and a half million entries (terms and abbreviations), subdivided into more than 800 collections.

http://europa.eu.int/eurodicautom/Controller

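The Development of Full-Text Translation Technology

Author: 闫宏志, 26 March 2003. Source: 中国计算机报-赛迪网

Since the 1990s, methods in machine translation have fallen into two broad classes: rule-based methods and corpus-based methods. Rule-based methods are the traditional ones, while corpus-based methods developed gradually from the 1980s onward. Rule-based machine translation (MT) can be further divided into transfer-based and interlingua-based methods, and corpus-based methods into statistics-based and example-based methods. Because no single method yields satisfactory results, the multi-engine idea naturally became a way to improve machine translation quality, and it does work. Multi-engine methods are now widely used in the development of MT systems.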

基于规则的机器翻译   

基于规则的机器翻译的技术是最成熟的,也是到目前为止应用最广的,目前有影响的机器翻译系统都是基于规则的。基于规则的机器翻译系统就是对语言语句的词法、语法、语义和句法进行分析、判断和取舍,然后重新排列组合,生成等价的目标语言。   

基于中间语言的方法是对源语言进行分析后产生一种成为中间语言的表示形式,然后直接由这种中间语言的表示形式生成目标语言。所谓中间语言就是自然语言的计算机表示形式的系统化,它试图创造出一种独立于各种自然语言,同时又能表示各种自然语言的人工语言。   

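Rule-based MT has by now become relatively mature. After long effort, researchers have built rule bases containing thousands upon thousands of rules that cover a considerable range of linguistic phenomena, but in theory the approach remains quite limited. A language is accumulated by a people over thousands of years; it rests on convention yet keeps evolving. As society develops, new words and linguistic phenomena keep appearing. However many rules an existing MT system has, they only generalize and summarize particular linguistic phenomena. Rule-based MT has therefore borrowed strengths from other methods and changed in many ways, mainly in the following respects: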

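In rule acquisition, the traditional approach relied mainly on linguists writing and debugging rules by hand, whereas more emphasis is now placed on acquiring rules from corpora (for example, with error-driven learning algorithms);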

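The traditional approach tended to emphasize coarse-grained, global, wide-scope linguistic rules, whereas more emphasis is now placed on fine-grained, local, narrow-scope linguistic knowledge, giving rise to a trend toward "small rule base, large dictionary";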

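In knowledge representation, plain context-free rules are usually extended in some way so that translation knowledge can be described at a finer granularity and more accurately;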

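The traditional approach usually applied deterministic, either-or principles, which made systems less robust, whereas today's rule systems generally introduce some form of probability or scoring function, which improves robustness.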
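The contrast between either-or rule firing and scored rule selection can be sketched in a few lines of Python. This is only an illustration: the rule table, the cue words and their weights are all hypothetical and do not come from any system described in the article.

```python
# Hypothetical rules for translating the English word "bank" into Chinese.
# Each rule lists context words that support it, with an invented weight.
RULES = [
    {"target": "银行", "cues": {"money": 2.0, "loan": 2.0, "account": 1.5}},
    {"target": "河岸", "cues": {"river": 2.0, "water": 1.0, "fishing": 1.5}},
]


def deterministic_choice(context):
    """Either-or selection: a rule fires only if ALL of its cues are present."""
    for rule in RULES:
        if all(cue in context for cue in rule["cues"]):
            return rule["target"]
    return None  # no rule matched completely, so the system gives up


def scored_choice(context):
    """Scored selection: sum the weights of the cues present, keep the best rule."""
    def score(rule):
        return sum(w for cue, w in rule["cues"].items() if cue in context)
    return max(RULES, key=score)["target"]


context = {"we", "went", "fishing", "by", "the", "river"}
print(deterministic_choice(context))  # None: "water" is missing
print(scored_choice(context))         # 河岸
```

The deterministic version fails as soon as one cue is absent, while the scored version degrades gracefully, which is the kind of robustness gain the last item in the list refers to.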

Corpus-Based Machine Translation Methods

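Statistics-based and example-based MT both use a corpus as the source of translation knowledge. They differ as follows. In statistical MT, knowledge is represented as statistical data rather than as the corpus itself; translation knowledge is acquired before translation begins, and the corpus is no longer consulted during translation. In example-based MT, the bilingual corpus is itself one form of translation knowledge (not necessarily the only one); knowledge acquisition is not fully completed before translation, and the corpus is still queried and used while translating.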

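The mathematical model of statistical translation was proposed by Brown et al. at IBM. The basic idea of statistical MT is to treat translation as a process of information transmission and to explain it with a channel model. Suppose a text S in the source language passes through a noisy channel and becomes a text T in the target language; in other words, suppose the target-language text T was obtained from a source-language text S by some strange encoding. The goal of translation is then to recover S from T, which is a decoding process.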
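In the notation of this passage (T is the observed text, S the text to be recovered), the decoding step is conventionally written with Bayes' rule. The formula below is the standard noisy-channel formulation, given here as a reading aid rather than quoted from the article:

```latex
\hat{S} \;=\; \arg\max_{S} P(S \mid T)
        \;=\; \arg\max_{S} \frac{P(S)\,P(T \mid S)}{P(T)}
        \;=\; \arg\max_{S} P(S)\,P(T \mid S)
```

The denominator P(T) can be dropped because it does not depend on S. P(S) is usually called the language model and P(T | S) the translation model.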

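The statistical MT problem decomposes into three subproblems: parameter estimation for the language model; parameter estimation for the translation model; and search, that is, finding the best translation. Statistical methods require large bilingual corpora. The accuracy of the translation-model and language-model parameters depends directly on the amount of corpus data, and translation quality depends mainly on the quality of the probability model and the coverage of the corpus. The simplifications made to the translation and language models also introduce defects, so there is a trade-off between simplification and feasibility. Statistical methods do not depend on large amounts of hand-built knowledge; they rely directly on statistical results for disambiguation and for choosing among translations, sidestepping many of the hard problems of language understanding. However, selecting and processing the corpus is not only a large engineering effort; the corpus must also resemble the material actually being translated. For this reason, general-domain MT systems are rarely built primarily on statistical methods.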
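The three subproblems can be made concrete with a toy decoder. The sketch below assumes the two models have already been estimated and simply stores them as small lookup tables; every sentence and probability in it is invented for illustration, and the exhaustive `max` stands in for the search procedure a real system would need.

```python
import math

# Hypothetical toy probabilities, invented purely for illustration.
# Language model P(S): how plausible each candidate S is on its own.
LANGUAGE_MODEL = {
    "the house is small": 0.6,
    "small the house is": 0.1,
    "the home is little": 0.3,
}

# Translation model P(T | S), keyed by (T, S): how likely the observed
# text T is, given that the original text was S.
TRANSLATION_MODEL = {
    ("das haus ist klein", "the house is small"): 0.5,
    ("das haus ist klein", "small the house is"): 0.5,
    ("das haus ist klein", "the home is little"): 0.2,
}


def log_score(observed_t, candidate_s):
    """Return log P(S) + log P(T | S), or -inf if either factor is zero."""
    p_s = LANGUAGE_MODEL.get(candidate_s, 0.0)
    p_t_given_s = TRANSLATION_MODEL.get((observed_t, candidate_s), 0.0)
    if p_s == 0.0 or p_t_given_s == 0.0:
        return float("-inf")
    return math.log(p_s) + math.log(p_t_given_s)


def decode(observed_t):
    """Search step: return the best-scoring candidate S, or None if none fits."""
    best = max(LANGUAGE_MODEL, key=lambda s: log_score(observed_t, s))
    if log_score(observed_t, best) == float("-inf"):
        return None
    return best


print(decode("das haus ist klein"))  # the house is small
```

Two candidates have the same translation-model probability here, and the language model breaks the tie in favor of the fluent word order. That division of labor is why the two models are estimated separately.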
 

Example-Based Methods (EBMT)

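The idea of example-based machine translation (EBMT) was first proposed by the noted Japanese MT expert Makoto Nagao (长尾真; Nagao, M.). The basic premise is to translate without deep analysis, using only existing experience and the principle of analogy. A human translator first decomposes the input sentence correctly into phrase fragments, then translates those fragments into phrases of the other language, and finally combines those phrases into a full sentence. Each fragment is translated by analogy. The basic principle is simple to state: the system's main knowledge source is a bilingual base of translation examples; whenever a source-language sentence S is input, the system finds the sentence S' most similar to S and imitates the translation T' of S' to construct the translation T of S, which it then outputs. The method needs a very large corpus to support it, and building that corpus requires a great investment of labor and resources.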
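The retrieval half of this principle (find the S' most similar to S) can be sketched with Python's standard `difflib` module. The example base, the word-level similarity measure and the helper names below are assumptions made for illustration; the article does not specify how similarity is computed, and the adaptation of T' into T is left out.

```python
from difflib import SequenceMatcher

# Hypothetical bilingual example base: (source sentence S', translation T').
EXAMPLES = [
    ("where is the station", "车站在哪里"),
    ("where is the hotel", "旅馆在哪里"),
    ("how much is this", "这个多少钱"),
]


def similarity(a, b):
    """Word-level similarity in [0, 1], using difflib's matching-block ratio."""
    return SequenceMatcher(None, a.split(), b.split()).ratio()


def most_similar_example(sentence):
    """Return the (S', T') pair whose source side is closest to the input."""
    return max(EXAMPLES, key=lambda pair: similarity(sentence, pair[0]))


def differences(example_source, sentence):
    """List the word-level edits that turn the example S' into the input S."""
    a, b = example_source.split(), sentence.split()
    ops = SequenceMatcher(None, a, b).get_opcodes()
    return [(tag, a[i1:i2], b[j1:j2]) for tag, i1, i2, j1, j2 in ops
            if tag != "equal"]


s = "where is the big hotel"
s_prime, t_prime = most_similar_example(s)
print(s_prime, "->", t_prime)   # where is the hotel -> 旅馆在哪里
print(differences(s_prime, s))  # [('insert', [], ['big'])]
```

The edit list shows which fragments of the input are not covered by the retrieved example. A fuller system would translate those fragments and splice them into T', which is the "imitation" step described above.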

Example-based MT has the following advantages:

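Easy system maintenance. Knowledge in the system exists in forms such as translation examples and semantic dictionaries, so the system can easily be extended by adding examples and vocabulary.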

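High-quality translations are easy to produce. This is especially so when a large example base is used or when the input matches an example exactly; the method also avoids some of the deep linguistic analysis that traditional rule-based MT must perform.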

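Little language-specific knowledge is required. As long as the translation memory holds a sentence similar in surface form to the input, a match can be made. Knowledge acquired from a corpus is also relatively fine-grained, so it characterizes natural language more delicately, faithfully and accurately.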

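Because acquiring linguistic knowledge at scale is very costly, and because collections of morphological, syntactic and semantic rules can hardly be comprehensive, the performance of MT systems has long stagnated. Using existing bilingual corpus resources to supply experience for new translation needs is currently one important way to improve the quality of MT output. EBMT is markedly effective for translating identical or similar texts, and its benefit grows as the example base grows. For text already in the example base, a high-quality translation can be obtained directly. For text very similar to examples in the base, an approximate translation can be constructed by analogical reasoning, with minor modifications to the retrieved translation.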

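Example-based translation has many advantages, its implementations vary widely, and many aspects still have considerable untapped potential; it has been one of the hot topics in MT research in recent years. However, because corpus size is limited, example-based MT rarely achieves a high match rate. To date, therefore, few MT systems use a purely example-based method; it is generally used as one engine among several in a multi-engine system, to raise translation accuracy.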