Technology
   
IDC
International Data Corporation

SYSTRAN and the Reinvention of MT

Mary Flanagan, Steve McClure
IDCBulletin #26459 - Jan 2002

Table of Contents - Abstract - Document



IDC Opinion

Can SYSTRAN's new technology extend the perception of machine translation (MT) as a tool for more than simple gisting?

Yes. With its newly architected system, its persuasive hold on the multilingual Web-browsing market, and its successful incursions into the customer service area, SYSTRAN is poised to achieve a level of success that has not yet been realized by any other MT company and perhaps to redefine the field of uses for automatic translation solutions. SYSTRAN will need to move cautiously, given MT's history, and avoid the obvious pitfalls of unrealistic expectations and inappropriate applications for the technology. Having demonstrated the ability to navigate these challenges successfully thus far, there is every reason to expect SYSTRAN will succeed.

Audio Link

Click Here for Audio File

Marketing Evolution, Technology Revolution

The conventional wisdom among marketers of machine translation (MT) products is that MT is a technology best suited to creating rapid, rough draft translations. This application, known as "gisting" utilizes translation software for multilingual information scanning in situations where the importance of quick availability of information exceeds the need for precise translation. Gisting is a concept that has been used both to educate users to the potential of MT and to evangelize one of its broadest uses. Even discussing MT for high-quality translation has become taboo among MT marketers who fear associating themselves with fly-by-night vendors of word lookup systems who claim to translate among hundreds of language pairs with high accuracy.

But just as the conventional wisdom has become established, it is now being challenged by the most conventional of MT companies, SYSTRAN. The idea that MT can be used for high-quality translation is not new. In fact, the earliest researchers of MT at Georgetown University and IBM believed that the translation problem could be solved fully and quickly. MT would replace human translators and deliver fully automatic, high-quality translations at a fraction of the time and cost. Their naivete and underestimation of the enormous complexity of modeling human language is understandable. But early results disappointed U.S. government funders, leading to a near cessation of MT research for almost 15 years.

SYSTRAN's claim is less far-reaching than the claims of MT's pioneers. More importantly, it reflects the benefits of three decades of experience and a major technology rearchitecture. Nonetheless, it is a bold and perhaps risky move to challenge the prevailing assumption about MT. It is all the more surprising coming from a traditionalist company like SYSTRAN, known for its caution and conservatism. The claim is not an empty one; SYSTRAN can back it up with metrics and benchmarking of translation results. Users are engaged in the deployment process to establish acceptance at each phase of implementation before going forward.

The SYSTRAN Story

SYSTRAN has the longest history of any MT developer in the world. The company was founded by Dr. Peter Toma, one of the scientists who worked on the original U.S. government MT projects at Georgetown University. SYSTRAN was founded in 1968, and has an R&D investment measured in thousands of person-years. The resulting system is the Goliath of the MT world. SYSTRAN's dictionaries are enormous and diverse, it offers more language pairs than any other MT system (35) and has broad coverage of grammatical patterns across many text styles.

With such a long history, it is not surprising that SYSTRAN is viewed both within and outside of the MT community as a stable, respectable technology but not as a cutting edge innovator. However, SYSTRAN's enduring image does not reflect its current reality. The system has undergone a comprehensive redesign to bring its code and linguistic resources into step with current computing standards and to add many new cutting-edge linguistic resources. Unlike many of its competitors, SYSTRAN has managed to navigate the treacherous waters of Internet applications more successfully than any other MT provider. SYSTRAN has focused on multilingual Web browsing, and more recently on Web-based customer support. The deployments at Alta Vista, Google, Autodesk and others have been successful, while other Internet translation applications have failed to generate adequate audience and revenue. The demise of e-lingo, Wholetree, Lernout & Hauspie, and Logos are cases in point. More importantly, the success of these applications was an important step in building general acceptance of MT as a useful tool for rapid gist translation.

SYSTRAN's rebirth may be a unique event in the MT industry. Mature MT systems are tremendously complex with deeply intradependent code. As a result, modernizing these systems is often prohibitively costly. Quick fixes tend to have limited impact, and regression is a constant danger because of the system's complexity and dependencies within the code. The corporate culture of an established company can also serve as a barrier to significant change. Entrenched patterns of problem solving and utilization of staff tend to preserve the status quo. In sum, transforming an old MT system can seem as elusive a goal as finding the fountain of youth. Yet, SYSTRAN seems to have done just that with its newly released and redesigned system.

From Old World to New World

Building the Team

The redesign plan was a shared vision of SYSTRAN CEO Dimitris Sabatakakis and Pierre-Yves Foucou, the company's CTO. An entire development team was assembled in Paris, 6,000 miles away from SYSTRAN's original development site in La Jolla, California. Although much of SYSTRAN's development historically was conducted in the La Jolla (and more recently San Diego), the true command and control center for the company is now in Paris where Sabatakakis resides.

Staff functions in San Diego have been gradually shifted toward engineering aspects of system development, though a number of long-time SYSTRAN linguists still work at the San Diego site. Linguistic development for the new generation system, however, is being carried out primarily in Paris. The team of software engineers and computational linguists is led by Foucou, a computational linguist and former professor of linguistics and computer science at the University of Paris. The development effort is part of the European Union's Matchpad project, a European Union-funded effort to extend availability of MT systems for language pairs involving Polish and Hungarian.

Foucou's charter was to reengineer SYSTRAN toward several ambitious goals: to improve maintenance and scalability in the face of ever-growing linguistic resources, to improve efficiency of access to linguistic resources, to increase modularity, and to support emerging exchange standards.

Sabatakakis also recognized that the growing presence of MT on the Internet would create requirements for many additional language pairs, and that they would be required quickly. The typical development cycles of 2-5 years for a new language pair wouldn't suffice for fast-paced Internet deployments. Making the development of new languages a quicker process would be essential to keep SYSTRAN competitive. And along with the growing MT opportunity, components of SYSTRAN's technology might be reusable for other natural language applications such as content management and multilingual indexing if SYSTRAN's resources were modular enough to extract and adapt. The resulting system entails changes to all of these areas and is without doubt the most ambitious redesign of an MT system ever undertaken.

The New SYSTRAN Technology

SYSTRAN's redesign is comprehensive. It dramatically alters the system's basic structure and characteristics and introduces many new components as well. The extensive list of linguistic resources available within the SYSTRAN product allows vast customization potential. Users can customize their SYSTRAN application according to their text quality and type, their computing environment, required languages, and numerous other variables. Customization is essential to producing high-quality machine translation results. Simply used out of the box, MT software tends to have limited success because its knowledge bases are not equipped with the terminology and information needed for the subject area.

Modularity

Modularity contributes to ease of maintenance and reusability of resources and was thus an essential goal for SYSTRAN, whose linguistic resources are extensive. The redesign has modularized the code so that the output of each module is independent and can be used for external purposes as well as for input to the subsequent module.

SYSTRAN's former approach was monolithic, with a unique program for each language pair. The rearchitecture of components has created independent modules with more complex relationships. Modules exchange information to build the most relevant context. Another advantage of this approach is that modules of different generations can coexist. The newer modules have been designed to access SYSTRAN's knowledge bases, while the older modules apply very refined grammatical phenomena.

Foucou adds:

SYSTRAN has more resources than common hardware can support. MT design needs to integrate this constraint to optimize embedded knowledge to meet linguistics requirements. We compile resources into finite-state data structures to maximize efficiency. Future MT technology will have to aggregate multiple components into a multiagent architecture that is able to compute parallel results and find the most relevant translation among dozens or hundreds of alternatives.

Finite State Technology

Perhaps the most distinctive change is the effective use of finite state technology at many levels of the system. The hallmark of finite state technology is efficiency, and, thus, the approach is used in applications such as indexing documents for search and retrieval and spelling checkers because it can provide a constant access time to any record in a database regardless of the size of the resource. Finite state methods allow SYSTRAN to maintain its high performance despite its enormous linguistic and lexical resources. SYSTRAN has embedded finite state technology at a number of levels of its system, including morphology, conceptual description, and transfer dictionary encoding.

Dictionary Access

SYSTRAN's exhaustive dictionaries are one of its most valuable assets. But managing million-entry knowledge bases poses challenges for scaling, access, and management of duplication. An extremely robust database is needed to accommodate its dictionaries, which average 200,000 entries for European languages, while allowing for continued growth. Compounding this challenge is the fact that the system must be able to support thousands of lookups per second as the translation program iteratively analyzes word forms and attempts to locate the root form in the dictionary.

Duplication is also a problem because SYSTRAN, like most MT systems, uses bilingual dictionaries. For example, the English-to-French system will have a very similar but not identical dictionary to the English-to-Spanish system. Most dictionary entries have multiple targets in different languages. The dictionaries are compiled at runtime to minimize the demand on hardware resources. With 35 language pairs, the amount of duplication across dictionaries is enormous.

Having duplication can also create consistency problems when the same English term is coded with different grammatical tags in two different bilingual dictionaries. SYSTRAN's rearchitecture attacks the scaling, access, and duplication problems by introducing monolingual dictionaries. The monolingual dictionaries are maintained in addition to bilingual dictionaries and contain both simple and compound entries. The monolingual dictionary factorizes complex entries to a single access point via the headword. For example, "pilote de course automobile" (race car driver) is indexed on "pilote." At the subsequent level of description, only the additional information is encoded, reducing redundancy. The second-level dictionary is also generated at runtime from the first structure, improving efficiency.

Declarativity

Perhaps the most sweeping change to SYSTRAN's code is its conversion to a declarative system. Declarative programming is an innovation of the past decade and has largely replaced procedural programming, in which each minute step of the programming task is explicitly specified in the code. In a declarative system, the developer specifies the intended results of a programming task, typically using a graphical formalism that serves as a shorthand for describing the linguistic phenomenon. The details of how the task is conducted are implicit - they are defined by the system using the tools and resources that are made available to it. Nonetheless, the declarative approach cannot solve all linguistic processing problems. Some complex or idiosyncratic structures still require special processing.

Implicit Transfer

Transfer is a stage in the machine translation process in which the results of the analysis of the source language sentence are reordered according to a set of rules that embodies the structural relationship between the source and target language syntax. Transfer is a step carried out in so-called "transfer MT systems" such as SYSTRAN (other MT methods exist, but are beyond the scope of this bulletin). SYSTRAN has introduced implicit transfer methods into the redesigned system to simplify and speed the transfer process. The motivation for this is that some types of local expressions and verbal constructs have unique and complex internal structures and, thus, are hard to describe using transfer rules. Implicit transfer establishes parallel source and target descriptions for these phenomena, then aligns and generates a correct syntactic structure in the target based on the target description.

Exchange Format

Until recently, MT vendors had little interest in standardization. Their systems each utilized unique methods of description for language phenomena. This information was carefully protected and treated as proprietary trade secrets. As natural language applications become more numerous and diverse, the need for standardization is becoming evident, both as a way to facilitate exchange among natural language applications and as part of the gradual mainstreaming of MT within the software world.

SYSTRAN is developing a filter that provides full support of XML exchange format. The task is not a simple one because it requires defactorization of graphed entries and explication of implicit transfer patterns. Preserving the organization of the information is one of the biggest challenges. Nevertheless, SYSTRAN needs to forge ahead with this effort to enable its resources to be exported and to permit importing of external resources, such as glossaries, into the system.

NLP Components

Although many of SYSTRAN's natural language processing components are shared by all of its language systems, their modularity allows the user to create a customized environment best suited to the translation need, audience, and text type. The components include the following:

  • Document filter for separating text and formatting codes
  • Encoding and character set converter for interpreting common character encoding formats
  • Language recognizer for identifying the source language of the text
  • Preprocessor for identifying document types, such as chat, email, or structured text
  • Spell checker to perform spelling correction for misspelled items (Misspellings would otherwise go unrecognized by the system, sometimes resulting in adverse impact to translation.)
  • Sentence segmenter for dividing the text into sentences
  • Word delimiter to identify word boundaries for languages where blank spaces are not inserted between words
  • Lemmatizer, a tool for identifying and creating the variant forms of a word (e.g., develop, developing, and developed)
  • Part-of-speech tagger for identifying the grammatical function of each word in the sentence (e.g., noun, verb, and adjective)
  • Text synthesizer for production of the correct word forms in the target language
  • Semantic domain recognizer to identify the subject area of the text so that appropriate knowledge bases can be employed

SYSTRAN's resources also include a tool set designed originally for quality assurance tasks. The tools are useful in the deployment process to assess quality levels and determine the characteristics of the source texts. The tool set includes a concordancer, terminology extractor, and tools for measuring the quality and consistency of translations and of custom resources, such as dictionaries.

Risks and Possibilities

To remain a leader, SYSTRAN will need to preserve both its strong output quality and high-speed performance after the redesign. Although this has always been true, it will be all the more critical now that SYSTRAN's marketing message is beginning to target higher-quality applications.

Early results suggest that the system operates more efficiently than before and it produces greater throughput without loss of output quality. Preserving these advantages is important because SYSTRAN is facing emerging competition. Example-based and hybrid systems are in the works at a number of universities in the United States, Europe, and Asia, including USC's Information Sciences Institute and New York University.

These systems can have shorter development times than traditional approaches to MT because translation rules are generated automatically based on analysis of bilingual corpora. Although their development timetables are shorter, they are by no means short, and building an example-based MT system has its own unique set of pitfalls, such as the difficulty of finding aligned bilingual corpora from which to draw the examples.

Creating a robust, high-performance, fault-tolerant translation system to compete with SYSTRAN also requires substantial engineering resources. So, while the development timetable for new systems is shorter, it is by no means a trivial task, and SYSTRAN's position as an entrenched leader will not be easy to upset.

Another risk for SYSTRAN is the continued bad press that MT receives when linguistically naive users deploy MT applications. The level of education about MT's capabilities is very low in the United States, although it is somewhat better in Europe and Asia. The typical American user has little acquaintance with translation software, other languages, or the challenges and issues of translation. This leads to a tendency to oversimplify the translation task and assume that it can be performed perfectly by MT. This assumption, in turn, leads to failed deployments because the MT system cannot meet the expectations of the user.

In some environments, naivete has been coupled with intentional derailing of MT by human translators. Although many human translators have come to understand that MT does not compete with them, purists continue to point to MT's obvious foibles as evidence that it is not usable for any translation requirement. Combined with unrealistic expectations, this viewpoint is almost always lethal for MT applications. Prevention is the key, and SYSTRAN, as well as other MT developers, will have to continue to work hard to educate users and tune its translation services to the particular needs of its customer before the service is released for production.

MT companies have reduced the marketing hype that fueled unrealistic expectations, however, they still persist and only broad use and familiarity with MT will completely do away with them. Herein lies the catch-22 of MT - customers need to use MT to truly understand its value, but they must first understand what it does in order to use it successfully. Gradually, this barricade is eroding, and continued successful deployments of MT will eventually unleash the tantalizing and enormous potential that everyone in the industry can see but none have yet realized.

Although the risks that SYSTRAN faces are serious, the possibilities are compelling. The new SYSTRAN is a robust, modular natural language analysis and generation system with deep lexical development in many domains and a highly customizable set of natural language processing tools. While MT has been on the fringes of success for many years, less comprehensive natural language applications are seeing some real interest and success. Natural language techniques, such as morphological analysis, semantic networks, noun phrase identification, and text normalization, have been introduced into content management, search, and cross-lingual information retrieval with success. SYSTRAN's linguistic resources are unparalleled among commercial MT systems. The company can make incursions into these other areas of text analysis if its resources are modular, efficient, and exchangeable.

In particular, the content management industry is ripe for machine translation as the balance of languages on the Internet shifts away from English as the majority. In fact, IDC has recently published a study (Internet Commerce Market Model version 7.3, 2002) demonstrating that Internet users in Western Europe now surpass the number of U.S. users (see Figure 1). For information-based businesses, content is a corporate asset that must be carefully managed. SYSTRAN customizes its technology for content management to help customers structure their content, distribute it more broadly, create abstracts, and manage terminology. The MT market is maturing but slowly, and it may turn out that content management applications are the growth engine for SYSTRAN in the near term.

A Changing Competitive Arena

The 18-month period between mid-2000 and the present has been more eventful for MT than the entire previous decade. The collapse of Lernout & Hauspie, its spin-off of Sail Labs, the series of failed acquisition attempts of the Barcelona technology, SYSTRAN's Internet deployments, the release of IBM's WebSphere translation server, the demise of Logos, and the acquisition of MT by localization companies such as SDL and Bowne Global Solutions are only a few of the events of the period. The landscape of the MT world is radically changed after decades of stability and, in some cases, stagnation.

SYSTRAN's rearchitecture introduces another change to the MT landscape. Although none of the previous generation of MT systems has completely disappeared, the lineup of MT systems that are still being marketed for enterprise, Internet, and retail applications has been reduced at least temporarily because MT systems are being acquired by globalization and localization companies.

The sale of the Transcend technology to SDL in February was a bellwether of things to come. The Barcelona system was acquired by Bowne Global Solutions, and Lionbridge has partnered with Sail Labs to deploy, develop, and comarket NLP technologies and services. SDL and Bowne Global Solutions appear to have plans for continued marketing of systems to external users, although not necessarily in the retail "shrink-wrap" market. Lionbridge's intentions are less clear.

Figure 1 - Worldwide Internet Users and eCommerce Revenue, 2001
SYSTRAN and the Reinvention of MT

Source: IDC's Internet Commerce Market Model version 7.3, 2002

However, few localization companies have the specialized staff to develop and maintain MT systems. Although some of the staff of the former MT vendors may move with the technology, a slow start seems likely since the acquiring companies will need time to understand and assimilate the new technology, and define new business goals.

The buy-up of systems by localization companies leaves just three independent MT developer/vendors as potential competitors for SYSTRAN: IBM, Sail Labs, and LogoVista. IBM's WebSphere Translation Server, released in January 2001, is the result of many years of linguistic research at the company's Watson Laboratories. With 12 language pairs and a robust architecture, the system is a respectable competitor, though its linguistic resources do not yet match SYSTRAN's. However, IBM has concentrated exclusively on the Enterprise model, marketing its technology for in-house use by corporations with internal translation needs.

Sail Labs' business is primarily Europe centric. The company has substantial linguistic technology and a robust MT product, Comprendium, but most of its revenue is from consulting. IDC expects that to change as Sail Labs refocuses its business, now that it has freed itself of its ties to Lernout & Hauspie.

The LogoVista system for Japanese and Spanish is developed by Language Engineering Corporation (LEC). Several other licensed language pairs are marketed under the LogoVista name. LogoVista is a popular and respected system in Japan, where it is the market leader for Web-browsing applications. The technology has had a lower profile in the United States due largely to its focus on Japanese. LogoVista's recent licensing of several additional European and Asian language pairs will expand its presence in the United States and Europe. Although the company has been most successful in the Web-browsing application, it offers enterprise and desktop translators as well. LogoVista has a modern code base and produces high-quality MT. However, the company will have to rely on the licensors of its European language pairs to make innovations to the translation technology. It will be interesting to observe what market niches the company pursues beyond multilingual Web browsing.

In addition to the independents, there are numerous emerging systems, many are university based. None are positioned to unseat SYSTRAN at the moment, though some promising technologies using example-based and hybrid techniques are approaching commercialization.

Conclusion

With its newly architected system, its persuasive hold on the multilingual Web-browsing market, and its successful incursions into the customer service area, SYSTRAN is poised to achieve a level of success that has not yet been realized by any other MT company and perhaps to redefine the field of uses for automatic translation solutions.

SYSTRAN's stability as a 35-year-old independent MT developer can be leveraged in the current environment of upheaval among MT companies. The company has an opportunity to secure a position in more niches while its competitors adjust to their various transitions. Doing so, however, will require adequate staffing resources and a tolerance for risk that is not characteristic of the company historically. But, the fact that SYSTRAN opted to incur the risk and cost of modernizing its system suggests that the outlook within the company is as altered as its technology. Regardless of whether it expands the focus, SYSTRAN can become a very successful MT provider simply by owning the two market niches it already is in. SYSTRAN will need to move cautiously, given MT's history, and avoid the obvious pitfalls of unrealistic expectations and inappropriate applications for the technology. Having demonstrated the ability to navigate these challenges successfully thus far, there is every reason to expect SYSTRAN will succeed.


Table of Contents - Abstract - Document


Quoting IDC Information and Data: Internal Documents and Presentations - Quoting individual sentences and paragraphs for use in your company's internal communications does not require permission from IDC. The use of large portions or the reproduction of any IDC document in its entirety does require prior written approval and may involve some financial consideration. External Publication - Any IDC Information that is to be used in advertising, press releases, or promotional materials requires prior written approval from the appropriate IDC Vice President or Country Manager. A draft of the proposed document should accompany any such request

Copyright 1994-2002 International Data Corporation.
Reproduction without written permission is completely forbidden.
For copies please contact Cheryl Toffel, (508) 935-4389
 

Copyright © 2005 SYSTRAN S.A. All rights reserved.
Legal Notices | Contact SYSTRAN