The METEOR MT Evaluation System, Version 0.6
May 3rd, 2007

Abhaya Agarwal (abhayaa AT cs dot cmu dot edu)
Satanjeev "Bano" Banerjee (satanjeev AT cmu.edu)
Alon Lavie (alavie AT cs.cmu.edu)

Carnegie Mellon University
Pittsburgh, PA, USA


1. Introduction:
================

METEOR is a system that automatically evaluates the output of machine
translation engines by comparing them to one or more reference
translations. For a given pair of hypothesis and reference strings, the
evaluation proceeds in a sequence of stages, with different criteria being
used at each stage to find and score unigram matches. By default, the first
stage detects all exact matches between the two strings, while the second
stage stems the words left unmatched in the first stage using the Porter
stemmer and then finds matches between these stemmed words.

The matching system is written in Perl, and each matching stage is
implemented as a separate Perl module. In addition to the two default
matching modules (exact matching and stemmed matching), a WordNet based
stemmed matching module and a WordNet based synonym matching module are
also provided with this distribution. METEOR can be run with the default
modules, or the user can override the defaults and use one or more of the
given modules in any order of preference. Further, the user can write his
own matching module and plug it into the generic matching system.

METEOR's input file format is exactly the same as that used by BLEU and
NIST's Machine Translation Evaluation system. Thus all translation data
that can be evaluated using BLEU (such as the TIDES data) can also be
directly evaluated using METEOR.

METEOR supports evaluation of MT output in languages other than English.
Currently supported languages are French, German, Spanish and Czech.
Details are provided in the next section.


2. How to Run METEOR:
=====================

The default way to run the system is as follows:

perl meteor.pl -s <system name> -t <test file> -r <reference file>

For example, using the sample files included with this distribution, you
can run as follows:

perl meteor.pl -s SAMPLE_SYSTEM -t test.sgm -r ref.sgm

By default:

* The language is assumed to be English.
* The modules being run are "exact" and "porter_stem", in that order.
* The only output is an aggregate score printed on the standard output.

The meteor.pl program needs to access the various Perl modules (.pm files).
The simplest way to do this is to always run meteor.pl from the current
directory and to keep the modules in that same directory. If, however, you
need to put the program and the Perl modules in a directory other than your
current working directory, you have two options:

1. Either copy the .pm files into Perl's standard library directories.
   Typically these are within Perl's install directory: Perl/lib and
   Perl/site/lib.

2. If you can't or don't want to put the .pm files in the standard Perl
   library directories, you can put them in a different directory and set
   the environment variable "PERL5LIB" to the path to that directory.

This has been tested to work on machines running Windows XP/Perl 5.8.3 and
Fedora Core 5/Perl 5.8.8.

METEOR can be run with various options, which are described in the
following sections.

Languages: --lang
-----------------

METEOR supports evaluation of MT output in the following languages:

Language       Available Modules
Czech (cs)     (exact)
English (en)   (all)
French (fr)    (exact, porter_stem)
German (de)    (exact, porter_stem)
Spanish (es)   (exact, porter_stem)

Input is assumed to be in UTF-8 encoding.
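For example, a hypothetical invocation for French output might look like
the following. The system name and file names here are placeholders, and it
is assumed that --lang accepts the two-letter codes listed in the table
above; only modules listed for the chosen language should be requested (for
French these are exact and porter_stem, which are also the defaults):

perl meteor.pl -s FRENCH_SYSTEM -t test-fr.sgm -r ref-fr.sgm --lang fr --modules "exact porter_stem"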
Modules: --modules
------------------

METEOR currently supports 4 modules:

exact         matching using surface forms
porter_stem   matching using stems obtained from the Porter stemmer
wn_stem       matching using stems obtained from the WordNet stemmer
wn_synonymy   matching based on synonyms obtained from WordNet

Not all modules are available for all languages. The Languages section
above lists the modules that are available for each supported language. You
can change which modules the matcher runs with and the order in which they
are applied. Some of the modules require external packages to be installed,
as detailed below.

exact:

To perform only exact matching, run the program as follows:

perl meteor.pl -s SAMPLE_SYSTEM -t test.sgm -r ref.sgm --modules "exact"

porter_stem:

To use porter_stem, you need the following software:

* Lingua Snowball stemmers (can be downloaded from
  http://search.cpan.org/~creamyg/Lingua-Stem-Snowball-0.941/)

Once it is installed, to use only porter_stem, run the system as follows:

perl meteor.pl -s SAMPLE_SYSTEM -t test.sgm -r ref.sgm --modules "porter_stem"

wn_stem, wn_synonymy:

The WordNet based modules (wn_stem and wn_synonymy) require the following
two pieces of software and have been tested with the versions mentioned
alongside:

* WordNet version 2.1 (not higher) (can be downloaded from:
  http://wordnet.princeton.edu)
* WordNet::QueryData 1.45 (can be downloaded from:
  http://search.cpan.org/dist/WordNet-QueryData)

Once these pieces of software are installed in their default locations, you
can use all the modules like so:

perl meteor.pl -s SAMPLE_SYSTEM -t test.sgm -r ref.sgm --modules "exact porter_stem wn_synonymy"

Note: When using the wn_synonymy module, using wn_stem is not necessary.

Stop List: --stop
-----------------

The words in the stop list file are used as stop words, which are removed
from both the hypothesis and reference strings. For example, if you have
your own stop list in a file called "custom_stop_list.txt", run the system
as follows:

perl meteor.pl -s SAMPLE_SYSTEM -t test.sgm -r ref.sgm --stop custom_stop_list.txt

Normalization: --keepPunctuation
--------------------------------

By default, everything non-alphanumeric is removed from the data during
normalization. However, if --keepPunctuation is specified, punctuation is
retained during normalization. This is closer to the normalization used in
the BLEU and NIST MTEVAL scripts. Be aware, however, that this can
noticeably increase the runtime if the data contains long sentences and a
lot of punctuation. In our experiments, no significant increase in
correlation with human judgments was observed when using this option.


3. Input/Output Format of METEOR:
=================================

All input is assumed to be in UTF-8 encoding, and all output is also
generated in UTF-8 encoding.

Input Formats: [--nBest]
------------------------

METEOR now supports two input formats.

Default:

The default, as mentioned above, is the same format used by the BLEU and
NIST evaluation scripts. In this mode, METEOR expects one hypothesis per
"segment", but can handle multiple reference translations for each segment.
Segments should be grouped into documents. METEOR will sequentially compare
the hypothesis for each segment of each document in the test file with all
the references for the corresponding segment in the corresponding document
in the reference file. For each such comparison, METEOR calculates the
number of matches found and the number of chunks found. Once all segments
have been scored, overall statistics are output, including overall
precision, recall, f1, fmean, penalty and score. For details about these
metrics, please refer to [Banerjee & Lavie, 2005] and [Lavie et al., 2004].
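As a rough illustration, a test file in this default (BLEU/NIST-style)
format groups hypothesis segments into documents along the lines sketched
below. The set, document and system identifiers here are invented, and the
element and attribute names follow the general NIST mteval convention; the
sample test.sgm and ref.sgm files included with this distribution show the
exact markup METEOR expects, with the reference file laid out analogously
(one document block per reference translation):

<tstset setid="sample_set" srclang="src" trglang="en">
  <doc docid="doc.01" sysid="SAMPLE_SYSTEM">
    <seg id="1"> hypothesis translation for the first segment </seg>
    <seg id="2"> hypothesis translation for the second segment </seg>
  </doc>
</tstset>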
N-Best:

In this mode, METEOR accepts one or more translation hypotheses per segment
and computes the statistics described above for each one of them. A sample
n-best test file is included with the distribution to demonstrate the
expected format. Note that in this case no aggregate statistics are
computed for the whole system.

Output Formats: [--outFile [--plainOutput]]
-------------------------------------------

By default, METEOR only prints the overall system stats to the standard
output. If an output file is specified, it writes out a copy of the input
test file with each segment annotated with its corresponding METEOR score
and the id of the reference with which that score was obtained.

--plainOutput:

With this option, the output file will contain only the scores, one per
line, in the following format:

<document id>::<segment id>[::N-bestRank] score

This is useful when another script has to process the scores generated by
METEOR. Sample output files generated by METEOR are present in the
sample-input-output directory.


4. The mStageMatcher Module:
============================

METEOR performs its stage-by-stage matching using the mStageMatcher.pm Perl
module. Specifically, the METEOR program itself simply reads in all the
inputs (command line arguments, hypothesis/reference translations, etc.)
and feeds each hypothesis/reference pair to mStageMatcher.pm, which
performs the stage-by-stage matching algorithm using the input (or default)
matching modules. For details about the matching algorithm, please refer to
[Banerjee & Lavie, 2005].

The mStageMatcher module can be used outside of METEOR. This distribution
provides the standAloneMatcher.pl program, which shows how to use
mStageMatcher.pm outside METEOR. Read README_mStageMatcher.txt for more
details.


5. Licensing:
=============

METEOR is distributed under the following license:

License Start:

Carnegie Mellon University
Copyright (c) 2004
All Rights Reserved.

Permission is hereby granted, free of charge, to use and distribute this
software and its documentation without restriction, including without
limitation the rights to use, copy, modify, merge, publish, distribute,
sublicense, and/or sell copies of this work, and to permit persons to whom
this work is furnished to do so, subject to the following conditions:

1. The code must retain the above copyright notice, this list of conditions
   and the following disclaimer.
2. Any modifications must be clearly marked as such.
3. Original authors' names are not deleted.
4. The authors' names are not used to endorse or promote products derived
   from this software without specific prior written permission.

CARNEGIE MELLON UNIVERSITY AND THE CONTRIBUTORS TO THIS WORK DISCLAIM ALL
WARRANTIES WITH REGARD TO THIS SOFTWARE, INCLUDING ALL IMPLIED WARRANTIES
OF MERCHANTABILITY AND FITNESS, IN NO EVENT SHALL CARNEGIE MELLON
UNIVERSITY NOR THE CONTRIBUTORS BE LIABLE FOR ANY SPECIAL, INDIRECT OR
CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE,
DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER
TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR
PERFORMANCE OF THIS SOFTWARE.

Author: Satanjeev "Bano" Banerjee satanjeev AT cmu.edu
Author: Alon Lavie alavie AT cs.cmu.edu
Author: Abhaya Agarwal abhayaa AT cs.cmu.edu

License End.
6. Acknowledgements:
====================

The following researchers have contributed to the implementation of the
METEOR system (all at Carnegie Mellon University):

Rachel Reynolds
Kenji Sagae
Jeremy Naman
Shyamsundar Jayaraman


7. References:
==============

[Lavie & Agarwal, 2007] Lavie, A. and A. Agarwal, "METEOR: An Automatic
Metric for MT Evaluation with High Levels of Correlation with Human
Judgments", to appear in Proceedings of the Workshop on Statistical Machine
Translation at the 45th Annual Meeting of the Association for Computational
Linguistics (ACL-2007), Prague, June 2007.

[Banerjee & Lavie, 2005] Banerjee, S. and A. Lavie, "METEOR: An Automatic
Metric for MT Evaluation with Improved Correlation with Human Judgments",
Proceedings of the Workshop on Intrinsic and Extrinsic Evaluation Measures
for MT and/or Summarization at the 43rd Annual Meeting of the Association
for Computational Linguistics (ACL-2005), Ann Arbor, Michigan, June 2005.

[Lavie et al., 2004] Lavie, A., K. Sagae and S. Jayaraman, "The Significance
of Recall in Automatic Metrics for MT Evaluation", Proceedings of the 6th
Conference of the Association for Machine Translation in the Americas
(AMTA-2004), Washington, DC, September 2004.