The Multi-Stage Matcher, Version 0.7
April 1, 2005

Satanjeev "Bano" Banerjee (satanjeev AT cmu.edu)
Alon Lavie (alavie AT cs.cmu.edu)

Carnegie Mellon University
Pittsburgh, PA, USA


1. Introduction
===============

This software takes two strings of space-separated words as input and
aligns matching words between the two strings. Alignment is done over
several stages, where each stage uses a different criterion to find
candidate matching tokens from the two strings to align. Supported
criteria are "exact", "porter_stem", "wn_stem" and "wn_synonymy"
(details below).


2. Code Organization
====================

The software is organized in modules. The overall algorithm is divided
into two main parts: the matching algorithm, which returns candidate
matches between tokens in the two strings, and the aligning algorithm.

There are several implemented matching algorithms, each in a Perl
module of its own:

exact.pm: Returns tokens from the two strings that are exact matches
of each other.

porter_stem.pm: Returns tokens from the two strings that match each
other after being stemmed using the Porter stemmer.

wn_stem.pm: Same as porter_stem, but stemming is done using WordNet.

wn_synonymy.pm: Returns, for each token in the second string, the
first token (if any), going left to right in the first string, such
that the two tokens share at least one synset in WordNet.

Given candidate matches between tokens in the two strings, the
algorithm that actually constructs an alignment between the two
strings is implemented in the Perl module mStageMatcher.pm.

The program standAloneMatcher.pl uses mStageMatcher.pm to match and
align two sentences contained in an input text file. This program
shows how to use mStageMatcher.pm from inside a program.


3. How to Run standAloneMatcher.pl
==================================

One or more of the matching modules may be used in any order to run
the program.
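
As a rough illustration of why the order of modules can matter, the
following is a minimal Python sketch of staged matching. It is not the
package's Perl code. It assumes that each stage only considers tokens
left unmatched by earlier stages, which this README does not state
explicitly. The names crude_stem, MATCHERS and staged_match are
hypothetical, and crude_stem is a toy suffix stripper, not the Porter
stemmer. The sketch pairs tokens greedily from left to right and does
not compute flips or chunks.

```python
def crude_stem(token):
    """Hypothetical stand-in for a real stemmer: strips a few suffixes."""
    for suffix in ("ing", "ed", "es", "s"):
        # Keep at least three characters so short words stay intact.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


# Each matching criterion maps a token to the key it is compared on.
MATCHERS = {
    "exact": lambda token: token,
    "crude_stem": crude_stem,
}


def staged_match(first, second, stages):
    """Pair tokens of `second` with tokens of `first`, one stage at a time.

    Assumption: a token matched in an earlier stage is not reconsidered
    in a later one. Returns (pairs, counts), where pairs holds
    (index_in_first, index_in_second, stage_name) tuples and counts holds
    the number of pairs made in each stage.
    """
    first_tokens = first.split()
    second_tokens = second.split()
    used_first, used_second = set(), set()
    pairs, counts = [], []
    for stage in stages:
        key = MATCHERS[stage]
        matched = 0
        for j, tok2 in enumerate(second_tokens):
            if j in used_second:
                continue
            # Take the first free token, left to right, in the first string.
            for i, tok1 in enumerate(first_tokens):
                if i not in used_first and key(tok1) == key(tok2):
                    used_first.add(i)
                    used_second.add(j)
                    pairs.append((i, j, stage))
                    matched += 1
                    break
        counts.append(matched)
    return pairs, counts


if __name__ == "__main__":
    first = "the cats were sleeping"
    second = "the cat sleeps"
    print(staged_match(first, second, ["exact", "crude_stem"]))
    print(staged_match(first, second, ["crude_stem", "exact"]))
```

With the exact stage first, "the" is credited to the exact stage and
the remaining two pairs to the stemming stage. With the stemming stage
first, all three pairs are credited to it and the exact stage finds
nothing left to match.
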
To run the program with only exact match, run it like so:

    perl standAloneMatcher.pl input.txt exact

The input file (input.txt in the above example) should have the two
strings of words, each on a line of its own - the second string will
be aligned to the first one.

The output format is as follows:

    Line 1: (# of stages)
    Line 2: (# of matched words in stage 1) (# of flips needed in aligning words in stage 1)
    Line 3: (# of matched words in stage 2) (# of flips needed in aligning words in stage 2)
    .
    .
    .
    Line n: (# of chunks) (average chunk length)

To run the program with only porter stemming, run it like so:

    perl standAloneMatcher.pl input.txt porter_stem

To run it with first the exact and then the wn_stem, run it like so:

    perl standAloneMatcher.pl input.txt exact wn_stem

Use the --details flag to get the actual final alignment:

    perl standAloneMatcher.pl --details input.txt

[Note: The WordNet loading module takes a once-per-program loading
time of about 3 seconds on a 2.4 GHz 1GB RAM machine].


4. Licensing:
=============

METEOR is distributed under the following license:

License Start:

Carnegie Mellon University
Copyright (c) 2004
All Rights Reserved.

Permission is hereby granted, free of charge, to use and distribute
this software and its documentation without restriction, including
without limitation the rights to use, copy, modify, merge, publish,
distribute, sublicense, and/or sell copies of this work, and to
permit persons to whom this work is furnished to do so, subject to
the following conditions:

 1. The code must retain the above copyright notice, this list of
    conditions and the following disclaimer.
 2. Any modifications must be clearly marked as such.
 3. Original authors' names are not deleted.
 4. The authors' names are not used to endorse or promote products
    derived from this software without specific prior written
    permission.

CARNEGIE MELLON UNIVERSITY AND THE CONTRIBUTORS TO THIS WORK
DISCLAIM ALL WARRANTIES WITH REGARD TO THIS SOFTWARE, INCLUDING
ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS, IN NO EVENT
SHALL CARNEGIE MELLON UNIVERSITY NOR THE CONTRIBUTORS BE LIABLE
FOR ANY SPECIAL, INDIRECT OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES
WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN
AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION,
ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF
THIS SOFTWARE.

Author: Satanjeev "Bano" Banerjee satanjeev@cmu.edu
Author: Alon Lavie alavie@cs.cmu.edu

License End.