TRANSLATOR

These are the linguistic data for the Apertium English--Catalan machine translator. 
You need apertium-2.0 and lttoolbox-2.0 to use this translator.


To compile the linguistical data simply do:

$ ./configure

to generate a Makefile file and then

$ make

inside of this directory.


TAGGER 

To use this language-pair package with apertium YOU DO NOT NEED TO
RETRAIN THE TAGGER. Probabilities and auxiliary data are provided for
both the en-ca and the ca-en translation directions which should be
acceptable for most applications, and should work even if you change
the dictionaries in a reasonably way.

If for some reason you need to retrain the tagger (for example, you
have made really extensive changes to the dictionaries such as
creating new lexical categories), you have three alternatives:

* To perform a supervised training:

  To this end tagged corpora is provided, but tagged corpora
  (en-tagger-data/en.tagged and ca-tagger-data/ca.tagged) could be
  obsolete for some words. If this is the case, the tagger training 
  program  will show you where the problems are and you will need 
  to solve them by hand. Be sure to solve the problems by modifying 
  ONLY the .tagged file, NEVER the .untagged file that is 
  automatically generated.

  The supervised training is done by typing: 

  make -f en-ca-supervised.make (for the English part-of-speech tagger)
  make -f ca-en-supervised.make (for the Catalan part-of-speech tagger)

* To perform an unsupervised training:

  For this purpose you will need to assemble a large (hundreds of
  thousand of words) plain-text corpus for each language (for example,
  using a robot to harvest text from online newspapers) and put them in
  the proper place, for instance en-tagger-data/en.crp.txt and
  ca-tagger-data/ca.crp.txt. This type of training does not need human
  intervention but, as expected, results will be less adequate than
  those obtained with the supervised training.

  The unsupervised training is done through the iterative Baum-Welch
  algorithm. By default the number of iterations is set to 8, but you
  can change this value by editing the Makefile and changing the
  value of TAGGER_UNSUPERVISED_ITERATIONS.

  The unsupervised training is done by typing:

  make -f en-ca-unsupervised.make (for the English part-of-speech tagger)
  make -f ca-en-unsupervised.make (for the Catalan part-of-speech tagger)

  This is the training method followed to train the Catalan and the English
  part-of-speech taggers.


* To perform an unsupervised training by using target-language
  information and the rest of the modules of the Apertium MT engine:

  To do so you need large plain-text corpora on both languages. Please
  download the apertium-tagger-training-tools package and follow the
  instructions provided there.