I need to implement a "quick and dirty" or "poor man's" translation of a foreign-language document
by looking up each word in a dictionary and replacing it with its English translation. So
what I need is to tokenize the original foreign text into words, then access each word in turn,
look it up, and get its translation. However, if possible, I also need to preserve "non-words"
(i.e. stopwords) so that I can replicate them in the output stream without translating them. If
the latter is not possible, then I just need to preserve the order of the original words so
that their translations appear in the same order in the output.
Can I accomplish this using Lucene components? I presume I'd have to start by creating an
analyzer for the foreign language, but then what? How do I (i) tokenize, (ii) access the words
in their original order, and (iii) also access the non-words, if possible?
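
To make the question more concrete, here is a rough sketch of what I've pieced together so far
from the Lucene docs, using TokenStream with CharTermAttribute and OffsetAttribute, and a plain
in-memory Map standing in for the dictionary (the Map, the field name, and the fall-back behavior
are just my own assumptions). Does this look like the right approach, or is there a more
idiomatic way?

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;

import java.io.IOException;
import java.util.Map;

public class PoorMansTranslator {

    private final Analyzer analyzer;        // analyzer for the source language
    private final Map<String, String> dict; // hypothetical word-for-word dictionary

    public PoorMansTranslator(Analyzer analyzer, Map<String, String> dict) {
        this.analyzer = analyzer;
        this.dict = dict;
    }

    public String translate(String text) throws IOException {
        StringBuilder out = new StringBuilder();
        int lastEnd = 0; // end offset of the previous token in the original text

        try (TokenStream ts = analyzer.tokenStream("contents", text)) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            OffsetAttribute offsets = ts.addAttribute(OffsetAttribute.class);

            ts.reset();
            while (ts.incrementToken()) {
                // Copy whatever lies between the previous token and this one
                // (whitespace, punctuation, anything the analyzer dropped).
                out.append(text, lastEnd, offsets.startOffset());

                // Replace the token with its translation; keep the original
                // word if it is not in the dictionary.
                String word = term.toString();
                out.append(dict.getOrDefault(word, word));

                lastEnd = offsets.endOffset();
            }
            ts.end();
        }
        // Copy any trailing material after the last token.
        out.append(text.substring(lastEnd));
        return out.toString();
    }
}

My (possibly wrong) understanding is that if the analyzer includes a stop filter, the stopwords
never show up as tokens, so they would fall into the "gap" between consecutive token offsets and
be copied through verbatim, which is exactly what I want. But I'm not sure whether that holds for
every analyzer, e.g. ones whose char filters change the offsets relative to the original text.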
Thanks much
Ilya Zavorin