lucene-java-user mailing list archives

From Ilya Zavorin
Subject tokenizing text using language analyzer but preserving stopwords if possible
Date Tue, 06 Dec 2011 22:41:00 GMT
I need to implement a "quick and dirty" (or "poor man's") translation of a foreign-language
document by looking up each word in a dictionary and replacing it with its English translation.
So I need to tokenize the original foreign text into words, then access each word in turn,
look it up, and get its translation. If possible, I also need to preserve "non-words",
i.e. stopwords, so that I can copy them to the output stream without translating them. If
that is not possible, I at least need to preserve the order of the original words so
that their translations appear in the same order in the output.

Can I accomplish this using Lucene components? I presume I'd have to start by creating an
analyzer for the foreign language, but then what? How do I (i) tokenize, (ii) access words
in the correct order, (iii) also access non-words if possible?
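
For illustration, here is a minimal sketch of the word-by-word substitution described above. It does not use Lucene: it swaps in the JDK's java.text.BreakIterator as a stand-in tokenizer, and the class name WordByWord, the translate method, and the tiny French dictionary are all hypothetical. Every segment the iterator returns (word, space, or punctuation) is visited in its original order, and anything not found in the dictionary, including stopwords, is copied through unchanged. With Lucene, the analogous loop would presumably read terms and character offsets from a TokenStream (CharTermAttribute and OffsetAttribute) and copy the gaps between offsets, but that is an assumption and is not shown here.

```java
import java.text.BreakIterator;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

final class WordByWord {

    // Replaces each segment found in the dictionary with its translation and
    // copies every other segment (unknown words, stopwords, whitespace,
    // punctuation) to the output unchanged, preserving the original order.
    static String translate(String text, Map<String, String> dictionary, Locale locale) {
        BreakIterator words = BreakIterator.getWordInstance(locale);
        words.setText(text);
        StringBuilder out = new StringBuilder();
        int start = words.first();
        for (int end = words.next(); end != BreakIterator.DONE; start = end, end = words.next()) {
            String segment = text.substring(start, end);
            // Dictionary keys are assumed to be lowercase.
            String translation = dictionary.get(segment.toLowerCase(locale));
            out.append(translation != null ? translation : segment);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Hypothetical two-entry dictionary, for demonstration only.
        Map<String, String> dictionary = new HashMap<>();
        dictionary.put("chat", "cat");
        dictionary.put("noir", "black");
        System.out.println(translate("Le chat noir, et le chien.", dictionary, Locale.FRENCH));
    }
}
```

Because the loop walks boundaries rather than filtered tokens, nothing is dropped from the input, so the question of recovering stopwords does not arise in this sketch. A language analyzer that removes stopwords would need to be configured with an empty stopword list, or the skipped text recovered from token offsets.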

Thanks much

Ilya Zavorin
