lucene-java-user mailing list archives

From Avi Rosenschein <arosensch...@gmail.com>
Subject Re: tokenizing text using language analyzer but preserving stopwords if possible
Date Wed, 07 Dec 2011 11:27:55 GMT
On Wed, Dec 7, 2011 at 00:41, Ilya Zavorin <izavorin@caci.com> wrote:

> I need to implement a "quick and dirty" or "poor man's" translation of a
> foreign language document by looking up each word in a dictionary and
> replacing it with the English translation. So what I need is to tokenize
> the original foreign text into words and then access each word, look it up
> and get its translation. However, if possible, I also need to preserve
> "non-words", i.e. stopwords so that I could replicate them in the output
> stream without translating. If the latter is not possible then I just need
> to preserve the order of the original words so that their translations have
> the same order in the output.
>
> Can I accomplish this using Lucene components? I presume I'd have to start
> by creating an analyzer for the foreign language, but then what? How do I
> (i) tokenize, (ii) access words in the correct order, (iii) also access
> non-words if possible?
>

You can always use something like StandardAnalyzer, or an analyzer for the
specific language, with an empty stopword list (so that no words are treated
as stopwords). Tokens come out in the order they appear in the text.
Punctuation might be a bit trickier: depending on the analyzer, you might be
able to get punctuation marks to parse as separate tokens.
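
To illustrate the idea of walking the text in order, translating words and
copying everything else through, here is a minimal sketch. Note that it does
not use Lucene: it uses the JDK's java.text.BreakIterator as a stand-in
tokenizer so that it runs without extra dependencies. The class name
PoorMansTranslator, the helper isWord, and the tiny French dictionary are all
hypothetical, made up for this example. A Lucene token stream could
presumably be walked in a similar loop, but that is an assumption and is not
shown here.

```java
import java.text.BreakIterator;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical example class; not part of Lucene.
public class PoorMansTranslator {

    // Walks the text segment by segment, in original order. Segments that
    // contain a letter or digit are looked up in the dictionary; everything
    // else (whitespace, punctuation) is copied through unchanged.
    static String translate(String text, Map<String, String> dictionary, Locale locale) {
        BreakIterator words = BreakIterator.getWordInstance(locale);
        words.setText(text);
        StringBuilder out = new StringBuilder();
        int start = words.first();
        for (int end = words.next(); end != BreakIterator.DONE; start = end, end = words.next()) {
            String segment = text.substring(start, end);
            if (isWord(segment)) {
                // Look up the lowercased form; keep the original word if
                // the dictionary has no entry for it.
                String key = segment.toLowerCase(locale);
                out.append(dictionary.getOrDefault(key, segment));
            } else {
                out.append(segment);
            }
        }
        return out.toString();
    }

    // Treats a segment as a word if it has at least one letter or digit.
    static boolean isWord(String segment) {
        return segment.codePoints().anyMatch(Character::isLetterOrDigit);
    }

    public static void main(String[] args) {
        // Toy dictionary, assumed for illustration only.
        Map<String, String> dictionary = new HashMap<>();
        dictionary.put("le", "the");
        dictionary.put("chat", "cat");
        dictionary.put("noir", "black");
        System.out.println(translate("Le chat noir, le chat!", dictionary, Locale.FRENCH));
    }
}
```

Since nothing is dropped as a stopword, every word reaches the dictionary
lookup, and words without an entry are simply echoed as they are. That covers
the "replicate them in the output stream without translating" case.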

-- Avi


>
> Thanks much
>
>
> Ilya Zavorin
>
>
>
