lucene-java-user mailing list archives

From KARTHIK SHIVAKUMAR <nskarthi...@gmail.com>
Subject Re: tokenizing text using language analyzer but preserving stopwords if possible
Date Sun, 11 Dec 2011 13:38:22 GMT
Hi

>> tokenize the original foreign text into words

You first need to identify the appropriate analyzer for the foreign
language before indexing.
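
To make the word-by-word idea concrete, here is a rough sketch of the
lookup-and-replace loop. Note that it does not use a Lucene analyzer at all:
as a stand-in it uses the JDK's java.text.BreakIterator, which returns every
segment of the text (words, spaces, punctuation) in order. The class name
PoorMansTranslator, the translate method and the tiny dictionary are made up
for illustration. Words missing from the dictionary, stopwords included, are
copied through unchanged.

```java
import java.text.BreakIterator;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of Lucene: a sketch of dictionary-based
// word replacement that keeps word order and all non-word text.
final class PoorMansTranslator {

    static String translate(String text, Map<String, String> dictionary, Locale locale) {
        BreakIterator words = BreakIterator.getWordInstance(locale);
        words.setText(text);
        StringBuilder out = new StringBuilder();
        int start = words.first();
        // Walk the boundaries in order; each [start, end) span is either a
        // word or a run of whitespace/punctuation.
        for (int end = words.next(); end != BreakIterator.DONE; start = end, end = words.next()) {
            String segment = text.substring(start, end);
            if (Character.isLetterOrDigit(segment.codePointAt(0))) {
                // A word: look it up in lower case; keep the original if unknown.
                String translation = dictionary.get(segment.toLowerCase(locale));
                out.append(translation != null ? translation : segment);
            } else {
                // Whitespace or punctuation: copy through untouched.
                out.append(segment);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Assumed toy dictionary with lower-case keys.
        Map<String, String> dictionary = Map.of("le", "the", "chat", "cat", "noir", "black");
        System.out.println(translate("Le chat, noir!", dictionary, Locale.FRENCH));
    }
}
```

With a Lucene analyzer the same loop should be possible by reading each
token's start and end offsets (OffsetAttribute) and copying the text between
consecutive tokens from the original string, but I have not tried that here.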


with regards
karthik


On Wed, Dec 7, 2011 at 4:57 PM, Avi Rosenschein <arosenschein@gmail.com> wrote:

> On Wed, Dec 7, 2011 at 00:41, Ilya Zavorin <izavorin@caci.com> wrote:
>
> > I need to implement a "quick and dirty" or "poor man's" translation of a
> > foreign language document by looking up each word in a dictionary and
> > replacing it with the English translation. So what I need is to tokenize
> > the original foreign text into words and then access each word, look it up
> > and get its translation. However, if possible, I also need to preserve
> > "non-words", i.e. stopwords so that I could replicate them in the output
> > stream without translating. If the latter is not possible then I just need
> > to preserve the order of the original words so that their translations have
> > the same order in the output.
> >
> > Can I accomplish this using Lucene components? I presume I'd have to start
> > by creating an analyzer for the foreign language, but then what? How do I
> > (i) tokenize, (ii) access words in the correct order, (iii) also access
> > non-words if possible?
> >
>
> You can always use something like StandardAnalyzer for the specific
> language, with an empty stopword list (so that no words are treated as
> stopwords). Punctuation might be a bit trickier: depending on the
> analyzer, you might be able to get it to parse as separate tokens.
>
> -- Avi
>
>
> >
> > Thanks much
> >
> >
> > Ilya Zavorin
> >
> >
> >
>



-- 
*N.S.KARTHIK
R.M.S.COLONY
BEHIND BANK OF INDIA
R.M.V 2ND STAGE
BANGALORE
560094*
