lucene-java-user mailing list archives

From Robert Muir <rcm...@gmail.com>
Subject Re: Question wrt Lucene analyzer for different language
Date Thu, 14 May 2009 16:22:19 GMT
I would say in general, yes.

When I say 'change Arabic text', I mean the Arabic analyzer will standardize
and stem Arabic words, but it won't modify any of your English words.

And no, there is no case in Arabic. This is why, if you are handling mixed
Arabic/English text, I recommend creating a custom analyzer that does some
basics with the English part as well, such as LowerCaseFilter.
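
To illustrate the point about case, here is a minimal sketch in plain Java
(standard library only, no Lucene classes). The class name MixedCaseDemo and
its helper methods are hypothetical and only stand in for what a lowercasing
filter does to each token: English letters are folded to lowercase, while
Arabic letters have no case and pass through unchanged.

```java
import java.util.Locale;

class MixedCaseDemo {
    // Hypothetical helper: lowercase a token the way a lowercasing
    // token filter would, independent of the default locale.
    static String lowercase(String token) {
        return token.toLowerCase(Locale.ROOT);
    }

    // Hypothetical helper: true if any character in the token has distinct
    // upper- and lowercase forms, i.e. the script distinguishes case.
    static boolean hasCase(String token) {
        for (int i = 0; i < token.length(); i++) {
            char c = token.charAt(i);
            if (Character.toUpperCase(c) != Character.toLowerCase(c)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String english = "Lucene";
        String arabic = "\u0643\u062A\u0627\u0628"; // Arabic word for "book"

        System.out.println(lowercase(english));               // prints "lucene"
        System.out.println(lowercase(arabic).equals(arabic)); // prints "true"
        System.out.println(hasCase(english));                 // prints "true"
        System.out.println(hasCase(arabic));                  // prints "false"
    }
}
```

So adding a lowercasing step to an Arabic analysis chain should only affect
the English tokens; the Arabic tokens are left for the Arabic-specific
filters to handle.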

On Thu, May 14, 2009 at 11:26 AM, weidong sun <lmcwesu@gmail.com> wrote:

> Thanks for the quick answer. :-)
>
> So can I say that, in general, ArabicAnalyzer can tokenize mixed content
> containing both Arabic and English? :-)
>
> I am not really familiar with the Arabic language. What do you mean by
> "change Arabic tokens"? Does Arabic have something like upper/lower case,
> as English does?
>
>
> On Thu, May 14, 2009 at 10:47 AM, Robert Muir <rcmuir@gmail.com> wrote:
>
> > In the case of ArabicAnalyzer, it will only change Arabic tokens and will
> > leave English words as-is (it will not convert them to lowercase or
> > anything like that).
> >
> > So if you want good Arabic and English behavior, you would want to
> > create a custom analyzer that looks like ArabicAnalyzer but also invokes
> > LowerCaseFilter, perhaps also an English stemmer, and so on.
> >
> > On Thu, May 14, 2009 at 10:11 AM, weidong sun <lmcwesu@gmail.com> wrote:
> >
> > > Hello,
> > >
> > > I am a newbie in the Lucene world. I might ask some obvious questions
> > > to which, unfortunately, I don't know the answers. Please help me 'grow'.
> > >
> > > We have a project that intends to use the Lucene search engine to search
> > > user info stored in our system. The user info might not be in English,
> > > even though it will be stored in UTF-8 encoding.
> > >
> > > My question is: if I use a Lucene analyzer for a particular language
> > > other than English (e.g. ChineseAnalyzer or ArabicAnalyzer), will it
> > > still handle the content correctly if the user info is mixed with
> > > English characters/words?
> > >
> > > I would really appreciate any answers.
> > >
> > > :-)



-- 
Robert Muir
rcmuir@gmail.com
