lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From weidong sun <lmcw...@gmail.com>
Subject Re: Question wrt Lucene analyzer for different language
Date Thu, 14 May 2009 15:18:52 GMT
Thanks for the suprising quick response. :-)

What I mean "correctly" here is that the specific analyzer can tokenize a
text mixed with English and that sepcfic langauge, for example, "12345 ????"
or "????Text???" (where '?' is a character of that specific language and
"12345" and "Text" is english character) to have "12345" and "Text" treated
as token and indexed as well.
BTW, I don't see a needs for stemming it by far since the information our
project encountered is just user's profile info.

For the perticular ChineseAnalyzer,  Can it do that?


On Thu, May 14, 2009 at 10:37 AM, Erick Erickson <erickerickson@gmail.com>wrote:

> No. What is "correctly"? Are you stemming? in which case using thesame
> analyzer on different languages will not work.
>
> This topic have been discussed on the user list frequently, so if you
> searched
> that archive (see: http://wiki.apache.org/lucene-java/MailingListArchives)
> you'd find a wealth of information quickly...
>
> Best
> Erick
>
> On Thu, May 14, 2009 at 10:11 AM, weidong sun <lmcwesu@gmail.com> wrote:
>
> > Hello,
> >
> > I am a newbie in Lucene world. I might ask some obvious question which
> > unfortunately I don't know the answer. Please help me 'grow'.
> >
> > We have a project intend to use Lucene search engine for search some
> user's
> > info stored our system. The user info might not be in English even it
> will
> > be stored in UTF-8 encoding.
> >
> > My question is, if I use one particular Lucene analyzer for a language
> > other
> > than English (e.g. ChineseAnalyzer or ArabicAnalyzer), can it still able
> to
> > handle it correctly if user info is mixed with English character/word?
> >
> > Really appreciated with any answers.
> >
> > :-)
> >
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message