lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From weidong sun <lmcw...@gmail.com>
Subject Re: Question wrt Lucene analyzer for different language
Date Thu, 14 May 2009 15:52:26 GMT
Because the exactly same reason, our assumption is that users' profile
information is mixed with only that particular language and English. And
we'll have to use the analyzer for that particular language to do the
indexing and searching. And that's the reason why I asked this analyzer
question. :-)

On Thu, May 14, 2009 at 11:27 AM, Uwe Schindler <uwe@thetaphi.de> wrote:

> There are two problems:
>
> a) Currently there is no such analyzer (I have the problem, too, I would
> also like to autodetect the language from a text like M$ Word does and
> switch the analyzers).
> b) If such an autodetect analyzer exists, you will have a problem on the
> searching side, because you should almost always use the same anylyzer on
> the indexing and search side. The problem is that search queries are
> normally very short and autodetection is hardly possible. If somebody now
> enters something parsed by the query parser using this
> auto-language-analyzer, the detection may fail and the wrongly-stemmed
> analyzer tokens will hit no results.
>
> Uwe
>
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: uwe@thetaphi.de
>
> > -----Original Message-----
> > From: weidong sun [mailto:lmcwesu@gmail.com]
> > Sent: Thursday, May 14, 2009 5:19 PM
> > To: java-user@lucene.apache.org
> > Subject: Re: Question wrt Lucene analyzer for different language
> >
> > Thanks for the suprising quick response. :-)
> >
> > What I mean "correctly" here is that the specific analyzer can tokenize a
> > text mixed with English and that sepcfic langauge, for example, "12345
> > ????"
> > or "????Text???" (where '?' is a character of that specific language and
> > "12345" and "Text" is english character) to have "12345" and "Text"
> > treated
> > as token and indexed as well.
> > BTW, I don't see a needs for stemming it by far since the information our
> > project encountered is just user's profile info.
> >
> > For the perticular ChineseAnalyzer,  Can it do that?
> >
> >
> > On Thu, May 14, 2009 at 10:37 AM, Erick Erickson
> > <erickerickson@gmail.com>wrote:
> >
> > > No. What is "correctly"? Are you stemming? in which case using thesame
> > > analyzer on different languages will not work.
> > >
> > > This topic have been discussed on the user list frequently, so if you
> > > searched
> > > that archive (see: http://wiki.apache.org/lucene-
> > java/MailingListArchives)
> > > you'd find a wealth of information quickly...
> > >
> > > Best
> > > Erick
> > >
> > > On Thu, May 14, 2009 at 10:11 AM, weidong sun <lmcwesu@gmail.com>
> wrote:
> > >
> > > > Hello,
> > > >
> > > > I am a newbie in Lucene world. I might ask some obvious question
> which
> > > > unfortunately I don't know the answer. Please help me 'grow'.
> > > >
> > > > We have a project intend to use Lucene search engine for search some
> > > user's
> > > > info stored our system. The user info might not be in English even it
> > > will
> > > > be stored in UTF-8 encoding.
> > > >
> > > > My question is, if I use one particular Lucene analyzer for a
> language
> > > > other
> > > > than English (e.g. ChineseAnalyzer or ArabicAnalyzer), can it still
> > able
> > > to
> > > > handle it correctly if user info is mixed with English
> character/word?
> > > >
> > > > Really appreciated with any answers.
> > > >
> > > > :-)
> > > >
> > >
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message