lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Uwe Schindler" <...@thetaphi.de>
Subject RE: Question wrt Lucene analyzer for different language
Date Thu, 14 May 2009 15:27:15 GMT
There are two problems:

a) Currently there is no such analyzer (I have the problem, too, I would
also like to autodetect the language from a text like M$ Word does and
switch the analyzers).
b) If such an autodetect analyzer exists, you will have a problem on the
searching side, because you should almost always use the same anylyzer on
the indexing and search side. The problem is that search queries are
normally very short and autodetection is hardly possible. If somebody now
enters something parsed by the query parser using this
auto-language-analyzer, the detection may fail and the wrongly-stemmed
analyzer tokens will hit no results.

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de

> -----Original Message-----
> From: weidong sun [mailto:lmcwesu@gmail.com]
> Sent: Thursday, May 14, 2009 5:19 PM
> To: java-user@lucene.apache.org
> Subject: Re: Question wrt Lucene analyzer for different language
> 
> Thanks for the suprising quick response. :-)
> 
> What I mean "correctly" here is that the specific analyzer can tokenize a
> text mixed with English and that sepcfic langauge, for example, "12345
> ????"
> or "????Text???" (where '?' is a character of that specific language and
> "12345" and "Text" is english character) to have "12345" and "Text"
> treated
> as token and indexed as well.
> BTW, I don't see a needs for stemming it by far since the information our
> project encountered is just user's profile info.
> 
> For the perticular ChineseAnalyzer,  Can it do that?
> 
> 
> On Thu, May 14, 2009 at 10:37 AM, Erick Erickson
> <erickerickson@gmail.com>wrote:
> 
> > No. What is "correctly"? Are you stemming? in which case using thesame
> > analyzer on different languages will not work.
> >
> > This topic have been discussed on the user list frequently, so if you
> > searched
> > that archive (see: http://wiki.apache.org/lucene-
> java/MailingListArchives)
> > you'd find a wealth of information quickly...
> >
> > Best
> > Erick
> >
> > On Thu, May 14, 2009 at 10:11 AM, weidong sun <lmcwesu@gmail.com> wrote:
> >
> > > Hello,
> > >
> > > I am a newbie in Lucene world. I might ask some obvious question which
> > > unfortunately I don't know the answer. Please help me 'grow'.
> > >
> > > We have a project intend to use Lucene search engine for search some
> > user's
> > > info stored our system. The user info might not be in English even it
> > will
> > > be stored in UTF-8 encoding.
> > >
> > > My question is, if I use one particular Lucene analyzer for a language
> > > other
> > > than English (e.g. ChineseAnalyzer or ArabicAnalyzer), can it still
> able
> > to
> > > handle it correctly if user info is mixed with English character/word?
> > >
> > > Really appreciated with any answers.
> > >
> > > :-)
> > >
> >


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message