lucene-java-user mailing list archives

From Erick Erickson <erickerick...@gmail.com>
Subject Re: Which analyzer to use for non-english unicoded text?
Date Sun, 24 May 2009 17:14:53 GMT
I don't think there's anything you can use out of the box, but if you
search the mailing list (see the searchable archives) for a thread
titled "Hebrew and Hindi analyzers" you might find something
useful.

Not much help I know, but perhaps a place to start.

And yes, you should use the same analyzer for indexing and
searching if at all possible. The reason is that the job of an
analyzer is to break the incoming stream up into meaningful
units (usually words). You wouldn't want to index with an analyzer
that, say, removes stopwords and then search with a different analyzer
that did NOT remove stopwords (or lowercase, or stem, or...).
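
To make the mismatch concrete, here's a minimal sketch in plain Java. It
deliberately uses no Lucene classes: ToyAnalyzer, LOWERCASE_STOP,
WHITESPACE_ONLY and the tiny stopword list are all made up for
illustration, and the "index" is just a set of terms. Treat it as a
picture of the idea, not of how Lucene does it internally.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class AnalyzerMismatchDemo {
    // Hypothetical stand-in for an analyzer; NOT a Lucene interface.
    public interface ToyAnalyzer {
        List<String> analyze(String text);
    }

    // Illustrative stopword list only.
    public static final Set<String> STOPWORDS = Set.of("the", "a", "of");

    // Splits on whitespace, lowercases, and drops stopwords.
    public static final ToyAnalyzer LOWERCASE_STOP = text -> {
        List<String> out = new ArrayList<>();
        for (String tok : text.trim().split("\\s+")) {
            String t = tok.toLowerCase(Locale.ROOT);
            if (!t.isEmpty() && !STOPWORDS.contains(t)) {
                out.add(t);
            }
        }
        return out;
    };

    // Splits on whitespace and does nothing else.
    public static final ToyAnalyzer WHITESPACE_ONLY = text -> {
        List<String> out = new ArrayList<>();
        for (String tok : text.trim().split("\\s+")) {
            if (!tok.isEmpty()) {
                out.add(tok);
            }
        }
        return out;
    };

    // "Indexing" here just records the set of terms the analyzer emits.
    public static Set<String> index(String doc, ToyAnalyzer analyzer) {
        return new HashSet<>(analyzer.analyze(doc));
    }

    // A query matches if it yields at least one term and all of its
    // terms are present in the index.
    public static boolean matches(Set<String> index, String query, ToyAnalyzer analyzer) {
        List<String> terms = analyzer.analyze(query);
        return !terms.isEmpty() && index.containsAll(terms);
    }

    public static void main(String[] args) {
        Set<String> idx = index("The History of Lucene", LOWERCASE_STOP);
        // Same analyzer at search time: the query reduces to "history".
        System.out.println(matches(idx, "The History", LOWERCASE_STOP));  // true
        // Different analyzer: the query stays "The", "History"; neither is indexed.
        System.out.println(matches(idx, "The History", WHITESPACE_ONLY)); // false
    }
}
```

The index only ever contains what the indexing analyzer emitted, so a
query analyzed some other way can end up asking for terms that were
never stored.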

And certainly many people have indexed and searched non-English
documents, and many have contributed the resultant
Analyzers back to the Lucene community. If you find that you have to
write your own, please consider contributing.

HTH
Erick


On Sat, May 23, 2009 at 2:23 AM, KK <dioxide.software@gmail.com> wrote:

> Hi All,
> I've been trying to index some non-English [Indian languages] text in unicode
> utf-8. For all these languages we don't have any stemmers or tokenizers etc.
> To keep the searching simple I'd like to be able to do exact word
> searches/matches as a first step. I'd like to know which will be the
> simplest yet working analyzer to use for both indexing as well as
> searching [lucene wiki says both should be same, else you might not get
> search results, right?]
>
> Many people must have indexed non-English text for which there
> are no standard analyzers. I request them to give me ideas on this. Along
> with this I would also like to do hit highlighting irrespective of
> language. Ideas on this will be equally helpful.
>
> Is simpleAnalyzer() good enough for indexing and searching?
>
> Thanks,
> KK
>
