lucene-java-user mailing list archives

From KK <dioxide.softw...@gmail.com>
Subject Re: Hindi, diacritics and search results
Date Mon, 13 Jul 2009 10:36:26 GMT
Apart from using WhitespaceAnalyzer, which tokenizes on whitespace only,
you can try writing a simple custom analyzer that does a bit more. I did
the following to handle Indic languages intermingled with English content:

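import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.PorterStemFilter;
import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;
// Copied from a Solr nightly build; adjust the package if you put it elsewhere.
import org.apache.solr.analysis.WordDelimiterFilter;

/**
 * Index-time analyzer for Indian-language text mixed with English.
 */
public class IndicAnalyzerIndex extends Analyzer {
    public TokenStream tokenStream(String fieldName, Reader reader) {
        // Split on whitespace only, so Indic words stay intact.
        TokenStream ts = new WhitespaceTokenizer(reader);
        /*
         * WordDelimiterFilter(ts, generateWordParts, generateNumberParts,
         *                     catenateWords, catenateNumbers, catenateAll)
         *
         * generateWordParts:   if 1, parts of words are generated:
         *                      "PowerShot" => "Power" "Shot"
         * generateNumberParts: if 1, number subwords are generated:
         *                      "500-42" => "500" "42"
         * catenateWords:       if 1, maximum runs of word parts are catenated:
         *                      "wi-fi" => "wifi"
         * catenateNumbers:     if 1, maximum runs of number parts are catenated:
         *                      "500-42" => "50042"
         * catenateAll:         if 1, all subword parts are catenated:
         *                      "wi-fi-4000" => "wifi4000"
         */
        ts = new WordDelimiterFilter(ts, 1, 1, 1, 1, 0);
        // Lowercase before the stop filter: this StopFilter constructor is
        // case-sensitive, so "The" would otherwise slip through.
        ts = new LowerCaseFilter(ts);
        ts = new StopFilter(ts, StopAnalyzer.ENGLISH_STOP_WORDS);
        ts = new PorterStemFilter(ts);
        return ts;
    }
}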

The above is for indexing. For querying, keep everything else the same and
use the following values for the WordDelimiterFilter constructor, which
turn off catenateWords and catenateNumbers:
ts = new WordDelimiterFilter(ts, 1, 1, 0, 0, 0);
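
To make the difference between the two flag settings concrete, here is a
rough plain-Java sketch of the splitting and catenation behaviour described
in the comment above. It is NOT Solr's WordDelimiterFilter: the class and
method names (WordPartsSketch, wordParts, analyze) are made up for
illustration, the single catenateRuns flag stands in for catenateWords and
catenateNumbers together, and the real filter also tracks token positions
and emits tokens in a different order.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified stand-in; not the Solr implementation.
final class WordPartsSketch {

    // Letters, digits and combining marks count as word characters,
    // everything else (e.g. '-') is treated as a delimiter.
    static boolean isWordChar(char c) {
        int type = Character.getType(c);
        return Character.isLetterOrDigit(c)
                || type == Character.NON_SPACING_MARK
                || type == Character.COMBINING_SPACING_MARK
                || type == Character.ENCLOSING_MARK;
    }

    // Split on delimiters, lower-to-upper case changes and letter/digit changes.
    static List<String> wordParts(String token) {
        List<String> parts = new ArrayList<>();
        StringBuilder cur = new StringBuilder();
        for (int i = 0; i < token.length(); i++) {
            char c = token.charAt(i);
            if (!isWordChar(c)) {
                flush(cur, parts);
                continue;
            }
            if (cur.length() > 0) {
                char prev = cur.charAt(cur.length() - 1);
                boolean caseChange = Character.isLowerCase(prev) && Character.isUpperCase(c);
                boolean typeChange = Character.isDigit(prev) != Character.isDigit(c);
                if (caseChange || typeChange) {
                    flush(cur, parts);
                }
            }
            cur.append(c);
        }
        flush(cur, parts);
        return parts;
    }

    private static void flush(StringBuilder cur, List<String> parts) {
        if (cur.length() > 0) {
            parts.add(cur.toString());
            cur.setLength(0);
        }
    }

    // catenateRuns = true mimics the index-side flags (1, 1, 1, 1, 0);
    // catenateRuns = false mimics the query-side flags (1, 1, 0, 0, 0).
    static List<String> analyze(String token, boolean catenateRuns) {
        List<String> parts = wordParts(token);
        List<String> out = new ArrayList<>(parts);
        if (!catenateRuns) {
            return out;
        }
        int start = 0;
        while (start < parts.size()) {
            boolean numeric = Character.isDigit(parts.get(start).charAt(0));
            int end = start;
            StringBuilder run = new StringBuilder();
            // Collect a maximal run of parts of the same kind (word or number).
            while (end < parts.size()
                    && Character.isDigit(parts.get(end).charAt(0)) == numeric) {
                run.append(parts.get(end));
                end++;
            }
            if (end - start > 1) {
                out.add(run.toString());
            }
            start = end;
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(analyze("wi-fi", true));   // index side
        System.out.println(analyze("wi-fi", false));  // query side
        System.out.println(analyze("wi-fi-4000", true));
    }
}
```

With this split, a document containing "wi-fi" is indexed under "wi", "fi"
and "wifi", while a query for "wi-fi" only produces "wi" and "fi", so
queries typed as "wi-fi", "wi fi" or "wifi" all have something to match.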

I pulled the WordDelimiterFilter class from a Solr nightly build, as nothing
like it is available in Lucene, AFAIK.

In my case it's working perfectly fine for all Indian languages mixed with
English content. As you can see, English text goes through the usual
stemming and stop-word removal. Try it out and do let us know if you face
any issues.
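
On the byte values quoted in the original question below: E0 A4 95 is the
UTF-8 encoding of U+0915 (DEVANAGARI LETTER KA) and E0 A5 87 is U+0947
(DEVANAGARI VOWEL SIGN E). The small JDK-only sketch below (the class name
DevanagariBytesSketch is just for illustration) prints both encodings. It
shows that the bare letter is a prefix of the letter plus vowel sign but is
not equal to it, so an analyzer that keeps each whitespace-separated word as
one token should not treat the two as the same term.

```java
import java.nio.charset.StandardCharsets;

// Illustrative only; uses nothing beyond the JDK.
final class DevanagariBytesSketch {

    // Render the UTF-8 bytes of a string as space-separated hex pairs.
    static String hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String ka = "\u0915";        // DEVANAGARI LETTER KA
        String ke = "\u0915\u0947";  // KA followed by DEVANAGARI VOWEL SIGN E

        System.out.println(hex(ka));           // E0 A4 95
        System.out.println(hex(ke));           // E0 A4 95 E0 A5 87
        System.out.println(ke.startsWith(ka)); // true: same leading letter
        System.out.println(ke.equals(ka));     // false: different tokens
    }
}
```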

Thanks,
KK.

On Sat, Jul 11, 2009 at 8:05 AM, Robert Muir <rcmuir@gmail.com> wrote:

> there is really no default in lucene
>
> a good start for hindi would be to try WhitespaceAnalyzer.
>
> On Fri, Jul 10, 2009 at 9:13 PM, OBender Hotmail<osya_bender@hotmail.com>
> wrote:
> > I'm using the default analyzer. Actually, the one that is set by default
> > by the Compass framework, but I assume it is the same one that Lucene
> > would use by default.
> > Which one should I use?
> >
> > -----Original Message-----
> > From: Robert Muir [mailto:rcmuir@gmail.com]
> > Sent: Friday, July 10, 2009 6:13 PM
> > To: java-user@lucene.apache.org
> > Subject: Re: Hindi, diacritics and search results
> >
> > Which analyzer in particular are you using?
> >
> > It's probably not doing what you want for Hindi. These "diacritics" are
> > important (vowels, etc).
> >
> >
> > On Fri, Jul 10, 2009 at 3:10 PM, OBender<osya_bender@hotmail.com> wrote:
> >> Hi All,
> >>
> >>
> >>
> >> I'm using the default setup of Lucene (no custom analyzers configured)
> >> and came across the following issue:
> >>
> >> In Hindi, if there is a letter with a diacritic in a phrase, Lucene will
> >> find the phrase with this letter even if the search string is for the
> >> letter without a diacritic.
> >>
> >> Is this expected behavior? Maybe this is standard for all languages
> >> with letters that have diacritics?
> >>
> >>
> >>
> >> From a pure byte standpoint I can see the logic: the letter with the
> >> diacritic takes 6 bytes (E0 A4 95 E0 A5 87) and the single letter takes
> >> 3 (E0 A4 95), so if I search for *some_letter*, where some_letter has
> >> code (E0 A4 95), Lucene finds the "phrase" (E0 A4 95 E0 A5 87) that
> >> includes that letter.
> >>
> >>
> >>
> >> Any comments much appreciated.
> >>
> >>
> >>
> >> Thanks.
> >>
> >>
> >>
> >>
> >
> >
> >
> > --
> > Robert Muir
> > rcmuir@gmail.com
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-user-help@lucene.apache.org
> >
>
>
>
> --
> Robert Muir
> rcmuir@gmail.com
