lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Robert Muir <rcm...@gmail.com>
Subject Re: Hindi, diacritics and search results
Date Fri, 10 Jul 2009 22:13:20 GMT
Which analyzer in particular are you using?

Its probably not doing what you want for hindi. These "diacritics" are
important (vowels, etc).


On Fri, Jul 10, 2009 at 3:10 PM, OBender<osya_bender@hotmail.com> wrote:
> Hi All,
>
>
>
> I'm using the default setup of lucene (no custom analyzers configured) and
> came across the following issue:
>
> In Hindi if there is a letter with a diacritic in a phrase lucene will find
> the phrase with this letter even if the search string is for the letter
> without a diacritics.
>
> Is this an expected behavior? Maybe this is standard for all languages with
> letters that have diacritics?
>
>
>
> From pure byte standpoint I can see the logic, the letter with diacritics
> takes 6 bytes (E0 A4 95 E0 A5 87) and the single letter takes  3 (E0 A4 95)
> so if I search for *some_letter* where some letter has code (E0 A4 95)
> lucene finds the "phrase" (E0 A4 95 E0 A5 87) that includes that letter.
>
>
>
> Any comments much appreciated.
>
>
>
> Thanks.
>
>
>
>



-- 
Robert Muir
rcmuir@gmail.com

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message