lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Yonik Seeley <yo...@lucidimagination.com>
Subject Re: How to support stemming and case folding for english content mixed with non-english content?
Date Tue, 09 Jun 2009 13:31:24 GMT
I just cut'n'pasted your word into Solr... it worked fine (it didn't
split the word).
Make sure you're using the latest from the trunk version of Solr...
this was fixed since 1.3

http://localhost:8983/solr/select?q=साल&debugQuery=true
[...]
<lst name="debug">
  <str name="rawquerystring">साल</str>
  <str name="querystring">साल</str>
  <str name="parsedquery">text:साल</str>
  <str name="parsedquery_toString">text:साल</str>

-Yonik


On Tue, Jun 9, 2009 at 7:48 AM, KK <dioxide.software@gmail.com> wrote:
> Hi Robert, I tried a sample code to check whats the reason. The
> worddelimiterfilter uses isLetter() method to tokenize, and for hindi words
> some parts of word are not actually letters but just part of the word[but
> that doesnot mean they can be used as word delimiters], since they are not
> letters isLetter() returns false and the word is getting breaked around
> that. This is some sample code with a hindi word pronounced saal[meaning
> year in english],
>
> import java.lang.String;
>
> public class HindiUnicodeTest {
>    public static void main(String args[]) {
>        String hindiStr = "साल";
>        int length = hindiStr.length();
>        System.out.println("str length " + length);
>        for (int i=0; i<length; i++) {
>            System.out.println(hindiStr.charAt(i) + " is " +
> Character.isLetter(hindiStr.charAt(i)));
>        }
>
>    }
> }
>
> Running this gives this output,
> str length 3
> स is true
> ा is false
> ल is true
>
> As you can see the second one is false, which says that it is not a letter
> but this makes worddelimiterfilter break/tokenize around the word. I even
> tried to use my custom parser[which I mentioned earlier] and tried to print
> the string that is the output after the query getting parsed, and what I
> found is that if I send the above hindi word then the query string after
> being parsed is something like this,
> Parsed Query string: स ल
> it essentialy removes the non-letter character[the second one], and it seems
> it treats them as separate and whenever thse two characters appear adjacent,
> they are in th top of result set, also whereever these two letters appers in
> the doc, it says they are part of the result set [and hence highlights
> them].
>
> I hope I made it clear. Do let me if some more information is required.
>
> Thanks,
> KK.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message