lucene-java-user mailing list archives

From KK <dioxide.softw...@gmail.com>
Subject Re: How to support stemming and case folding for english content mixed with non-english content?
Date Thu, 11 Jun 2009 09:04:24 GMT
Thank you very much, Yonik. I downloaded the latest Solr build, pulled in the
WordDelimiterFilter, and used it with the same options as the Solr defaults,
and it worked like a charm. Thanks to Robert also.
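For anyone who finds this thread later: the options I copied are, as far as I
remember, the ones the Solr example schema passes to WordDelimiterFilterFactory
(generateWordParts, generateNumberParts, catenateWords, catenateNumbers and
splitOnCaseChange on; catenateAll off). A rough standalone sketch of such a
chain is below. Note it is written against the later flag-based Lucene API
(org.apache.lucene.analysis.miscellaneous.WordDelimiterFilter and its
GENERATE_WORD_PARTS-style constants, Lucene 5/6-era packages), not the Solr
trunk class I actually used, so treat the packages, the constructor and the
demo class name as assumptions rather than what I compiled.

import java.io.StringReader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.miscellaneous.WordDelimiterFilter;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class WordDelimiterDemo {
    public static void main(String[] args) throws Exception {
        // Mirror the options of Solr's example "text" field:
        // generateWordParts=1, generateNumberParts=1, catenateWords=1,
        // catenateNumbers=1, catenateAll=0, splitOnCaseChange=1
        int flags = WordDelimiterFilter.GENERATE_WORD_PARTS
                  | WordDelimiterFilter.GENERATE_NUMBER_PARTS
                  | WordDelimiterFilter.CATENATE_WORDS
                  | WordDelimiterFilter.CATENATE_NUMBERS
                  | WordDelimiterFilter.SPLIT_ON_CASE_CHANGE;

        Tokenizer source = new WhitespaceTokenizer();
        source.setReader(new StringReader("Wi-Fi PowerShot साल"));
        // null = no protected-words set
        TokenStream ts = new WordDelimiterFilter(source, flags, null);

        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        ts.reset();
        while (ts.incrementToken()) {
            System.out.println(term.toString());
        }
        ts.end();
        ts.close();
    }
}

With the fixed filter I would expect "Wi-Fi" and "PowerShot" to be split (and
catenated back as WiFi/PowerShot), while साल should pass through as a single
token, matching what Yonik shows below.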

Thanks,
KK

On Tue, Jun 9, 2009 at 7:01 PM, Yonik Seeley <yonik@lucidimagination.com> wrote:

> I just cut'n'pasted your word into Solr... it worked fine (it didn't
> split the word).
> Make sure you're using the latest trunk version of Solr...
> this has been fixed since 1.3
>
> http://localhost:8983/solr/select?q=साल&debugQuery=true
> [...]
> <lst name="debug">
>  <str name="rawquerystring">साल</str>
>  <str name="querystring">साल</str>
>  <str name="parsedquery">text:साल</str>
>  <str name="parsedquery_toString">text:साल</str>
>
> -Yonik
>
>
> On Tue, Jun 9, 2009 at 7:48 AM, KK <dioxide.software@gmail.com> wrote:
> > Hi Robert, I tried some sample code to check what the reason is. The
> > WordDelimiterFilter uses the isLetter() method to tokenize, and for Hindi
> > words some parts of a word are not actually letters but are still part of
> > the word [which does not mean they can be used as word delimiters]. Since
> > they are not letters, isLetter() returns false and the word gets broken
> > around them. Here is some sample code with a Hindi word pronounced "saal"
> > [meaning "year" in English]:
> >
> > public class HindiUnicodeTest {
> >     public static void main(String args[]) {
> >         String hindiStr = "साल"; // three chars: स, ा (vowel sign), ल
> >         int length = hindiStr.length();
> >         System.out.println("str length " + length);
> >         // print whether each char is classified as a letter
> >         for (int i = 0; i < length; i++) {
> >             System.out.println(hindiStr.charAt(i) + " is "
> >                     + Character.isLetter(hindiStr.charAt(i)));
> >         }
> >     }
> > }
> >
> > Running this gives the following output:
> > str length 3
> > स is true
> > ा is false
> > ल is true
> >
> > As you can see, the second one is false, which says it is not a letter,
> > but this makes the WordDelimiterFilter break/tokenize around it. I also
> > tried my custom parser [which I mentioned earlier] and printed the string
> > that comes out after the query is parsed. What I found is that if I send
> > the above Hindi word, the parsed query string looks something like this:
> > Parsed Query string: स ल
> > It essentially removes the non-letter character [the second one] and
> > seems to treat the remaining two as separate terms: whenever these two
> > characters appear adjacent in a document it comes to the top of the
> > result set, and wherever these two letters appear in the doc they are
> > counted as part of the result set [and hence highlighted].
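> >
> > Incidentally, plain java.lang.Character can also say what that middle
> > character actually is: Character.getType() reports the vowel sign ा
> > (U+093E) as a NON_SPACING_MARK, i.e. a combining mark, which is exactly
> > the category a tokenizer would need to treat as part of the surrounding
> > word. A small sketch along those lines (JDK only, nothing Lucene-specific
> > assumed; the class name is just for illustration):
> >
> > public class HindiCharTypeTest {
> >     public static void main(String args[]) {
> >         String hindiStr = "साल";
> >         for (int i = 0; i < hindiStr.length(); i++) {
> >             char c = hindiStr.charAt(i);
> >             // A word-character test for Indic text could accept letters
> >             // OR combining marks such as Devanagari vowel signs.
> >             boolean partOfWord = Character.isLetter(c)
> >                     || Character.getType(c) == Character.NON_SPACING_MARK;
> >             System.out.println(c + " type=" + Character.getType(c)
> >                     + " partOfWord=" + partOfWord);
> >         }
> >     }
> > }
> >
> > I would expect type=6 (NON_SPACING_MARK) for ा and partOfWord=true for
> > all three characters.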
> >
> > I hope I made it clear. Do let me know if more information is required.
> >
> > Thanks,
> > KK.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>
