lucene-java-user mailing list archives

From KK <dioxide.softw...@gmail.com>
Subject Re: How to support stemming and case folding for english content mixed with non-english content?
Date Tue, 09 Jun 2009 11:48:50 GMT
Hi Robert, I tried some sample code to find the reason. WordDelimiterFilter
uses the isLetter() method when tokenizing, and in Hindi words some
characters are not letters in the Unicode sense, even though they are part
of the word [and must not be treated as word delimiters]. Since they are
not letters, isLetter() returns false and the word gets broken at those
characters. Here is some sample code using a Hindi word pronounced "saal"
[meaning "year" in English]:

public class HindiUnicodeTest {
    public static void main(String[] args) {
        String hindiStr = "साल"; // Hindi "saal", meaning "year"
        int length = hindiStr.length();
        System.out.println("str length " + length);
        // Check whether each UTF-16 code unit is classified as a letter
        for (int i = 0; i < length; i++) {
            System.out.println(hindiStr.charAt(i) + " is "
                    + Character.isLetter(hindiStr.charAt(i)));
        }
    }
}

Running this gives the following output:
str length 3
स is true
ा is false
ल is true

As you can see, the second character is reported as not being a letter,
and this makes WordDelimiterFilter break/tokenize inside the word. I also
tried my custom parser [which I mentioned earlier] and printed the query
string after parsing; if I send the above Hindi word, the parsed query
looks like this:
Parsed Query string: स ल
It essentially removes the non-letter character [the second one] and
treats the remaining two characters as separate terms: whenever these two
characters appear adjacent, the document goes to the top of the result
set, and wherever either of the two letters appears in a document it is
counted as a match [and hence highlighted].
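For reference, here is a small standalone experiment of mine (plain Java, not Lucene code; the class and method names are my own) which shows that the "non-letter" character is actually a Unicode combining vowel sign (a mark category), and that a split predicate which also accepts combining marks keeps the word intact:

```java
import java.util.ArrayList;
import java.util.List;

public class MarkAwareSplit {
    // A char counts as part of a word if it is a letter, a digit, or a
    // combining mark (e.g. Devanagari vowel signs such as U+093E).
    static boolean isWordChar(char c) {
        int type = Character.getType(c);
        return Character.isLetterOrDigit(c)
                || type == Character.NON_SPACING_MARK
                || type == Character.COMBINING_SPACING_MARK;
    }

    // Break text into tokens at any char that is not a word char.
    static List<String> split(String text) {
        List<String> tokens = new ArrayList<String>();
        StringBuilder current = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (isWordChar(c)) {
                current.append(c);
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) tokens.add(current.toString());
        return tokens;
    }

    public static void main(String[] args) {
        char matra = '\u093E'; // the second char of "साल"
        int type = Character.getType(matra);
        System.out.println(Character.isLetter(matra)); // false
        System.out.println(type == Character.NON_SPACING_MARK
                || type == Character.COMBINING_SPACING_MARK); // true: a mark
        System.out.println(split("साल saal.")); // [साल, saal] -- word stays whole
    }
}
```

With this predicate the word "साल" survives as one token, while punctuation still splits the English part.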

I hope I made it clear. Do let me know if more information is required.

Thanks,
KK.

On Mon, Jun 8, 2009 at 3:34 PM, Robert Muir <rcmuir@gmail.com> wrote:

> KK, can you give me an example of some Indian text for which it is doing
> this?
>
> Thanks!
>
> On Mon, Jun 8, 2009 at 1:03 AM, KK<dioxide.software@gmail.com> wrote:
> > Hi Robert,
> > The problem is that WordDelimiterFilter is doing its job for English
> > content, but for the non-English Indian content, which is Unicode, it
> > highlights not only the searched word but also the individual characters
> > of that word, which was not happening without WordDelimiterFilter;
> > that's my concern. Say, for example, I searched for a Hindi phrase
> > "xyz ab" [assume these are in Hindi]; then in the search results it
> > highlights those words, but it also highlights x/y/z/a/b wherever those
> > letters appear, which obviously looks bad. It should only highlight
> > words, not the letters therein. I hope I made it clear. What could be
> > the reason for this? Any ideas on fixing it?
> >
> > Thanks,
> > KK
> >
> > On Sat, Jun 6, 2009 at 9:45 PM, Robert Muir <rcmuir@gmail.com> wrote:
> >
> >> KK, I haven't had that experience with WordDelimiterFilter on Indian
> >> languages; could you provide me an example of how it's causing
> >> problems?
> >>
> >> On Sat, Jun 6, 2009 at 9:42 AM, KK<dioxide.software@gmail.com> wrote:
> >> > Robert, I tried to use WordDelimiterFilter from solr-nightly by
> >> > putting it in my working directory for this project, and I set the
> >> > parameters as you told me. I must admit that it is splitting words
> >> > around those chars [like . @ etc.], but along with that it is messing
> >> > with the other non-English/Unicode content, and that's causing
> >> > problems. I don't want WordDelimiterFilter to fiddle with my
> >> > non-English content.
> >> > This is what I'm doing:
> >> > /**
> >> >  * Analyzer for Indian language.
> >> >  */
> >> > public class IndicAnalyzer extends Analyzer {
> >> >  public TokenStream tokenStream(String fieldName, Reader reader) {
> >> >    TokenStream ts = new WhitespaceTokenizer(reader);
> >> >    ts = new WordDelimiterFilter(ts, 1, 1, 1, 1, 0);
> >> >    ts = new StopFilter(ts, StopAnalyzer.ENGLISH_STOP_WORDS);
> >> >    ts = new LowerCaseFilter(ts);
> >> >    ts = new PorterStemFilter(ts);
> >> >    return ts;
> >> >  }
> >> > }
> >> >
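One workaround I have been considering for the problem in the quoted mail above is to apply delimiter-style splitting only to pure-ASCII tokens and pass the Indic tokens through untouched. A rough standalone sketch of the idea (plain Java, not the real Lucene/Solr API; all names here are made up for illustration):

```java
import java.util.ArrayList;
import java.util.List;

public class ScriptRoutingSketch {
    // True if every char of the token is plain ASCII.
    static boolean isAscii(String token) {
        for (int i = 0; i < token.length(); i++) {
            if (token.charAt(i) > 127) return false;
        }
        return true;
    }

    // Split an ASCII token on non-alphanumeric chars, roughly the way
    // WordDelimiterFilter splits "user@example.com" into parts.
    static List<String> splitAscii(String token) {
        List<String> parts = new ArrayList<String>();
        for (String p : token.split("[^A-Za-z0-9]+")) {
            if (!p.isEmpty()) parts.add(p);
        }
        return parts;
    }

    static List<String> analyze(String text) {
        List<String> out = new ArrayList<String>();
        for (String token : text.split("\\s+")) {
            if (token.isEmpty()) continue;
            if (isAscii(token)) {
                out.addAll(splitAscii(token)); // delimiter handling for English
            } else {
                out.add(token); // leave Indic tokens untouched
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(analyze("साल user@example.com"));
        // [साल, user, example, com]
    }
}
```

This keeps the English/punctuation splitting while guaranteeing the Unicode content is never touched by the delimiter logic.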
> >> > I have to use the deprecated API to set the 5 values; that's fine,
> >> > but somehow it is messing with the Unicode content. How do I get rid
> >> > of that? Any thoughts? It seems that setting those values in some
> >> > proper way might fix the problem, but I'm not sure.
> >> >
> >> > Thanks,
> >> > KK.
> >> >
> >> >
> >> > On Fri, Jun 5, 2009 at 7:37 PM, Robert Muir <rcmuir@gmail.com> wrote:
> >> >
> >> >> KK, an easier solution to your first problem is to use
> >> >> WordDelimiterFilterFactory if possible... you can get an instance of
> >> >> WordDelimiterFilter from that.
> >> >>
> >> >> thanks,
> >> >> robert
> >> >>
> >> >> On Fri, Jun 5, 2009 at 10:06 AM, Robert Muir<rcmuir@gmail.com>
> wrote:
> >> >> > KK, as for your first issue, WordDelimiterFilter is package
> >> >> > protected; one option is to make a copy of the code and change the
> >> >> > class declaration to public.
> >> >> > The other option is to put your entire analyzer in the
> >> >> > 'org.apache.solr.analysis' package so that you can access it...
> >> >> >
> >> >> > For the 2nd issue, yes, you need to supply some options to it. The
> >> >> > default options Solr applies to type 'text' seemed to work well
> >> >> > for me with Indic:
> >> >> >
> >> >> > {splitOnCaseChange=1, generateNumberParts=1, catenateWords=1,
> >> >> > generateWordParts=1, catenateAll=0, catenateNumbers=1}
> >> >> >
> >> >> > On Fri, Jun 5, 2009 at 9:12 AM, KK <dioxide.software@gmail.com>
> >> wrote:
> >> >> >>
> >> >> >> Thanks Robert. There is one problem though: I'm unable to plug
> >> >> >> in the WordDelimiterFilter from the solr-nightly jar file. When I
> >> >> >> tried to do something like,
> >> >> >>  TokenStream ts = new WhitespaceTokenizer(reader);
> >> >> >>   ts = new WordDelimiterFilter(ts);
> >> >> >>   ts = new PorterStemFilter(ts);
> >> >> >>   ...rest as in the last mail...
> >> >> >>
> >> >> >> It gave me an error saying that
> >> >> >>
> >> >> >> org.apache.solr.analysis.WordDelimiterFilter is not public in
> >> >> >> org.apache.solr.analysis; cannot be accessed from outside package
> >> >> >> import org.apache.solr.analysis.WordDelimiterFilter;
> >> >> >>                               ^
> >> >> >> solrSearch/IndicAnalyzer.java:38: cannot find symbol
> >> >> >> symbol  : class WordDelimiterFilter
> >> >> >> location: class solrSearch.IndicAnalyzer
> >> >> >>    ts = new WordDelimiterFilter(ts);
> >> >> >>             ^
> >> >> >> 2 errors
> >> >> >>
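The error quoted above comes from Java's package-private access rule, which this tiny standalone demo illustrates (the class names here are made up for illustration; this is not Solr code):

```java
// A class declared without the 'public' modifier is visible only within
// its own package -- this is why WordDelimiterFilter, declared
// package-private in org.apache.solr.analysis, cannot be imported from
// another package.
class PackagePrivateFilter { // no 'public' modifier: package-private
    String apply(String s) {
        return s.trim();
    }
}

public class AccessDemo { // lives in the same package, so access is allowed
    public static void main(String[] args) {
        System.out.println(new PackagePrivateFilter().apply("  hello  ")); // hello
    }
}
```

This is exactly why Robert's two suggestions work: either make your own copy public, or place your analyzer inside the same package as the filter.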
> >> >> >> Then I tried to look at the code for WordDelimiterFilter in the
> >> >> >> solr-nightly source and found that there are many deprecated
> >> >> >> constructors, though they require a lot of parameters along with
> >> >> >> the TokenStream. I went through the Solr wiki for
> >> >> >> WordDelimiterFilterFactory here,
> >> >> >> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#head-1c9b83870ca7890cd73b193cefed83c283339089
> >> >> >> and there too it is specified that we have to mention the
> >> >> >> parameters, and they are different for indexing and querying.
> >> >> >> I'm kind of stuck here: how do I make use of WordDelimiterFilter
> >> >> >> in my custom analyzer? I have to use it anyway.
> >> >> >> In my code I have to use WordDelimiterFilter and not
> >> >> >> WordDelimiterFilterFactory, right? I don't know what the other
> >> >> >> one is for. Anyway, can you guide me in getting rid of the above
> >> >> >> error? And yes, I'll change the order of applying the filters as
> >> >> >> you said.
> >> >> >>
> >> >> >> Thanks,
> >> >> >> KK.
> >> >> >>
> >> >> >>
> >> >> >>
> >> >> >>
> >> >> >>
> >> >> >>
> >> >> >>
> >> >> >> On Fri, Jun 5, 2009 at 5:48 PM, Robert Muir <rcmuir@gmail.com>
> >> wrote:
> >> >> >>
> >> >> >> > KK, you got the right idea.
> >> >> >> >
> >> >> >> > Though I think you might want to change the order: move the
> >> >> >> > StopFilter before the PorterStemFilter... otherwise it might
> >> >> >> > not work correctly.
> >> >> >> >
> >> >> >> > On Fri, Jun 5, 2009 at 8:05 AM, KK <dioxide.software@gmail.com>
> >> >> wrote:
> >> >> >> >
> >> >> >> > > Thanks Robert. This is exactly what I did, and it's working,
> >> >> >> > > but the delimiter filter is missing; I'm going to add that
> >> >> >> > > from solr-nightly.jar.
> >> >> >> > >
> >> >> >> > > /**
> >> >> >> > >  * Analyzer for Indian language.
> >> >> >> > >  */
> >> >> >> > > public class IndicAnalyzer extends Analyzer {
> >> >> >> > >  public TokenStream tokenStream(String fieldName, Reader
> reader)
> >> {
> >> >> >> > >     TokenStream ts = new WhitespaceTokenizer(reader);
> >> >> >> > >    ts = new PorterStemFilter(ts);
> >> >> >> > >    ts = new LowerCaseFilter(ts);
> >> >> >> > >    ts = new StopFilter(ts, StopAnalyzer.ENGLISH_STOP_WORDS);
> >> >> >> > >    return ts;
> >> >> >> > >  }
> >> >> >> > > }
> >> >> >> > >
> >> >> >> > > It's able to do stemming/case folding and supports search
> >> >> >> > > for both English and Indic text. Let me try out the delimiter
> >> >> >> > > filter; I will update you on that.
> >> >> >> > >
> >> >> >> > > Thanks a lot.
> >> >> >> > > KK
> >> >> >> > >
> >> >> >> > > On Fri, Jun 5, 2009 at 5:30 PM, Robert Muir <rcmuir@gmail.com
> >
> >> >> wrote:
> >> >> >> > >
> >> >> >> > > > I think you are on the right track... once you build your
> >> >> >> > > > analyzer, put it in your classpath and play around with it
> >> >> >> > > > in Luke to see if it does what you want.
> >> >> >> > > >
> >> >> >> > > > On Fri, Jun 5, 2009 at 3:19 AM, KK <
> dioxide.software@gmail.com
> >> >
> >> >> wrote:
> >> >> >> > > >
> >> >> >> > > > > Hi Robert,
> >> >> >> > > > > This is what I copied from ThaiAnalyzer @ lucene contrib
> >> >> >> > > > >
> >> >> >> > > > > public class ThaiAnalyzer extends Analyzer {
> >> >> >> > > > >  public TokenStream tokenStream(String fieldName, Reader
> >> reader)
> >> >> {
> >> >> >> > > > >      TokenStream ts = new StandardTokenizer(reader);
> >> >> >> > > > >    ts = new StandardFilter(ts);
> >> >> >> > > > >    ts = new ThaiWordFilter(ts);
> >> >> >> > > > >    ts = new StopFilter(ts,
> StopAnalyzer.ENGLISH_STOP_WORDS);
> >> >> >> > > > >    return ts;
> >> >> >> > > > >  }
> >> >> >> > > > > }
> >> >> >> > > > >
> >> >> >> > > > > Now, as you said, I have to use WhitespaceTokenizer with
> >> >> >> > > > > WordDelimiterFilter [solr-nightly.jar], stop word
> >> >> >> > > > > removal, Porter stemmer, etc., so it is something like
> >> >> >> > > > > this:
> >> >> >> > > > > public class IndicAnalyzer extends Analyzer {
> >> >> >> > > > >  public TokenStream tokenStream(String fieldName, Reader reader) {
> >> >> >> > > > >   TokenStream ts = new WhitespaceTokenizer(reader);
> >> >> >> > > > >   ts = new WordDelimiterFilter(ts);
> >> >> >> > > > >   ts = new LowerCaseFilter(ts);
> >> >> >> > > > >   ts = new StopFilter(ts, StopAnalyzer.ENGLISH_STOP_WORDS); // english stop filter, is this the default one?
> >> >> >> > > > >   ts = new PorterStemFilter(ts);
> >> >> >> > > > >   return ts;
> >> >> >> > > > >  }
> >> >> >> > > > > }
> >> >> >> > > > >
> >> >> >> > > > > Does this sound OK? I think it will do the job... let me
> >> >> >> > > > > try it out. I don't need a custom filter for my
> >> >> >> > > > > requirements, at least not for the basic things I'm
> >> >> >> > > > > doing, I think...
> >> >> >> > > > >
> >> >> >> > > > > Thanks,
> >> >> >> > > > > KK.
> >> >> >> > > > >
> >> >> >> > > > >
> >> >> >> > > > > On Thu, Jun 4, 2009 at 6:36 PM, Robert Muir <
> >> rcmuir@gmail.com>
> >> >> >> > wrote:
> >> >> >> > > > >
> >> >> >> > > > > > KK, well, you can always find good examples in the
> >> >> >> > > > > > Lucene contrib codebase.
> >> >> >> > > > > > For example, look at the DutchAnalyzer, especially:
> >> >> >> > > > > >
> >> >> >> > > > > > TokenStream tokenStream(String fieldName, Reader reader)
> >> >> >> > > > > >
> >> >> >> > > > > > See how it combines a specified tokenizer with various
> >> >> >> > > > > > filters? This is what you want to do, except of course
> >> >> >> > > > > > you want to use a different tokenizer and filters.
> >> >> >> > > > > >
> >> >> >> > > > > > On Thu, Jun 4, 2009 at 8:53 AM, KK <
> >> >> dioxide.software@gmail.com>
> >> >> >> > > wrote:
> >> >> >> > > > > >
> >> >> >> > > > > > > Thanks Muir.
> >> >> >> > > > > > > Thanks for letting me know that I don't need language
> >> >> >> > > > > > > identifiers. I'll have a look and will try to write
> >> >> >> > > > > > > the analyzer. For my case I think it won't be that
> >> >> >> > > > > > > difficult.
> >> >> >> > > > > > > BTW, can you point me to some sample code/tutorials
> >> >> >> > > > > > > on writing custom analyzers? I could not find
> >> >> >> > > > > > > anything in LIA 2nd Edn. Is something there? Do let
> >> >> >> > > > > > > me know.
> >> >> >> > > > > > >
> >> >> >> > > > > > > Thanks,
> >> >> >> > > > > > > KK.
> >> >> >> > > > > > >
> >> >> >> > > > > > >
> >> >> >> > > > > > >
> >> >> >> > > > > > > On Thu, Jun 4, 2009 at 6:19 PM, Robert Muir <
> >> >> rcmuir@gmail.com>
> >> >> >> > > > wrote:
> >> >> >> > > > > > >
> >> >> >> > > > > > > > KK, for your case, you don't really need to go to
> >> >> >> > > > > > > > the effort of detecting whether fragments are
> >> >> >> > > > > > > > English or not, because the English stemmers in
> >> >> >> > > > > > > > Lucene will not modify your Indic text, and neither
> >> >> >> > > > > > > > will the LowerCaseFilter.
> >> >> >> > > > > > > >
> >> >> >> > > > > > > > What you want to do is create a custom analyzer
> >> >> >> > > > > > > > that works like this:
> >> >> >> > > > > > > >
> >> >> >> > > > > > > > WhitespaceTokenizer with WordDelimiterFilter [from
> >> >> >> > > > > > > > Solr nightly jar], LowerCaseFilter, StopFilter, and
> >> >> >> > > > > > > > PorterStemFilter
> >> >> >> > > > > > > >
> >> >> >> > > > > > > > Thanks,
> >> >> >> > > > > > > > Robert
> >> >> >> > > > > > > >
> >> >> >> > > > > > > > On Thu, Jun 4, 2009 at 8:28 AM, KK <
> >> >> dioxide.software@gmail.com
> >> >> >> > >
> >> >> >> > > > > wrote:
> >> >> >> > > > > > > >
> >> >> >> > > > > > > > > Thank you all.
> >> >> >> > > > > > > > > To be frank, I was using Solr in the beginning,
> >> >> >> > > > > > > > > half a month ago. The problem [rather, bug] with
> >> >> >> > > > > > > > > Solr was the creation of a new index on the fly.
> >> >> >> > > > > > > > > Though they have a RESTful method for the same,
> >> >> >> > > > > > > > > it was not working. If I remember properly, one
> >> >> >> > > > > > > > > of the Solr committers, "Noble Paul" [I don't
> >> >> >> > > > > > > > > know his real name], was trying to help me. I
> >> >> >> > > > > > > > > tried many nightly builds, and spending a couple
> >> >> >> > > > > > > > > of days stuck on that made me think of Lucene,
> >> >> >> > > > > > > > > and I switched to it. Now, after working with
> >> >> >> > > > > > > > > Lucene, which gives you full control of
> >> >> >> > > > > > > > > everything, I don't want to switch back to Solr.
> >> >> >> > > > > > > > > [LOL, to me Solr:Lucene is similar to
> >> >> >> > > > > > > > > Window$:Linux; it's my view only, though.] Coming
> >> >> >> > > > > > > > > back to the point, as Uwe mentioned, we can do in
> >> >> >> > > > > > > > > Lucene the same thing that is available in Solr;
> >> >> >> > > > > > > > > Solr is based on Lucene only, right?
> >> >> >> > > > > > > > > I request Uwe to give me some more ideas on using
> >> >> >> > > > > > > > > the analyzers from Solr that will do the job for
> >> >> >> > > > > > > > > me, handling a mix of both English and
> >> >> >> > > > > > > > > non-English content.
> >> >> >> > > > > > > > > Muir, can you give me a more detailed description
> >> >> >> > > > > > > > > of how to use the WordDelimiterFilter to do my
> >> >> >> > > > > > > > > job?
> >> >> >> > > > > > > > > On a side note, I was thinking of writing a
> >> >> >> > > > > > > > > simple analyzer that will do the following:
> >> >> >> > > > > > > > > #. If the webpage fragment is non-English [for me
> >> >> >> > > > > > > > > it's some Indian language], then index it as
> >> >> >> > > > > > > > > such; no stemming/stop word removal to begin
> >> >> >> > > > > > > > > with. As I know, it's in UCN Unicode, something
> >> >> >> > > > > > > > > like \u0021\u0012\u34ae\u0031 [just a sample].
> >> >> >> > > > > > > > > #. If the fragment is English, then apply the
> >> >> >> > > > > > > > > standard analysis process for English content.
> >> >> >> > > > > > > > > I've not thought about querying with a mix of
> >> >> >> > > > > > > > > non-English and English words as of now.
> >> >> >> > > > > > > > > Now, to get all this:
> >> >> >> > > > > > > > > #1. I need some way to know whether the content
> >> >> >> > > > > > > > > is English or not. If it's not English, just add
> >> >> >> > > > > > > > > the tokens to the document. Do we really need
> >> >> >> > > > > > > > > language identifiers? I don't have any other
> >> >> >> > > > > > > > > content that uses the same script as English,
> >> >> >> > > > > > > > > other than those \u1234 things for my Indian
> >> >> >> > > > > > > > > language content. Any smart hack/trick for the
> >> >> >> > > > > > > > > same?
> >> >> >> > > > > > > > > #2. If it's English, apply all the normal
> >> >> >> > > > > > > > > processing and add the stemmed tokens to the
> >> >> >> > > > > > > > > document.
> >> >> >> > > > > > > > > For all this I was thinking of iterating over
> >> >> >> > > > > > > > > each word of the web page, applying the above
> >> >> >> > > > > > > > > procedure, and finally adding the newly created
> >> >> >> > > > > > > > > document to the index.
> >> >> >> > > > > > > > > I would like someone to guide me in this
> >> >> >> > > > > > > > > direction. I'm pretty sure people must have done
> >> >> >> > > > > > > > > a similar or the same thing earlier; I request
> >> >> >> > > > > > > > > them to guide me or point me to some tutorials
> >> >> >> > > > > > > > > for the same. Or else help me write a custom
> >> >> >> > > > > > > > > analyzer, but only if that's not going to be too
> >> >> >> > > > > > > > > complex. LOL, I'm a new Lucene user and know the
> >> >> >> > > > > > > > > basics of Java coding.
> >> >> >> > > > > > > > > Thank you very much.
> >> >> >> > > > > > > > >
> >> >> >> > > > > > > > > --KK.
> >> >> >> > > > > > > > >
> >> >> >> > > > > > > > >
> >> >> >> > > > > > > > >
> >> >> >> > > > > > > > > On Thu, Jun 4, 2009 at 5:30 PM, Robert Muir <
> >> >> >> > rcmuir@gmail.com>
> >> >> >> > > > > > wrote:
> >> >> >> > > > > > > > >
> >> >> >> > > > > > > > > > Yes, this is true. For starters, KK, it might
> >> >> >> > > > > > > > > > be good to start up Solr and look at
> >> >> >> > > > > > > > > > http://localhost:8983/solr/admin/analysis.jsp?highlight=on
> >> >> >> > > > > > > > > >
> >> >> >> > > > > > > > > > If you want to stick with Lucene, the
> >> >> >> > > > > > > > > > WordDelimiterFilter is the piece you will want
> >> >> >> > > > > > > > > > for your text, mainly for punctuation but also
> >> >> >> > > > > > > > > > for format characters such as ZWJ/ZWNJ.
> >> >> >> > > > > > > > > >
> >> >> >> > > > > > > > > > On Thu, Jun 4, 2009 at 7:51 AM, Uwe Schindler <
> >> >> >> > > uwe@thetaphi.de
> >> >> >> > > > >
> >> >> >> > > > > > > wrote:
> >> >> >> > > > > > > > > >
> >> >> >> > > > > > > > > > > You can also re-use the Solr analyzers, as
> >> >> >> > > > > > > > > > > far as I found out. There is an issue in JIRA
> >> >> >> > > > > > > > > > > and a discussion on java-dev about merging
> >> >> >> > > > > > > > > > > them.
> >> >> >> > > > > > > > > > >
> >> >> >> > > > > > > > > > > -----
> >> >> >> > > > > > > > > > > Uwe Schindler
> >> >> >> > > > > > > > > > > H.-H.-Meier-Allee 63, D-28213 Bremen
> >> >> >> > > > > > > > > > > http://www.thetaphi.de
> >> >> >> > > > > > > > > > > eMail: uwe@thetaphi.de
> >> >> >> > > > > > > > > > >
> >> >> >> > > > > > > > > > >
> >> >> >> > > > > > > > > > > > -----Original Message-----
> >> >> >> > > > > > > > > > > > From: Robert Muir [mailto:rcmuir@gmail.com]
> >> >> >> > > > > > > > > > > > Sent: Thursday, June 04, 2009 1:18 PM
> >> >> >> > > > > > > > > > > > To: java-user@lucene.apache.org
> >> >> >> > > > > > > > > > > > Subject: Re: How to support stemming and
> case
> >> >> folding
> >> >> >> > for
> >> >> >> > > > > > english
> >> >> >> > > > > > > > > > content
> >> >> >> > > > > > > > > > > > mixed with non-english content?
> >> >> >> > > > > > > > > > > >
> >> >> >> > > > > > > > > > > > KK, ok, so you only really want to stem
> >> >> >> > > > > > > > > > > > the English. This is good.
> >> >> >> > > > > > > > > > > >
> >> >> >> > > > > > > > > > > > Is it possible for you to consider using
> >> >> >> > > > > > > > > > > > Solr? Solr's default analyzer for type
> >> >> >> > > > > > > > > > > > 'text' will be good for your case. It will
> >> >> >> > > > > > > > > > > > do the following:
> >> >> >> > > > > > > > > > > > 1. tokenize on whitespace
> >> >> >> > > > > > > > > > > > 2. handle both Indian-language and English punctuation
> >> >> >> > > > > > > > > > > > 3. lowercase the English
> >> >> >> > > > > > > > > > > > 4. stem the English
> >> >> >> > > > > > > > > > > >
> >> >> >> > > > > > > > > > > > Try a nightly build:
> >> >> >> > > > > > > > > > > > http://people.apache.org/builds/lucene/solr/nightly/
> >> >> >> > > > > > > > > > > >
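On the lowercasing point in the quoted list above: plain Unicode lowercasing is effectively a no-op for Devanagari (the script has no case distinction), which is why it is safe to run over mixed content. A tiny standalone check of mine (plain Java, not Solr code):

```java
import java.util.Locale;

public class LowerCaseCheck {
    public static void main(String[] args) {
        // Only the Latin letters change; the Devanagari passes through as-is.
        String mixed = "Detection साल";
        System.out.println(mixed.toLowerCase(Locale.ROOT)); // detection साल
        System.out.println("साल".toLowerCase(Locale.ROOT).equals("साल")); // true
    }
}
```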
> >> >> >> > > > > > > > > > > > On Thu, Jun 4, 2009 at 1:12 AM, KK <
> >> >> >> > > > > dioxide.software@gmail.com
> >> >> >> > > > > > >
> >> >> >> > > > > > > > > wrote:
> >> >> >> > > > > > > > > > > >
> >> >> >> > > > > > > > > > > > > Muir, thanks for your response.
> >> >> >> > > > > > > > > > > > > I'm indexing Indian-language web pages
> >> >> >> > > > > > > > > > > > > which have a decent amount of English
> >> >> >> > > > > > > > > > > > > content mixed in. For the time being I'm
> >> >> >> > > > > > > > > > > > > not going to use any stemmers, as we
> >> >> >> > > > > > > > > > > > > don't have standard stemmers for Indian
> >> >> >> > > > > > > > > > > > > languages. So what I want to do is this:
> >> >> >> > > > > > > > > > > > > say I have a web page with Hindi content
> >> >> >> > > > > > > > > > > > > and 5% English content. For the Hindi I
> >> >> >> > > > > > > > > > > > > want to use the basic whitespace
> >> >> >> > > > > > > > > > > > > analyzer, as we don't have stemmers for
> >> >> >> > > > > > > > > > > > > it, as I mentioned earlier, and wherever
> >> >> >> > > > > > > > > > > > > English appears I want it to be stemmed,
> >> >> >> > > > > > > > > > > > > tokenized, etc. [the standard process
> >> >> >> > > > > > > > > > > > > used for English content]. As of now I'm
> >> >> >> > > > > > > > > > > > > using the whitespace analyzer for the
> >> >> >> > > > > > > > > > > > > full content, which doesn't support case
> >> >> >> > > > > > > > > > > > > folding, stemming, etc. So if there is an
> >> >> >> > > > > > > > > > > > > English word, say "Detection", indexed as
> >> >> >> > > > > > > > > > > > > such, then searching for "detection" or
> >> >> >> > > > > > > > > > > > > "detect" gives no results, which is the
> >> >> >> > > > > > > > > > > > > expected behavior, but I want such
> >> >> >> > > > > > > > > > > > > queries to give results.
> >> >> >> > > > > > > > > > > > > I hope I made it clear. Let me know any
> >> >> >> > > > > > > > > > > > > ideas on doing the same. And one more
> >> >> >> > > > > > > > > > > > > thing: I'm storing the full webpage
> >> >> >> > > > > > > > > > > > > content under a single field; I hope this
> >> >> >> > > > > > > > > > > > > will not make any difference, right?
> >> >> >> > > > > > > > > > > > > It seems I have to use language
> >> >> >> > > > > > > > > > > > > identifiers, but do we really need them?
> >> >> >> > > > > > > > > > > > > Because we have only non-English content
> >> >> >> > > > > > > > > > > > > mixed with English [and not French or
> >> >> >> > > > > > > > > > > > > Russian, etc.].
> >> >> >> > > > > > > > > > > > >
> >> >> >> > > > > > > > > > > > > What is the best way of approaching the
> >> >> >> > > > > > > > > > > > > problem? Any thoughts?
> >> >> >> > > > > > > > > > > > >
> >> >> >> > > > > > > > > > > > > Thanks,
> >> >> >> > > > > > > > > > > > > KK.
> >> >> >> > > > > > > > > > > > >
> >> >> >> > > > > > > > > > > > > On Wed, Jun 3, 2009 at 9:42 PM, Robert
> Muir <
> >> >> >> > > > > > rcmuir@gmail.com>
> >> >> >> > > > > > > > > > wrote:
> >> >> >> > > > > > > > > > > > >
> >> >> >> > > > > > > > > > > > > > KK, is all of your Latin-script text
> >> >> >> > > > > > > > > > > > > > actually English? Is there stuff like
> >> >> >> > > > > > > > > > > > > > German or French mixed in?
> >> >> >> > > > > > > > > > > > > >
> >> >> >> > > > > > > > > > > > > > And for your non-English content (your
> >> >> >> > > > > > > > > > > > > > examples have been Indian writing
> >> >> >> > > > > > > > > > > > > > systems), is it generally true that if
> >> >> >> > > > > > > > > > > > > > you have Devanagari, you can assume
> >> >> >> > > > > > > > > > > > > > it's Hindi? Or is there stuff like
> >> >> >> > > > > > > > > > > > > > Marathi mixed in?
> >> >> >> > > > > > > > > > > > > >
> >> >> >> > > > > > > > > > > > > > The reason I ask is that to invoke the
> >> >> >> > > > > > > > > > > > > > right stemmers you really need some
> >> >> >> > > > > > > > > > > > > > language detection, but perhaps in your
> >> >> >> > > > > > > > > > > > > > case you can cheat and detect this
> >> >> >> > > > > > > > > > > > > > based on scripts...
> >> >> >> > > > > > > > > > > > > >
> >> >> >> > > > > > > > > > > > > > Thanks,
> >> >> >> > > > > > > > > > > > > > Robert
> >> >> >> > > > > > > > > > > > > >
> >> >> >> > > > > > > > > > > > > >
> >> >> >> > > > > > > > > > > > > > On Wed, Jun 3, 2009 at 10:15 AM, KK <
> >> >> >> > > > > > > > dioxide.software@gmail.com>
> >> >> >> > > > > > > > > > > > wrote:
> >> >> >> > > > > > > > > > > > > >
> >> >> >> > > > > > > > > > > > > > > Hi All,
> >> >> >> > > > > > > > > > > > > > > I'm indexing some non-English
> >> >> >> > > > > > > > > > > > > > > content, but the pages also contain
> >> >> >> > > > > > > > > > > > > > > English content. As of now I'm using
> >> >> >> > > > > > > > > > > > > > > WhitespaceAnalyzer for all content,
> >> >> >> > > > > > > > > > > > > > > and I'm storing the full webpage
> >> >> >> > > > > > > > > > > > > > > content under a single field. Now we
> >> >> >> > > > > > > > > > > > > > > need to support case folding and
> >> >> >> > > > > > > > > > > > > > > stemming for the English content
> >> >> >> > > > > > > > > > > > > > > intermingled with the non-English
> >> >> >> > > > > > > > > > > > > > > content. I must mention that we don't
> >> >> >> > > > > > > > > > > > > > > have stemming and case folding for
> >> >> >> > > > > > > > > > > > > > > the non-English content. I'm stuck
> >> >> >> > > > > > > > > > > > > > > with this; please let me know how to
> >> >> >> > > > > > > > > > > > > > > proceed to fix this issue.
> >> >> >> > > > > > > > > > > > > > >
> >> >> >> > > > > > > > > > > > > > > Thanks,
> >> >> >> > > > > > > > > > > > > > > KK.
> >> >> >> > > > > > > > > > > > > > >
> >> >> >> > > > > > > > > > > > > >
> >> >> >> > > > > > > > > > > > > >
> >> >> >> > > > > > > > > > > > > >
> >> >> >> > > > > > > > > > > > > > --
> >> >> >> > > > > > > > > > > > > > Robert Muir
> >> >> >> > > > > > > > > > > > > > rcmuir@gmail.com
> >> >> >> > > > > > > > > > > > > >
> >> >> >> > > > > > > > > > > > >
> >> >> >> > > > > > > > > > > >
> >> >> >> > > > > > > > > > > >
> >> >> >> > > > > > > > > > > >
> >> >> >> > > > > > > > > > > > --
> >> >> >> > > > > > > > > > > > Robert Muir
> >> >> >> > > > > > > > > > > > rcmuir@gmail.com
> >> >> >> > > > > > > > > > >
> >> >> >> > > > > > > > > > >
> >> >> >> > > > > > > > > > >
> >> >> >> > > > > > >
> >> >> >> > >
> >> >> ---------------------------------------------------------------------
> >> >> >> > > > > > > > > > > To unsubscribe, e-mail:
> >> >> >> > > > > java-user-unsubscribe@lucene.apache.org
> >> >> >> > > > > > > > > > > For additional commands, e-mail:
> >> >> >> > > > > > java-user-help@lucene.apache.org
> >> >> >> > > > > > > > > > >
> >> >> >> > > > > > > > > > >
> >> >> >> > > > > > > > > >
> >> >> >> > > > > > > > > >
> >> >> >> > > > > > > > > > --
> >> >> >> > > > > > > > > > Robert Muir
> >> >> >> > > > > > > > > > rcmuir@gmail.com
> >> >> >> > > > > > > > > >
> >> >> >> > > > > > > > >
> >> >> >> > > > > > > >
> >> >> >> > > > > > > >
> >> >> >> > > > > > > >
> >> >> >> > > > > > > > --
> >> >> >> > > > > > > > Robert Muir
> >> >> >> > > > > > > > rcmuir@gmail.com
> >> >> >> > > > > > > >
> >> >> >> > > > > > >
> >> >> >> > > > > >
> >> >> >> > > > > >
> >> >> >> > > > > >
> >> >> >> > > > > > --
> >> >> >> > > > > > Robert Muir
> >> >> >> > > > > > rcmuir@gmail.com
> >> >> >> > > > > >
> >> >> >> > > > >
> >> >> >> > > >
> >> >> >> > > >
> >> >> >> > > >
> >> >> >> > > > --
> >> >> >> > > > Robert Muir
> >> >> >> > > > rcmuir@gmail.com
> >> >> >> > > >
> >> >> >> > >
> >> >> >> >
> >> >> >> >
> >> >> >> >
> >> >> >> > --
> >> >> >> > Robert Muir
> >> >> >> > rcmuir@gmail.com
> >> >> >> >
> >> >> >
> >> >> >
> >> >> >
> >> >> > --
> >> >> > Robert Muir
> >> >> > rcmuir@gmail.com
> >> >> >
> >> >>
> >> >>
> >> >>
> >> >> --
> >> >> Robert Muir
> >> >> rcmuir@gmail.com
> >> >>
> >> >>
> >> >>
> >> >
> >>
> >>
> >>
> >> --
> >> Robert Muir
> >> rcmuir@gmail.com
> >>
> >>
> >>
> >
>
>
>
> --
> Robert Muir
> rcmuir@gmail.com
>
>
>
