lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From KK <dioxide.softw...@gmail.com>
Subject Re: How to support stemming and case folding for english content mixed with non-english content?
Date Mon, 08 Jun 2009 05:03:47 GMT
Hi Robert,
The problem is that worddelimiterfilter is doing its job for english content
but for non-english indian content which are unicoded it highlights the
searched word but alongwith that it also highlights the characters of that
word which was not hapenning without worddelimfilter, thats my concern. Say
for example I searched for a hindi word say "xyz ab" [assume these are in
hindi]  then in the search results it highlights these words but it also
highlights x/y/z/a/b whereever these letters appear which is obiviously
sounds bad. it should only highlight words not the letters therein. I hope I
made it clear. What could be the reason for this? Any idea on fixing the
same.

Thanks,
KK

On Sat, Jun 6, 2009 at 9:45 PM, Robert Muir <rcmuir@gmail.com> wrote:

> kk, i haven't had that experience with worddelimiterfilter on indian
> languages, is it possible you could provide me an example of how its
> creating nuisance?
>
> On Sat, Jun 6, 2009 at 9:42 AM, KK<dioxide.software@gmail.com> wrote:
> > Robert, I tried to use worddelimiterfilter from solr-nightly by putting
> it
> > in my working directory for this project, I set the parameters as you
> told
> > me. I must accept that its splitting words around those chars[like . @
> > etc]but alongwith that its messing with other non-english/unicode
> contents
> > and thats creating nuisance. I dont want worddelimiterfilter to fiddle
> > around with my non-english content.
> > This is what I'm doing,
> > /**
> >  * Analyzer for Indian language.
> >  */
> > public class IndicAnalyzer extends Analyzer {
> >  public TokenStream tokenStream(String fieldName, Reader reader) {
> >    TokenStream ts = new WhitespaceTokenizer(reader);
> >    ts = new WordDelimiterFilter(ts, 1, 1, 1, 1, 0);
> >    ts = new StopFilter(ts, StopAnalyzer.ENGLISH_STOP_WORDS);
> >    ts = new LowerCaseFilter(ts);
> >    ts = new PorterStemFilter(ts);
> >    return ts;
> >  }
> > }
> >
> > I've to use the deprecated API for setting 5 values, thats fine, but
> somehow
> > its messing with unicode content. How to get rid of that? Any thougts? It
> > seems setting those values is some proper way might fix the problem, I'm
> not
> > sure, though.
> >
> > Thanks,
> > KK.
> >
> >
> > On Fri, Jun 5, 2009 at 7:37 PM, Robert Muir <rcmuir@gmail.com> wrote:
> >
> >> kk an easier solution to your first problem is to use
> >> worddelimiterfilterfactory if possible... you can get an instance of
> >> worddelimiter filter from that.
> >>
> >> thanks,
> >> robert
> >>
> >> On Fri, Jun 5, 2009 at 10:06 AM, Robert Muir<rcmuir@gmail.com> wrote:
> >> > kk as for your first issue, that WordDelimiterFilter is package
> >> > protected, one option is to make a copy of the code and change the
> >> > class declaration to public.
> >> > the other option is to put your entire analyzer in
> >> > 'org.apache.solr.analysis' package so that you can access it...
> >> >
> >> > for the 2nd issue, yes you need to supply some options to it. the
> >> > default options solr applies to type 'text' seemed to work well for me
> >> > with indic:
> >> >
> >> > {splitOnCaseChange=1, generateNumberParts=1, catenateWords=1,
> >> > generateWordParts=1, catenateAll=0, catenateNumbers=1}
> >> >
> >> > On Fri, Jun 5, 2009 at 9:12 AM, KK <dioxide.software@gmail.com>
> wrote:
> >> >>
> >> >> Thanks Robert. There is one problem though, I'm able to plugin the
> word
> >> >> delimiter filter from solr-nightly jar file. When I tried to do
> >> something
> >> >> like,
> >> >>  TokenStream ts = new WhitespaceTokenizer(reader);
> >> >>   ts = new WordDelimiterFilter(ts);
> >> >>   ts = new PorterStemmerFilter(ts);
> >> >>   ...rest as in the last mail...
> >> >>
> >> >> It gave me an error saying that
> >> >>
> >> >> org.apache.solr.analysis.WordDelimiterFilter is not public in
> >> >> org.apache.solr.analysis; cannot be accessed from outside package
> >> >> import org.apache.solr.analysis.WordDelimiterFilter;
> >> >>                               ^
> >> >> solrSearch/IndicAnalyzer.java:38: cannot find symbol
> >> >> symbol  : class WordDelimiterFilter
> >> >> location: class solrSearch.IndicAnalyzer
> >> >>    ts = new WordDelimiterFilter(ts);
> >> >>             ^
> >> >> 2 errors
> >> >>
> >> >> Then i tried to see the code for worddelimitefiter from solrnightly
> src
> >> and
> >> >> found that there are many deprecated constructors though they require
> a
> >> lot
> >> >> of parameters alongwith tokenstream. I went through the solr wiki for
> >> >> worddelimiterfilterfactory here,
> >> >>
> >>
> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#head-1c9b83870ca7890cd73b193cefed83c283339089
> >> >> and say that there also its specified that we've to mention the
> >> parameters
> >> >> and both are different for indexing and querying.
> >> >> I'm kind of stuck here, how do I make use of worddelimiterfilter in
> my
> >> >> custom analyzer, I've to use it anyway.
> >> >> In my code I've to make use of worddelimiterfilter and not
> >> >> worddelimiterfilterfactory, right? I don't know whats the use of the
> >> other
> >> >> one. Anyway can you guide me getting rid of the above error. And yes
> >> I'll
> >> >> change the order of applying the filters as you said.
> >> >>
> >> >> Thanks,
> >> >> KK.
> >> >>
> >> >>
> >> >>
> >> >>
> >> >>
> >> >>
> >> >>
> >> >> On Fri, Jun 5, 2009 at 5:48 PM, Robert Muir <rcmuir@gmail.com>
> wrote:
> >> >>
> >> >> > KK, you got the right idea.
> >> >> >
> >> >> > though I think you might want to change the order, move the
> stopfilter
> >> >> > before the porter stem filter... otherwise it might not work
> >> correctly.
> >> >> >
> >> >> > On Fri, Jun 5, 2009 at 8:05 AM, KK <dioxide.software@gmail.com>
> >> wrote:
> >> >> >
> >> >> > > Thanks Robert. This is exactly what I did and  its working but
> >> delimiter
> >> >> > is
> >> >> > > missing I'm going to add that from solr-nightly.jar
> >> >> > >
> >> >> > > /**
> >> >> > >  * Analyzer for Indian language.
> >> >> > >  */
> >> >> > > public class IndicAnalyzer extends Analyzer {
> >> >> > >  public TokenStream tokenStream(String fieldName, Reader reader)
> {
> >> >> > >     TokenStream ts = new WhitespaceTokenizer(reader);
> >> >> > >    ts = new PorterStemFilter(ts);
> >> >> > >    ts = new LowerCaseFilter(ts);
> >> >> > >    ts = new StopFilter(ts, StopAnalyzer.ENGLISH_STOP_WORDS);
> >> >> > >    return ts;
> >> >> > >  }
> >> >> > > }
> >> >> > >
> >> >> > > Its able to do stemming/case-folding and supports search for both
> >> english
> >> >> > > and indic texts. let me try out the delimiter. Will update you on
> >> that.
> >> >> > >
> >> >> > > Thanks a lot.
> >> >> > > KK
> >> >> > >
> >> >> > > On Fri, Jun 5, 2009 at 5:30 PM, Robert Muir <rcmuir@gmail.com>
> >> wrote:
> >> >> > >
> >> >> > > > i think you are on the right track... once you build your
> >> analyzer, put
> >> >> > > it
> >> >> > > > in your classpath and play around with it in luke and see if it
> >> does
> >> >> > what
> >> >> > > > you want.
> >> >> > > >
> >> >> > > > On Fri, Jun 5, 2009 at 3:19 AM, KK <dioxide.software@gmail.com
> >
> >> wrote:
> >> >> > > >
> >> >> > > > > Hi Robert,
> >> >> > > > > This is what I copied from ThaiAnalyzer @ lucene contrib
> >> >> > > > >
> >> >> > > > > public class ThaiAnalyzer extends Analyzer {
> >> >> > > > >  public TokenStream tokenStream(String fieldName, Reader
> reader)
> >> {
> >> >> > > > >      TokenStream ts = new StandardTokenizer(reader);
> >> >> > > > >    ts = new StandardFilter(ts);
> >> >> > > > >    ts = new ThaiWordFilter(ts);
> >> >> > > > >    ts = new StopFilter(ts, StopAnalyzer.ENGLISH_STOP_WORDS);
> >> >> > > > >    return ts;
> >> >> > > > >  }
> >> >> > > > > }
> >> >> > > > >
> >> >> > > > > Now as you said, I've to use whitespacetokenizer
> >> >> > > > > withworddelimitefilter[solr
> >> >> > > > > nightly.jar] stop wordremoval, porter stemmer etc , so it is
> >> >> > something
> >> >> > > > like
> >> >> > > > > this,
> >> >> > > > > public class IndicAnalyzer extends Analyzer {
> >> >> > > > >  public TokenStream tokenStream(String fieldName, Reader
> reader)
> >> {
> >> >> > > > >   TokenStream ts = new WhiteSpaceTokenizer(reader);
> >> >> > > > >   ts = new WordDelimiterFilter(ts);
> >> >> > > > >   ts = new LowerCaseFilter(ts);
> >> >> > > > >   ts = new StopFilter(ts, StopAnalyzer.ENGLISH_STOP_WORDS)
> //
> >> >> > english
> >> >> > > > > stop filter, is this the default one?
> >> >> > > > >   ts = new PorterFilter(ts);
> >> >> > > > >   return ts;
> >> >> > > > >  }
> >> >> > > > > }
> >> >> > > > >
> >> >> > > > > Does this sound OK? I think it will do the job...let me try
> it
> >> out..
> >> >> > > > > I dont need custom filter as per my requirement, at least not
> >> for
> >> >> > these
> >> >> > > > > basic things I'm doing? I think so...
> >> >> > > > >
> >> >> > > > > Thanks,
> >> >> > > > > KK.
> >> >> > > > >
> >> >> > > > >
> >> >> > > > > On Thu, Jun 4, 2009 at 6:36 PM, Robert Muir <
> rcmuir@gmail.com>
> >> >> > wrote:
> >> >> > > > >
> >> >> > > > > > KK well you can always get some good examples from the
> lucene
> >> >> > contrib
> >> >> > > > > > codebase.
> >> >> > > > > > For example, look at the DutchAnalyzer, especially:
> >> >> > > > > >
> >> >> > > > > > TokenStream tokenStream(String fieldName, Reader reader)
> >> >> > > > > >
> >> >> > > > > > See how it combines a specified tokenizer with various
> >> filters?
> >> >> > this
> >> >> > > is
> >> >> > > > > > what
> >> >> > > > > > you want to do, except of course you want to use different
> >> >> > tokenizer
> >> >> > > > and
> >> >> > > > > > filters.
> >> >> > > > > >
> >> >> > > > > > On Thu, Jun 4, 2009 at 8:53 AM, KK <
> >> dioxide.software@gmail.com>
> >> >> > > wrote:
> >> >> > > > > >
> >> >> > > > > > > Thanks Muir.
> >> >> > > > > > > Thanks for letting me know that I dont need language
> >> identifiers.
> >> >> > > > > > >  I'll have a look and will try to write the analyzer. For
> my
> >> case
> >> >> > I
> >> >> > > > > think
> >> >> > > > > > > it
> >> >> > > > > > > wont be that difficult.
> >> >> > > > > > > BTW, can you point me to some sample codes/tutorials
> writing
> >> >> > custom
> >> >> > > > > > > analyzers. I could not find something in LIA2ndEdn. Is
> >> something
> >> >> > > > htere?
> >> >> > > > > > do
> >> >> > > > > > > let me know.
> >> >> > > > > > >
> >> >> > > > > > > Thanks,
> >> >> > > > > > > KK.
> >> >> > > > > > >
> >> >> > > > > > >
> >> >> > > > > > >
> >> >> > > > > > > On Thu, Jun 4, 2009 at 6:19 PM, Robert Muir <
> >> rcmuir@gmail.com>
> >> >> > > > wrote:
> >> >> > > > > > >
> >> >> > > > > > > > KK, for your case, you don't really need to go to the
> >> effort of
> >> >> > > > > > detecting
> >> >> > > > > > > > whether fragments are english or not.
> >> >> > > > > > > > Because the English stemmers in lucene will not modify
> >> your
> >> >> > Indic
> >> >> > > > > text,
> >> >> > > > > > > and
> >> >> > > > > > > > neither will the LowerCaseFilter.
> >> >> > > > > > > >
> >> >> > > > > > > > what you want to do is create a custom analyzer that
> works
> >> like
> >> >> > > > this
> >> >> > > > > > > >
> >> >> > > > > > > > -WhitespaceTokenizer with WordDelimiterFilter [from
> Solr
> >> >> > nightly
> >> >> > > > > jar],
> >> >> > > > > > > > LowerCaseFilter, StopFilter, and PorterStemFilter-
> >> >> > > > > > > >
> >> >> > > > > > > > Thanks,
> >> >> > > > > > > > Robert
> >> >> > > > > > > >
> >> >> > > > > > > > On Thu, Jun 4, 2009 at 8:28 AM, KK <
> >> dioxide.software@gmail.com
> >> >> > >
> >> >> > > > > wrote:
> >> >> > > > > > > >
> >> >> > > > > > > > > Thank you all.
> >> >> > > > > > > > > To be frank I was using Solr in the begining half a
> >> month
> >> >> > ago.
> >> >> > > > The
> >> >> > > > > > > > > problem[rather bug] with solr was creation of new
> index
> >> on
> >> >> > the
> >> >> > > > fly.
> >> >> > > > > > > > Though
> >> >> > > > > > > > > they have a restful method for teh same, but it was
> not
> >> >> > > working.
> >> >> > > > If
> >> >> > > > > I
> >> >> > > > > > > > > remember properly one of Solr commiter "Noble Paul"[I
> >> dont
> >> >> > know
> >> >> > > > his
> >> >> > > > > > > real
> >> >> > > > > > > > > name] was trying to help me. I tried many nightly
> builds
> >> and
> >> >> > > > > spending
> >> >> > > > > > a
> >> >> > > > > > > > > couple of days stuck at that made me think of lucene
> and
> >> I
> >> >> > > > switched
> >> >> > > > > > to
> >> >> > > > > > > > it.
> >> >> > > > > > > > > Now after working with lucene which gives you full
> >> control of
> >> >> > > > > > > everything
> >> >> > > > > > > > I
> >> >> > > > > > > > > don't want to switch to Solr.[LOL, to me Solr:Lucene
> is
> >> >> > similar
> >> >> > > > to
> >> >> > > > > > > > > Window$:Linux, its my view only, though]. Coming back
> to
> >> the
> >> >> > > > point
> >> >> > > > > as
> >> >> > > > > > > Uwe
> >> >> > > > > > > > > mentioned that we can do the same thing in lucene as
> >> well,
> >> >> > what
> >> >> > > > is
> >> >> > > > > > > > > available
> >> >> > > > > > > > > in Solr, Solr is based on Lucene only, right?
> >> >> > > > > > > > > I request Uwe to give me some more ideas on using the
> >> >> > analyzers
> >> >> > > > > from
> >> >> > > > > > > solr
> >> >> > > > > > > > > that will do the job for me, handling a mix of both
> >> english
> >> >> > and
> >> >> > > > > > > > non-english
> >> >> > > > > > > > > content.
> >> >> > > > > > > > > Muir, can you give me a bit detail description of how
> to
> >> use
> >> >> > > the
> >> >> > > > > > > > > WordDelimiteFilter to do my job.
> >> >> > > > > > > > > On a side note, I was thingking of writing a simple
> >> analyzer
> >> >> > > that
> >> >> > > > > > will
> >> >> > > > > > > do
> >> >> > > > > > > > > the following,
> >> >> > > > > > > > > #. If the webpage fragment is non-english[for me its
> >> some
> >> >> > > indian
> >> >> > > > > > > > language]
> >> >> > > > > > > > > then index them as such, no stemming/ stop word
> removal
> >> to
> >> >> > > begin
> >> >> > > > > > with.
> >> >> > > > > > > As
> >> >> > > > > > > > I
> >> >> > > > > > > > > know its in UCN unicode something like
> >> >> > > > > \u0021\u0012\u34ae\u0031[just
> >> >> > > > > > a
> >> >> > > > > > > > > sample]
> >> >> > > > > > > > > # If the fragment is english then apply standard
> >> anlyzing
> >> >> > > process
> >> >> > > > > for
> >> >> > > > > > > > > english content. I've not thought of quering in the
> same
> >> way
> >> >> > as
> >> >> > > > of
> >> >> > > > > > now
> >> >> > > > > > > > i.e
> >> >> > > > > > > > > mix of non-english and engish words.
> >> >> > > > > > > > > Now to get all this,
> >> >> > > > > > > > >  #1. I need some sort of way which will let me know
> if
> >> the
> >> >> > > > content
> >> >> > > > > is
> >> >> > > > > > > > > english or not. If not english just add the tokens to
> >> the
> >> >> > > > document.
> >> >> > > > > > Do
> >> >> > > > > > > we
> >> >> > > > > > > > > really need language identifiers, as i dont have any
> >> other
> >> >> > > > content
> >> >> > > > > > that
> >> >> > > > > > > > > uses
> >> >> > > > > > > > > the same script as english other than those \u1234
> >> things for
> >> >> > > my
> >> >> > > > > > indian
> >> >> > > > > > > > > language content. Any smart hack/trick for the same?
> >> >> > > > > > > > >  #2. If the its english apply all normal process and
> add
> >> the
> >> >> > > > > stemmed
> >> >> > > > > > > > token
> >> >> > > > > > > > > to document.
> >> >> > > > > > > > > For all this I was thinking of iterating earch word
> of
> >> the
> >> >> > web
> >> >> > > > page
> >> >> > > > > > and
> >> >> > > > > > > > > apply the above procedure. And finallyadd  the newly
> >> created
> >> >> > > > > document
> >> >> > > > > > > to
> >> >> > > > > > > > > the
> >> >> > > > > > > > > index.
> >> >> > > > > > > > >
> >> >> > > > > > > > > I would like some one to guide me in this direction.
> I'm
> >> >> > pretty
> >> >> > > > > > people
> >> >> > > > > > > > must
> >> >> > > > > > > > > have done similar/same thing earlier, I request them
> to
> >> guide
> >> >> > > me/
> >> >> > > > > > point
> >> >> > > > > > > > me
> >> >> > > > > > > > > to some tutorials for the same.
> >> >> > > > > > > > > Else help me out writing a custom analyzer only if
> thats
> >> not
> >> >> > > > going
> >> >> > > > > to
> >> >> > > > > > > be
> >> >> > > > > > > > > too
> >> >> > > > > > > > > complex. LOL, I'm a new user to lucene and know
> basics
> >> of
> >> >> > Java
> >> >> > > > > > coding.
> >> >> > > > > > > > > Thank you very much.
> >> >> > > > > > > > >
> >> >> > > > > > > > > --KK.
> >> >> > > > > > > > >
> >> >> > > > > > > > >
> >> >> > > > > > > > >
> >> >> > > > > > > > > On Thu, Jun 4, 2009 at 5:30 PM, Robert Muir <
> >> >> > rcmuir@gmail.com>
> >> >> > > > > > wrote:
> >> >> > > > > > > > >
> >> >> > > > > > > > > > yes this is true. for starters KK, might be good to
> >> startup
> >> >> > > > solr
> >> >> > > > > > and
> >> >> > > > > > > > look
> >> >> > > > > > > > > > at
> >> >> > > > > > > > > >
> >> http://localhost:8983/solr/admin/analysis.jsp?highlight=on
> >> >> > > > > > > > > >
> >> >> > > > > > > > > > if you want to stick with lucene, the
> >> WordDelimiterFilter
> >> >> > is
> >> >> > > > the
> >> >> > > > > > > piece
> >> >> > > > > > > > > you
> >> >> > > > > > > > > > will want for your text, mainly for punctuation but
> >> also
> >> >> > for
> >> >> > > > > format
> >> >> > > > > > > > > > characters such as ZWJ/ZWNJ.
> >> >> > > > > > > > > >
> >> >> > > > > > > > > > On Thu, Jun 4, 2009 at 7:51 AM, Uwe Schindler <
> >> >> > > uwe@thetaphi.de
> >> >> > > > >
> >> >> > > > > > > wrote:
> >> >> > > > > > > > > >
> >> >> > > > > > > > > > > You can also re-use the solr analyzers, as far as
> I
> >> found
> >> >> > > > out.
> >> >> > > > > > > There
> >> >> > > > > > > > is
> >> >> > > > > > > > > > an
> >> >> > > > > > > > > > > issue in jIRA/discussion on java-dev to merge
> them.
> >> >> > > > > > > > > > >
> >> >> > > > > > > > > > > -----
> >> >> > > > > > > > > > > Uwe Schindler
> >> >> > > > > > > > > > > H.-H.-Meier-Allee 63, D-28213 Bremen
> >> >> > > > > > > > > > > http://www.thetaphi.de
> >> >> > > > > > > > > > > eMail: uwe@thetaphi.de
> >> >> > > > > > > > > > >
> >> >> > > > > > > > > > >
> >> >> > > > > > > > > > > > -----Original Message-----
> >> >> > > > > > > > > > > > From: Robert Muir [mailto:rcmuir@gmail.com]
> >> >> > > > > > > > > > > > Sent: Thursday, June 04, 2009 1:18 PM
> >> >> > > > > > > > > > > > To: java-user@lucene.apache.org
> >> >> > > > > > > > > > > > Subject: Re: How to support stemming and case
> >> folding
> >> >> > for
> >> >> > > > > > english
> >> >> > > > > > > > > > content
> >> >> > > > > > > > > > > > mixed with non-english content?
> >> >> > > > > > > > > > > >
> >> >> > > > > > > > > > > > KK, ok, so you only really want to stem the
> >> english.
> >> >> > This
> >> >> > > > is
> >> >> > > > > > > good.
> >> >> > > > > > > > > > > >
> >> >> > > > > > > > > > > > Is it possible for you to consider using solr?
> >> solr's
> >> >> > > > default
> >> >> > > > > > > > > analyzer
> >> >> > > > > > > > > > > for
> >> >> > > > > > > > > > > > type 'text' will be good for your case. it will
> do
> >> the
> >> >> > > > > > following
> >> >> > > > > > > > > > > > 1. tokenize on whitespace
> >> >> > > > > > > > > > > > 2. handle both indian language and english
> >> punctuation
> >> >> > > > > > > > > > > > 3. lowercase the english.
> >> >> > > > > > > > > > > > 4. stem the english.
> >> >> > > > > > > > > > > >
> >> >> > > > > > > > > > > > try a nightly build,
> >> >> > > > > > > > > > >
> >> http://people.apache.org/builds/lucene/solr/nightly/
> >> >> > > > > > > > > > > >
> >> >> > > > > > > > > > > > On Thu, Jun 4, 2009 at 1:12 AM, KK <
> >> >> > > > > dioxide.software@gmail.com
> >> >> > > > > > >
> >> >> > > > > > > > > wrote:
> >> >> > > > > > > > > > > >
> >> >> > > > > > > > > > > > > Muir, thanks for your response.
> >> >> > > > > > > > > > > > > I'm indexing indian language web pages which
> has
> >> got
> >> >> > > > > descent
> >> >> > > > > > > > amount
> >> >> > > > > > > > > > of
> >> >> > > > > > > > > > > > > english content mixed with therein. For the
> time
> >> >> > being
> >> >> > > > I'm
> >> >> > > > > > not
> >> >> > > > > > > > > going
> >> >> > > > > > > > > > to
> >> >> > > > > > > > > > > > use
> >> >> > > > > > > > > > > > > any stemmers as we don't have standard
> stemmers
> >> for
> >> >> > > > indian
> >> >> > > > > > > > > languages
> >> >> > > > > > > > > > .
> >> >> > > > > > > > > > > > So
> >> >> > > > > > > > > > > > > what I want to do is like this,
> >> >> > > > > > > > > > > > > Say I've a web page having hindi content with
> 5%
> >> >> > > english
> >> >> > > > > > > content.
> >> >> > > > > > > > > > Then
> >> >> > > > > > > > > > > > for
> >> >> > > > > > > > > > > > > hindi I want to use the basic white space
> >> analyzer as
> >> >> > > we
> >> >> > > > > dont
> >> >> > > > > > > > have
> >> >> > > > > > > > > > > > stemmers
> >> >> > > > > > > > > > > > > for this as I mentioned earlier and whereever
> >> english
> >> >> > > > > appears
> >> >> > > > > > I
> >> >> > > > > > > > > want
> >> >> > > > > > > > > > > > them
> >> >> > > > > > > > > > > > > to
> >> >> > > > > > > > > > > > > be stemmed tokenized etc[the standard process
> >> used
> >> >> > for
> >> >> > > > > > english
> >> >> > > > > > > > > > > content].
> >> >> > > > > > > > > > > > As
> >> >> > > > > > > > > > > > > of now I'm using whitespace analyzer for the
> >> full
> >> >> > > content
> >> >> > > > > > which
> >> >> > > > > > > > > > doesnot
> >> >> > > > > > > > > > > > > support case folding, stemming etc for teh
> >> content.
> >> >> > So
> >> >> > > if
> >> >> > > > > > there
> >> >> > > > > > > > is
> >> >> > > > > > > > > an
> >> >> > > > > > > > > > > > > english word say "Detection" indexed as such
> >> then
> >> >> > > > searching
> >> >> > > > > > for
> >> >> > > > > > > > > > > > detection
> >> >> > > > > > > > > > > > > or
> >> >> > > > > > > > > > > > > detect is not giving any results, which is
> the
> >> >> > expected
> >> >> > > > > > > behavior,
> >> >> > > > > > > > > but
> >> >> > > > > > > > > > I
> >> >> > > > > > > > > > > > > want
> >> >> > > > > > > > > > > > > this kind of queries to give results.
> >> >> > > > > > > > > > > > > I hope I made it clear. Let me know any ideas
> on
> >> >> > doing
> >> >> > > > the
> >> >> > > > > > > same.
> >> >> > > > > > > > > And
> >> >> > > > > > > > > > > one
> >> >> > > > > > > > > > > > > more thing, I'm storing the full webpage
> content
> >> >> > under
> >> >> > > a
> >> >> > > > > > single
> >> >> > > > > > > > > > field,
> >> >> > > > > > > > > > > I
> >> >> > > > > > > > > > > > > hope this will not make any difference,
> right?
> >> >> > > > > > > > > > > > > It seems I've to use language identifiers,
> but
> >> do we
> >> >> > > > really
> >> >> > > > > > > need
> >> >> > > > > > > > > > that?
> >> >> > > > > > > > > > > > > Because we've only non-english content mixed
> >> with
> >> >> > > > > english[and
> >> >> > > > > > > not
> >> >> > > > > > > > > > > french
> >> >> > > > > > > > > > > > or
> >> >> > > > > > > > > > > > > russian etc].
> >> >> > > > > > > > > > > > >
> >> >> > > > > > > > > > > > > What is the best way of approaching the
> problem?
> >> Any
> >> >> > > > > > thoughts!
> >> >> > > > > > > > > > > > >
> >> >> > > > > > > > > > > > > Thanks,
> >> >> > > > > > > > > > > > > KK.
> >> >> > > > > > > > > > > > >
> >> >> > > > > > > > > > > > > On Wed, Jun 3, 2009 at 9:42 PM, Robert Muir <
> >> >> > > > > > rcmuir@gmail.com>
> >> >> > > > > > > > > > wrote:
> >> >> > > > > > > > > > > > >
> >> >> > > > > > > > > > > > > > KK, is all of your latin script text
> actually
> >> >> > > english?
> >> >> > > > Is
> >> >> > > > > > > there
> >> >> > > > > > > > > > stuff
> >> >> > > > > > > > > > > > > like
> >> >> > > > > > > > > > > > > > german or french mixed in?
> >> >> > > > > > > > > > > > > >
> >> >> > > > > > > > > > > > > > And for your non-english content (your
> >> examples
> >> >> > have
> >> >> > > > been
> >> >> > > > > > > > indian
> >> >> > > > > > > > > > > > writing
> >> >> > > > > > > > > > > > > > systems), is it generally true that if you
> had
> >> >> > > > > devanagari,
> >> >> > > > > > > you
> >> >> > > > > > > > > can
> >> >> > > > > > > > > > > > assume
> >> >> > > > > > > > > > > > > > its hindi? or is there stuff like marathi
> >> mixed in?
> >> >> > > > > > > > > > > > > >
> >> >> > > > > > > > > > > > > > Reason I say this is to invoke the right
> >> stemmers,
> >> >> > > you
> >> >> > > > > > really
> >> >> > > > > > > > > need
> >> >> > > > > > > > > > > > some
> >> >> > > > > > > > > > > > > > language detection, but perhaps in your
> case
> >> you
> >> >> > can
> >> >> > > > > cheat
> >> >> > > > > > > and
> >> >> > > > > > > > > > detect
> >> >> > > > > > > > > > > > > this
> >> >> > > > > > > > > > > > > > based on scripts...
> >> >> > > > > > > > > > > > > >
> >> >> > > > > > > > > > > > > > Thanks,
> >> >> > > > > > > > > > > > > > Robert
> >> >> > > > > > > > > > > > > >
> >> >> > > > > > > > > > > > > >
> >> >> > > > > > > > > > > > > > On Wed, Jun 3, 2009 at 10:15 AM, KK <
> >> >> > > > > > > > dioxide.software@gmail.com>
> >> >> > > > > > > > > > > > wrote:
> >> >> > > > > > > > > > > > > >
> >> >> > > > > > > > > > > > > > > Hi All,
> >> >> > > > > > > > > > > > > > > I'm indexing some non-english content.
> But
> >> the
> >> >> > page
> >> >> > > > > also
> >> >> > > > > > > > > contains
> >> >> > > > > > > > > > > > > english
> >> >> > > > > > > > > > > > > > > content. As of now I'm using
> >> WhitespaceAnalyzer
> >> >> > for
> >> >> > > > all
> >> >> > > > > > > > content
> >> >> > > > > > > > > > and
> >> >> > > > > > > > > > > > I'm
> >> >> > > > > > > > > > > > > > > storing the full webpage content under a
> >> single
> >> >> > > > filed.
> >> >> > > > > > Now
> >> >> > > > > > > we
> >> >> > > > > > > > > > > > require
> >> >> > > > > > > > > > > > > to
> >> >> > > > > > > > > > > > > > > support case folding and stemmming for
> the
> >> >> > english
> >> >> > > > > > content
> >> >> > > > > > > > > > > > intermingled
> >> >> > > > > > > > > > > > > > > with
> >> >> > > > > > > > > > > > > > > non-english content. I must metion that
> we
> >> dont
> >> >> > > have
> >> >> > > > > > > stemming
> >> >> > > > > > > > > and
> >> >> > > > > > > > > > > > case
> >> >> > > > > > > > > > > > > > > folding for these non-english content.
> I'm
> >> stuck
> >> >> > > with
> >> >> > > > > > this.
> >> >> > > > > > > > > Some
> >> >> > > > > > > > > > > one
> >> >> > > > > > > > > > > > do
> >> >> > > > > > > > > > > > > > let
> >> >> > > > > > > > > > > > > > > me know how to proceed for fixing this
> >> issue.
> >> >> > > > > > > > > > > > > > >
> >> >> > > > > > > > > > > > > > > Thanks,
> >> >> > > > > > > > > > > > > > > KK.
> >> >> > > > > > > > > > > > > > >
> >> >> > > > > > > > > > > > > >
> >> >> > > > > > > > > > > > > >
> >> >> > > > > > > > > > > > > >
> >> >> > > > > > > > > > > > > > --
> >> >> > > > > > > > > > > > > > Robert Muir
> >> >> > > > > > > > > > > > > > rcmuir@gmail.com
> >> >> > > > > > > > > > > > > >
> >> >> > > > > > > > > > > > >
> >> >> > > > > > > > > > > >
> >> >> > > > > > > > > > > >
> >> >> > > > > > > > > > > >
> >> >> > > > > > > > > > > > --
> >> >> > > > > > > > > > > > Robert Muir
> >> >> > > > > > > > > > > > rcmuir@gmail.com
> >> >> > > > > > > > > > >
> >> >> > > > > > > > > > >
> >> >> > > > > > > > > > >
> >> >> > > > > > >
> >> >> > >
> >> ---------------------------------------------------------------------
> >> >> > > > > > > > > > > To unsubscribe, e-mail:
> >> >> > > > > java-user-unsubscribe@lucene.apache.org
> >> >> > > > > > > > > > > For additional commands, e-mail:
> >> >> > > > > > java-user-help@lucene.apache.org
> >> >> > > > > > > > > > >
> >> >> > > > > > > > > > >
> >> >> > > > > > > > > >
> >> >> > > > > > > > > >
> >> >> > > > > > > > > > --
> >> >> > > > > > > > > > Robert Muir
> >> >> > > > > > > > > > rcmuir@gmail.com
> >> >> > > > > > > > > >
> >> >> > > > > > > > >
> >> >> > > > > > > >
> >> >> > > > > > > >
> >> >> > > > > > > >
> >> >> > > > > > > > --
> >> >> > > > > > > > Robert Muir
> >> >> > > > > > > > rcmuir@gmail.com
> >> >> > > > > > > >
> >> >> > > > > > >
> >> >> > > > > >
> >> >> > > > > >
> >> >> > > > > >
> >> >> > > > > > --
> >> >> > > > > > Robert Muir
> >> >> > > > > > rcmuir@gmail.com
> >> >> > > > > >
> >> >> > > > >
> >> >> > > >
> >> >> > > >
> >> >> > > >
> >> >> > > > --
> >> >> > > > Robert Muir
> >> >> > > > rcmuir@gmail.com
> >> >> > > >
> >> >> > >
> >> >> >
> >> >> >
> >> >> >
> >> >> > --
> >> >> > Robert Muir
> >> >> > rcmuir@gmail.com
> >> >> >
> >> >
> >> >
> >> >
> >> > --
> >> > Robert Muir
> >> > rcmuir@gmail.com
> >> >
> >>
> >>
> >>
> >> --
> >> Robert Muir
> >> rcmuir@gmail.com
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> >> For additional commands, e-mail: java-user-help@lucene.apache.org
> >>
> >>
> >
>
>
>
> --
> Robert Muir
> rcmuir@gmail.com
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message