lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From KK <dioxide.softw...@gmail.com>
Subject Re: How to support stemming and case folding for english content mixed with non-english content?
Date Sat, 06 Jun 2009 13:42:13 GMT
Robert, I tried to use worddelimiterfilter from solr-nightly by putting it
in my working directory for this project, I set the parameters as you told
me. I must accept that its splitting words around those chars[like . @
etc]but alongwith that its messing with other non-english/unicode contents
and thats creating nuisance. I dont want worddelimiterfilter to fiddle
around with my non-english content.
This is what I'm doing,
/**
 * Analyzer for Indian language.
 */
public class IndicAnalyzer extends Analyzer {
  public TokenStream tokenStream(String fieldName, Reader reader) {
    TokenStream ts = new WhitespaceTokenizer(reader);
    ts = new WordDelimiterFilter(ts, 1, 1, 1, 1, 0);
    ts = new StopFilter(ts, StopAnalyzer.ENGLISH_STOP_WORDS);
    ts = new LowerCaseFilter(ts);
    ts = new PorterStemFilter(ts);
    return ts;
  }
}

I've to use the deprecated API for setting 5 values, thats fine, but somehow
its messing with unicode content. How to get rid of that? Any thougts? It
seems setting those values is some proper way might fix the problem, I'm not
sure, though.

Thanks,
KK.


On Fri, Jun 5, 2009 at 7:37 PM, Robert Muir <rcmuir@gmail.com> wrote:

> kk an easier solution to your first problem is to use
> worddelimiterfilterfactory if possible... you can get an instance of
> worddelimiter filter from that.
>
> thanks,
> robert
>
> On Fri, Jun 5, 2009 at 10:06 AM, Robert Muir<rcmuir@gmail.com> wrote:
> > kk as for your first issue, that WordDelimiterFilter is package
> > protected, one option is to make a copy of the code and change the
> > class declaration to public.
> > the other option is to put your entire analyzer in
> > 'org.apache.solr.analysis' package so that you can access it...
> >
> > for the 2nd issue, yes you need to supply some options to it. the
> > default options solr applies to type 'text' seemed to work well for me
> > with indic:
> >
> > {splitOnCaseChange=1, generateNumberParts=1, catenateWords=1,
> > generateWordParts=1, catenateAll=0, catenateNumbers=1}
> >
> > On Fri, Jun 5, 2009 at 9:12 AM, KK <dioxide.software@gmail.com> wrote:
> >>
> >> Thanks Robert. There is one problem though, I'm able to plugin the word
> >> delimiter filter from solr-nightly jar file. When I tried to do
> something
> >> like,
> >>  TokenStream ts = new WhitespaceTokenizer(reader);
> >>   ts = new WordDelimiterFilter(ts);
> >>   ts = new PorterStemmerFilter(ts);
> >>   ...rest as in the last mail...
> >>
> >> It gave me an error saying that
> >>
> >> org.apache.solr.analysis.WordDelimiterFilter is not public in
> >> org.apache.solr.analysis; cannot be accessed from outside package
> >> import org.apache.solr.analysis.WordDelimiterFilter;
> >>                               ^
> >> solrSearch/IndicAnalyzer.java:38: cannot find symbol
> >> symbol  : class WordDelimiterFilter
> >> location: class solrSearch.IndicAnalyzer
> >>    ts = new WordDelimiterFilter(ts);
> >>             ^
> >> 2 errors
> >>
> >> Then i tried to see the code for worddelimitefiter from solrnightly src
> and
> >> found that there are many deprecated constructors though they require a
> lot
> >> of parameters alongwith tokenstream. I went through the solr wiki for
> >> worddelimiterfilterfactory here,
> >>
> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#head-1c9b83870ca7890cd73b193cefed83c283339089
> >> and say that there also its specified that we've to mention the
> parameters
> >> and both are different for indexing and querying.
> >> I'm kind of stuck here, how do I make use of worddelimiterfilter in my
> >> custom analyzer, I've to use it anyway.
> >> In my code I've to make use of worddelimiterfilter and not
> >> worddelimiterfilterfactory, right? I don't know whats the use of the
> other
> >> one. Anyway can you guide me getting rid of the above error. And yes
> I'll
> >> change the order of applying the filters as you said.
> >>
> >> Thanks,
> >> KK.
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >> On Fri, Jun 5, 2009 at 5:48 PM, Robert Muir <rcmuir@gmail.com> wrote:
> >>
> >> > KK, you got the right idea.
> >> >
> >> > though I think you might want to change the order, move the stopfilter
> >> > before the porter stem filter... otherwise it might not work
> correctly.
> >> >
> >> > On Fri, Jun 5, 2009 at 8:05 AM, KK <dioxide.software@gmail.com>
> wrote:
> >> >
> >> > > Thanks Robert. This is exactly what I did and  its working but
> delimiter
> >> > is
> >> > > missing I'm going to add that from solr-nightly.jar
> >> > >
> >> > > /**
> >> > >  * Analyzer for Indian language.
> >> > >  */
> >> > > public class IndicAnalyzer extends Analyzer {
> >> > >  public TokenStream tokenStream(String fieldName, Reader reader) {
> >> > >     TokenStream ts = new WhitespaceTokenizer(reader);
> >> > >    ts = new PorterStemFilter(ts);
> >> > >    ts = new LowerCaseFilter(ts);
> >> > >    ts = new StopFilter(ts, StopAnalyzer.ENGLISH_STOP_WORDS);
> >> > >    return ts;
> >> > >  }
> >> > > }
> >> > >
> >> > > Its able to do stemming/case-folding and supports search for both
> english
> >> > > and indic texts. let me try out the delimiter. Will update you on
> that.
> >> > >
> >> > > Thanks a lot.
> >> > > KK
> >> > >
> >> > > On Fri, Jun 5, 2009 at 5:30 PM, Robert Muir <rcmuir@gmail.com>
> wrote:
> >> > >
> >> > > > i think you are on the right track... once you build your
> analyzer, put
> >> > > it
> >> > > > in your classpath and play around with it in luke and see if
it
> does
> >> > what
> >> > > > you want.
> >> > > >
> >> > > > On Fri, Jun 5, 2009 at 3:19 AM, KK <dioxide.software@gmail.com>
> wrote:
> >> > > >
> >> > > > > Hi Robert,
> >> > > > > This is what I copied from ThaiAnalyzer @ lucene contrib
> >> > > > >
> >> > > > > public class ThaiAnalyzer extends Analyzer {
> >> > > > >  public TokenStream tokenStream(String fieldName, Reader
reader)
> {
> >> > > > >      TokenStream ts = new StandardTokenizer(reader);
> >> > > > >    ts = new StandardFilter(ts);
> >> > > > >    ts = new ThaiWordFilter(ts);
> >> > > > >    ts = new StopFilter(ts, StopAnalyzer.ENGLISH_STOP_WORDS);
> >> > > > >    return ts;
> >> > > > >  }
> >> > > > > }
> >> > > > >
> >> > > > > Now as you said, I've to use whitespacetokenizer
> >> > > > > withworddelimitefilter[solr
> >> > > > > nightly.jar] stop wordremoval, porter stemmer etc , so it
is
> >> > something
> >> > > > like
> >> > > > > this,
> >> > > > > public class IndicAnalyzer extends Analyzer {
> >> > > > >  public TokenStream tokenStream(String fieldName, Reader
reader)
> {
> >> > > > >   TokenStream ts = new WhiteSpaceTokenizer(reader);
> >> > > > >   ts = new WordDelimiterFilter(ts);
> >> > > > >   ts = new LowerCaseFilter(ts);
> >> > > > >   ts = new StopFilter(ts, StopAnalyzer.ENGLISH_STOP_WORDS)
  //
> >> > english
> >> > > > > stop filter, is this the default one?
> >> > > > >   ts = new PorterFilter(ts);
> >> > > > >   return ts;
> >> > > > >  }
> >> > > > > }
> >> > > > >
> >> > > > > Does this sound OK? I think it will do the job...let me
try it
> out..
> >> > > > > I dont need custom filter as per my requirement, at least
not
> for
> >> > these
> >> > > > > basic things I'm doing? I think so...
> >> > > > >
> >> > > > > Thanks,
> >> > > > > KK.
> >> > > > >
> >> > > > >
> >> > > > > On Thu, Jun 4, 2009 at 6:36 PM, Robert Muir <rcmuir@gmail.com>
> >> > wrote:
> >> > > > >
> >> > > > > > KK well you can always get some good examples from
the lucene
> >> > contrib
> >> > > > > > codebase.
> >> > > > > > For example, look at the DutchAnalyzer, especially:
> >> > > > > >
> >> > > > > > TokenStream tokenStream(String fieldName, Reader reader)
> >> > > > > >
> >> > > > > > See how it combines a specified tokenizer with various
> filters?
> >> > this
> >> > > is
> >> > > > > > what
> >> > > > > > you want to do, except of course you want to use different
> >> > tokenizer
> >> > > > and
> >> > > > > > filters.
> >> > > > > >
> >> > > > > > On Thu, Jun 4, 2009 at 8:53 AM, KK <
> dioxide.software@gmail.com>
> >> > > wrote:
> >> > > > > >
> >> > > > > > > Thanks Muir.
> >> > > > > > > Thanks for letting me know that I dont need language
> identifiers.
> >> > > > > > >  I'll have a look and will try to write the analyzer.
For my
> case
> >> > I
> >> > > > > think
> >> > > > > > > it
> >> > > > > > > wont be that difficult.
> >> > > > > > > BTW, can you point me to some sample codes/tutorials
writing
> >> > custom
> >> > > > > > > analyzers. I could not find something in LIA2ndEdn.
Is
> something
> >> > > > htere?
> >> > > > > > do
> >> > > > > > > let me know.
> >> > > > > > >
> >> > > > > > > Thanks,
> >> > > > > > > KK.
> >> > > > > > >
> >> > > > > > >
> >> > > > > > >
> >> > > > > > > On Thu, Jun 4, 2009 at 6:19 PM, Robert Muir <
> rcmuir@gmail.com>
> >> > > > wrote:
> >> > > > > > >
> >> > > > > > > > KK, for your case, you don't really need
to go to the
> effort of
> >> > > > > > detecting
> >> > > > > > > > whether fragments are english or not.
> >> > > > > > > > Because the English stemmers in lucene will
not modify
> your
> >> > Indic
> >> > > > > text,
> >> > > > > > > and
> >> > > > > > > > neither will the LowerCaseFilter.
> >> > > > > > > >
> >> > > > > > > > what you want to do is create a custom analyzer
that works
> like
> >> > > > this
> >> > > > > > > >
> >> > > > > > > > -WhitespaceTokenizer with WordDelimiterFilter
[from Solr
> >> > nightly
> >> > > > > jar],
> >> > > > > > > > LowerCaseFilter, StopFilter, and PorterStemFilter-
> >> > > > > > > >
> >> > > > > > > > Thanks,
> >> > > > > > > > Robert
> >> > > > > > > >
> >> > > > > > > > On Thu, Jun 4, 2009 at 8:28 AM, KK <
> dioxide.software@gmail.com
> >> > >
> >> > > > > wrote:
> >> > > > > > > >
> >> > > > > > > > > Thank you all.
> >> > > > > > > > > To be frank I was using Solr in the
begining half a
> month
> >> > ago.
> >> > > > The
> >> > > > > > > > > problem[rather bug] with solr was creation
of new index
> on
> >> > the
> >> > > > fly.
> >> > > > > > > > Though
> >> > > > > > > > > they have a restful method for teh same,
but it was not
> >> > > working.
> >> > > > If
> >> > > > > I
> >> > > > > > > > > remember properly one of Solr commiter
"Noble Paul"[I
> dont
> >> > know
> >> > > > his
> >> > > > > > > real
> >> > > > > > > > > name] was trying to help me. I tried
many nightly builds
> and
> >> > > > > spending
> >> > > > > > a
> >> > > > > > > > > couple of days stuck at that made me
think of lucene and
> I
> >> > > > switched
> >> > > > > > to
> >> > > > > > > > it.
> >> > > > > > > > > Now after working with lucene which
gives you full
> control of
> >> > > > > > > everything
> >> > > > > > > > I
> >> > > > > > > > > don't want to switch to Solr.[LOL, to
me Solr:Lucene is
> >> > similar
> >> > > > to
> >> > > > > > > > > Window$:Linux, its my view only, though].
Coming back to
> the
> >> > > > point
> >> > > > > as
> >> > > > > > > Uwe
> >> > > > > > > > > mentioned that we can do the same thing
in lucene as
> well,
> >> > what
> >> > > > is
> >> > > > > > > > > available
> >> > > > > > > > > in Solr, Solr is based on Lucene only,
right?
> >> > > > > > > > > I request Uwe to give me some more ideas
on using the
> >> > analyzers
> >> > > > > from
> >> > > > > > > solr
> >> > > > > > > > > that will do the job for me, handling
a mix of both
> english
> >> > and
> >> > > > > > > > non-english
> >> > > > > > > > > content.
> >> > > > > > > > > Muir, can you give me a bit detail description
of how to
> use
> >> > > the
> >> > > > > > > > > WordDelimiteFilter to do my job.
> >> > > > > > > > > On a side note, I was thingking of writing
a simple
> analyzer
> >> > > that
> >> > > > > > will
> >> > > > > > > do
> >> > > > > > > > > the following,
> >> > > > > > > > > #. If the webpage fragment is non-english[for
me its
> some
> >> > > indian
> >> > > > > > > > language]
> >> > > > > > > > > then index them as such, no stemming/
stop word removal
> to
> >> > > begin
> >> > > > > > with.
> >> > > > > > > As
> >> > > > > > > > I
> >> > > > > > > > > know its in UCN unicode something like
> >> > > > > \u0021\u0012\u34ae\u0031[just
> >> > > > > > a
> >> > > > > > > > > sample]
> >> > > > > > > > > # If the fragment is english then apply
standard
> anlyzing
> >> > > process
> >> > > > > for
> >> > > > > > > > > english content. I've not thought of
quering in the same
> way
> >> > as
> >> > > > of
> >> > > > > > now
> >> > > > > > > > i.e
> >> > > > > > > > > mix of non-english and engish words.
> >> > > > > > > > > Now to get all this,
> >> > > > > > > > >  #1. I need some sort of way which will
let me know if
> the
> >> > > > content
> >> > > > > is
> >> > > > > > > > > english or not. If not english just
add the tokens to
> the
> >> > > > document.
> >> > > > > > Do
> >> > > > > > > we
> >> > > > > > > > > really need language identifiers, as
i dont have any
> other
> >> > > > content
> >> > > > > > that
> >> > > > > > > > > uses
> >> > > > > > > > > the same script as english other than
those \u1234
> things for
> >> > > my
> >> > > > > > indian
> >> > > > > > > > > language content. Any smart hack/trick
for the same?
> >> > > > > > > > >  #2. If the its english apply all normal
process and add
> the
> >> > > > > stemmed
> >> > > > > > > > token
> >> > > > > > > > > to document.
> >> > > > > > > > > For all this I was thinking of iterating
earch word of
> the
> >> > web
> >> > > > page
> >> > > > > > and
> >> > > > > > > > > apply the above procedure. And finallyadd
 the newly
> created
> >> > > > > document
> >> > > > > > > to
> >> > > > > > > > > the
> >> > > > > > > > > index.
> >> > > > > > > > >
> >> > > > > > > > > I would like some one to guide me in
this direction. I'm
> >> > pretty
> >> > > > > > people
> >> > > > > > > > must
> >> > > > > > > > > have done similar/same thing earlier,
I request them to
> guide
> >> > > me/
> >> > > > > > point
> >> > > > > > > > me
> >> > > > > > > > > to some tutorials for the same.
> >> > > > > > > > > Else help me out writing a custom analyzer
only if thats
> not
> >> > > > going
> >> > > > > to
> >> > > > > > > be
> >> > > > > > > > > too
> >> > > > > > > > > complex. LOL, I'm a new user to lucene
and know basics
> of
> >> > Java
> >> > > > > > coding.
> >> > > > > > > > > Thank you very much.
> >> > > > > > > > >
> >> > > > > > > > > --KK.
> >> > > > > > > > >
> >> > > > > > > > >
> >> > > > > > > > >
> >> > > > > > > > > On Thu, Jun 4, 2009 at 5:30 PM, Robert
Muir <
> >> > rcmuir@gmail.com>
> >> > > > > > wrote:
> >> > > > > > > > >
> >> > > > > > > > > > yes this is true. for starters
KK, might be good to
> startup
> >> > > > solr
> >> > > > > > and
> >> > > > > > > > look
> >> > > > > > > > > > at
> >> > > > > > > > > >
> http://localhost:8983/solr/admin/analysis.jsp?highlight=on
> >> > > > > > > > > >
> >> > > > > > > > > > if you want to stick with lucene,
the
> WordDelimiterFilter
> >> > is
> >> > > > the
> >> > > > > > > piece
> >> > > > > > > > > you
> >> > > > > > > > > > will want for your text, mainly
for punctuation but
> also
> >> > for
> >> > > > > format
> >> > > > > > > > > > characters such as ZWJ/ZWNJ.
> >> > > > > > > > > >
> >> > > > > > > > > > On Thu, Jun 4, 2009 at 7:51 AM,
Uwe Schindler <
> >> > > uwe@thetaphi.de
> >> > > > >
> >> > > > > > > wrote:
> >> > > > > > > > > >
> >> > > > > > > > > > > You can also re-use the solr
analyzers, as far as I
> found
> >> > > > out.
> >> > > > > > > There
> >> > > > > > > > is
> >> > > > > > > > > > an
> >> > > > > > > > > > > issue in jIRA/discussion on
java-dev to merge them.
> >> > > > > > > > > > >
> >> > > > > > > > > > > -----
> >> > > > > > > > > > > Uwe Schindler
> >> > > > > > > > > > > H.-H.-Meier-Allee 63, D-28213
Bremen
> >> > > > > > > > > > > http://www.thetaphi.de
> >> > > > > > > > > > > eMail: uwe@thetaphi.de
> >> > > > > > > > > > >
> >> > > > > > > > > > >
> >> > > > > > > > > > > > -----Original Message-----
> >> > > > > > > > > > > > From: Robert Muir [mailto:rcmuir@gmail.com]
> >> > > > > > > > > > > > Sent: Thursday, June
04, 2009 1:18 PM
> >> > > > > > > > > > > > To: java-user@lucene.apache.org
> >> > > > > > > > > > > > Subject: Re: How to support
stemming and case
> folding
> >> > for
> >> > > > > > english
> >> > > > > > > > > > content
> >> > > > > > > > > > > > mixed with non-english
content?
> >> > > > > > > > > > > >
> >> > > > > > > > > > > > KK, ok, so you only really
want to stem the
> english.
> >> > This
> >> > > > is
> >> > > > > > > good.
> >> > > > > > > > > > > >
> >> > > > > > > > > > > > Is it possible for you
to consider using solr?
> solr's
> >> > > > default
> >> > > > > > > > > analyzer
> >> > > > > > > > > > > for
> >> > > > > > > > > > > > type 'text' will be good
for your case. it will do
> the
> >> > > > > > following
> >> > > > > > > > > > > > 1. tokenize on whitespace
> >> > > > > > > > > > > > 2. handle both indian
language and english
> punctuation
> >> > > > > > > > > > > > 3. lowercase the english.
> >> > > > > > > > > > > > 4. stem the english.
> >> > > > > > > > > > > >
> >> > > > > > > > > > > > try a nightly build,
> >> > > > > > > > > > >
> http://people.apache.org/builds/lucene/solr/nightly/
> >> > > > > > > > > > > >
> >> > > > > > > > > > > > On Thu, Jun 4, 2009 at
1:12 AM, KK <
> >> > > > > dioxide.software@gmail.com
> >> > > > > > >
> >> > > > > > > > > wrote:
> >> > > > > > > > > > > >
> >> > > > > > > > > > > > > Muir, thanks for
your response.
> >> > > > > > > > > > > > > I'm indexing indian
language web pages which has
> got
> >> > > > > descent
> >> > > > > > > > amount
> >> > > > > > > > > > of
> >> > > > > > > > > > > > > english content
mixed with therein. For the time
> >> > being
> >> > > > I'm
> >> > > > > > not
> >> > > > > > > > > going
> >> > > > > > > > > > to
> >> > > > > > > > > > > > use
> >> > > > > > > > > > > > > any stemmers as
we don't have standard stemmers
> for
> >> > > > indian
> >> > > > > > > > > languages
> >> > > > > > > > > > .
> >> > > > > > > > > > > > So
> >> > > > > > > > > > > > > what I want to do
is like this,
> >> > > > > > > > > > > > > Say I've a web page
having hindi content with 5%
> >> > > english
> >> > > > > > > content.
> >> > > > > > > > > > Then
> >> > > > > > > > > > > > for
> >> > > > > > > > > > > > > hindi I want to
use the basic white space
> analyzer as
> >> > > we
> >> > > > > dont
> >> > > > > > > > have
> >> > > > > > > > > > > > stemmers
> >> > > > > > > > > > > > > for this as I mentioned
earlier and whereever
> english
> >> > > > > appears
> >> > > > > > I
> >> > > > > > > > > want
> >> > > > > > > > > > > > them
> >> > > > > > > > > > > > > to
> >> > > > > > > > > > > > > be stemmed tokenized
etc[the standard process
> used
> >> > for
> >> > > > > > english
> >> > > > > > > > > > > content].
> >> > > > > > > > > > > > As
> >> > > > > > > > > > > > > of now I'm using
whitespace analyzer for the
> full
> >> > > content
> >> > > > > > which
> >> > > > > > > > > > doesnot
> >> > > > > > > > > > > > > support case folding,
stemming etc for teh
> content.
> >> > So
> >> > > if
> >> > > > > > there
> >> > > > > > > > is
> >> > > > > > > > > an
> >> > > > > > > > > > > > > english word say
"Detection" indexed as such
> then
> >> > > > searching
> >> > > > > > for
> >> > > > > > > > > > > > detection
> >> > > > > > > > > > > > > or
> >> > > > > > > > > > > > > detect is not giving
any results, which is the
> >> > expected
> >> > > > > > > behavior,
> >> > > > > > > > > but
> >> > > > > > > > > > I
> >> > > > > > > > > > > > > want
> >> > > > > > > > > > > > > this kind of queries
to give results.
> >> > > > > > > > > > > > > I hope I made it
clear. Let me know any ideas on
> >> > doing
> >> > > > the
> >> > > > > > > same.
> >> > > > > > > > > And
> >> > > > > > > > > > > one
> >> > > > > > > > > > > > > more thing, I'm
storing the full webpage content
> >> > under
> >> > > a
> >> > > > > > single
> >> > > > > > > > > > field,
> >> > > > > > > > > > > I
> >> > > > > > > > > > > > > hope this will not
make any difference, right?
> >> > > > > > > > > > > > > It seems I've to
use language identifiers, but
> do we
> >> > > > really
> >> > > > > > > need
> >> > > > > > > > > > that?
> >> > > > > > > > > > > > > Because we've only
non-english content mixed
> with
> >> > > > > english[and
> >> > > > > > > not
> >> > > > > > > > > > > french
> >> > > > > > > > > > > > or
> >> > > > > > > > > > > > > russian etc].
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > > > What is the best
way of approaching the problem?
> Any
> >> > > > > > thoughts!
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > > > Thanks,
> >> > > > > > > > > > > > > KK.
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > > > On Wed, Jun 3, 2009
at 9:42 PM, Robert Muir <
> >> > > > > > rcmuir@gmail.com>
> >> > > > > > > > > > wrote:
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > > > > KK, is all
of your latin script text actually
> >> > > english?
> >> > > > Is
> >> > > > > > > there
> >> > > > > > > > > > stuff
> >> > > > > > > > > > > > > like
> >> > > > > > > > > > > > > > german or french
mixed in?
> >> > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > And for your
non-english content (your
> examples
> >> > have
> >> > > > been
> >> > > > > > > > indian
> >> > > > > > > > > > > > writing
> >> > > > > > > > > > > > > > systems), is
it generally true that if you had
> >> > > > > devanagari,
> >> > > > > > > you
> >> > > > > > > > > can
> >> > > > > > > > > > > > assume
> >> > > > > > > > > > > > > > its hindi?
or is there stuff like marathi
> mixed in?
> >> > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > Reason I say
this is to invoke the right
> stemmers,
> >> > > you
> >> > > > > > really
> >> > > > > > > > > need
> >> > > > > > > > > > > > some
> >> > > > > > > > > > > > > > language detection,
but perhaps in your case
> you
> >> > can
> >> > > > > cheat
> >> > > > > > > and
> >> > > > > > > > > > detect
> >> > > > > > > > > > > > > this
> >> > > > > > > > > > > > > > based on scripts...
> >> > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > Thanks,
> >> > > > > > > > > > > > > > Robert
> >> > > > > > > > > > > > > >
> >> > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > On Wed, Jun
3, 2009 at 10:15 AM, KK <
> >> > > > > > > > dioxide.software@gmail.com>
> >> > > > > > > > > > > > wrote:
> >> > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > Hi All,
> >> > > > > > > > > > > > > > > I'm indexing
some non-english content. But
> the
> >> > page
> >> > > > > also
> >> > > > > > > > > contains
> >> > > > > > > > > > > > > english
> >> > > > > > > > > > > > > > > content.
As of now I'm using
> WhitespaceAnalyzer
> >> > for
> >> > > > all
> >> > > > > > > > content
> >> > > > > > > > > > and
> >> > > > > > > > > > > > I'm
> >> > > > > > > > > > > > > > > storing
the full webpage content under a
> single
> >> > > > filed.
> >> > > > > > Now
> >> > > > > > > we
> >> > > > > > > > > > > > require
> >> > > > > > > > > > > > > to
> >> > > > > > > > > > > > > > > support
case folding and stemmming for the
> >> > english
> >> > > > > > content
> >> > > > > > > > > > > > intermingled
> >> > > > > > > > > > > > > > > with
> >> > > > > > > > > > > > > > > non-english
content. I must metion that we
> dont
> >> > > have
> >> > > > > > > stemming
> >> > > > > > > > > and
> >> > > > > > > > > > > > case
> >> > > > > > > > > > > > > > > folding
for these non-english content. I'm
> stuck
> >> > > with
> >> > > > > > this.
> >> > > > > > > > > Some
> >> > > > > > > > > > > one
> >> > > > > > > > > > > > do
> >> > > > > > > > > > > > > > let
> >> > > > > > > > > > > > > > > me know
how to proceed for fixing this
> issue.
> >> > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > Thanks,
> >> > > > > > > > > > > > > > > KK.
> >> > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > >
> >> > > > > > > > > > > > > >
> >> > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > --
> >> > > > > > > > > > > > > > Robert Muir
> >> > > > > > > > > > > > > > rcmuir@gmail.com
> >> > > > > > > > > > > > > >
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > >
> >> > > > > > > > > > > >
> >> > > > > > > > > > > >
> >> > > > > > > > > > > > --
> >> > > > > > > > > > > > Robert Muir
> >> > > > > > > > > > > > rcmuir@gmail.com
> >> > > > > > > > > > >
> >> > > > > > > > > > >
> >> > > > > > > > > > >
> >> > > > > > >
> >> > >
> ---------------------------------------------------------------------
> >> > > > > > > > > > > To unsubscribe, e-mail:
> >> > > > > java-user-unsubscribe@lucene.apache.org
> >> > > > > > > > > > > For additional commands, e-mail:
> >> > > > > > java-user-help@lucene.apache.org
> >> > > > > > > > > > >
> >> > > > > > > > > > >
> >> > > > > > > > > >
> >> > > > > > > > > >
> >> > > > > > > > > > --
> >> > > > > > > > > > Robert Muir
> >> > > > > > > > > > rcmuir@gmail.com
> >> > > > > > > > > >
> >> > > > > > > > >
> >> > > > > > > >
> >> > > > > > > >
> >> > > > > > > >
> >> > > > > > > > --
> >> > > > > > > > Robert Muir
> >> > > > > > > > rcmuir@gmail.com
> >> > > > > > > >
> >> > > > > > >
> >> > > > > >
> >> > > > > >
> >> > > > > >
> >> > > > > > --
> >> > > > > > Robert Muir
> >> > > > > > rcmuir@gmail.com
> >> > > > > >
> >> > > > >
> >> > > >
> >> > > >
> >> > > >
> >> > > > --
> >> > > > Robert Muir
> >> > > > rcmuir@gmail.com
> >> > > >
> >> > >
> >> >
> >> >
> >> >
> >> > --
> >> > Robert Muir
> >> > rcmuir@gmail.com
> >> >
> >
> >
> >
> > --
> > Robert Muir
> > rcmuir@gmail.com
> >
>
>
>
> --
> Robert Muir
> rcmuir@gmail.com
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message