lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From KK <dioxide.softw...@gmail.com>
Subject Re: How to support stemming and case folding for english content mixed with non-english content?
Date Fri, 05 Jun 2009 15:45:11 GMT
Robert, I tried to use the worddelimiterfilterfactory as well, but I faced
the same problem[I saw solr using the factory instead of the filter]. I
think I sould try using the first option[copyting that to local directory]
and use it with the options you mentioned. I'll try it out and will post it
here.

Thanks,
KK.
On Fri, Jun 5, 2009 at 7:37 PM, Robert Muir <rcmuir@gmail.com> wrote:

> kk an easier solution to your first problem is to use
> worddelimiterfilterfactory if possible... you can get an instance of
> worddelimiter filter from that.
>
> thanks,
> robert
>
> On Fri, Jun 5, 2009 at 10:06 AM, Robert Muir<rcmuir@gmail.com> wrote:
> > kk as for your first issue, that WordDelimiterFilter is package
> > protected, one option is to make a copy of the code and change the
> > class declaration to public.
> > the other option is to put your entire analyzer in
> > 'org.apache.solr.analysis' package so that you can access it...
> >
> > for the 2nd issue, yes you need to supply some options to it. the
> > default options solr applies to type 'text' seemed to work well for me
> > with indic:
> >
> > {splitOnCaseChange=1, generateNumberParts=1, catenateWords=1,
> > generateWordParts=1, catenateAll=0, catenateNumbers=1}
> >
> > On Fri, Jun 5, 2009 at 9:12 AM, KK <dioxide.software@gmail.com> wrote:
> >>
> >> Thanks Robert. There is one problem though, I'm able to plugin the word
> >> delimiter filter from solr-nightly jar file. When I tried to do
> something
> >> like,
> >>  TokenStream ts = new WhitespaceTokenizer(reader);
> >>   ts = new WordDelimiterFilter(ts);
> >>   ts = new PorterStemmerFilter(ts);
> >>   ...rest as in the last mail...
> >>
> >> It gave me an error saying that
> >>
> >> org.apache.solr.analysis.WordDelimiterFilter is not public in
> >> org.apache.solr.analysis; cannot be accessed from outside package
> >> import org.apache.solr.analysis.WordDelimiterFilter;
> >>                               ^
> >> solrSearch/IndicAnalyzer.java:38: cannot find symbol
> >> symbol  : class WordDelimiterFilter
> >> location: class solrSearch.IndicAnalyzer
> >>    ts = new WordDelimiterFilter(ts);
> >>             ^
> >> 2 errors
> >>
> >> Then i tried to see the code for worddelimitefiter from solrnightly src
> and
> >> found that there are many deprecated constructors though they require a
> lot
> >> of parameters alongwith tokenstream. I went through the solr wiki for
> >> worddelimiterfilterfactory here,
> >>
> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#head-1c9b83870ca7890cd73b193cefed83c283339089
> >> and say that there also its specified that we've to mention the
> parameters
> >> and both are different for indexing and querying.
> >> I'm kind of stuck here, how do I make use of worddelimiterfilter in my
> >> custom analyzer, I've to use it anyway.
> >> In my code I've to make use of worddelimiterfilter and not
> >> worddelimiterfilterfactory, right? I don't know whats the use of the
> other
> >> one. Anyway can you guide me getting rid of the above error. And yes
> I'll
> >> change the order of applying the filters as you said.
> >>
> >> Thanks,
> >> KK.
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >> On Fri, Jun 5, 2009 at 5:48 PM, Robert Muir <rcmuir@gmail.com> wrote:
> >>
> >> > KK, you got the right idea.
> >> >
> >> > though I think you might want to change the order, move the stopfilter
> >> > before the porter stem filter... otherwise it might not work
> correctly.
> >> >
> >> > On Fri, Jun 5, 2009 at 8:05 AM, KK <dioxide.software@gmail.com>
> wrote:
> >> >
> >> > > Thanks Robert. This is exactly what I did and  its working but
> delimiter
> >> > is
> >> > > missing I'm going to add that from solr-nightly.jar
> >> > >
> >> > > /**
> >> > >  * Analyzer for Indian language.
> >> > >  */
> >> > > public class IndicAnalyzer extends Analyzer {
> >> > >  public TokenStream tokenStream(String fieldName, Reader reader) {
> >> > >     TokenStream ts = new WhitespaceTokenizer(reader);
> >> > >    ts = new PorterStemFilter(ts);
> >> > >    ts = new LowerCaseFilter(ts);
> >> > >    ts = new StopFilter(ts, StopAnalyzer.ENGLISH_STOP_WORDS);
> >> > >    return ts;
> >> > >  }
> >> > > }
> >> > >
> >> > > Its able to do stemming/case-folding and supports search for both
> english
> >> > > and indic texts. let me try out the delimiter. Will update you on
> that.
> >> > >
> >> > > Thanks a lot.
> >> > > KK
> >> > >
> >> > > On Fri, Jun 5, 2009 at 5:30 PM, Robert Muir <rcmuir@gmail.com>
> wrote:
> >> > >
> >> > > > i think you are on the right track... once you build your
> analyzer, put
> >> > > it
> >> > > > in your classpath and play around with it in luke and see if
it
> does
> >> > what
> >> > > > you want.
> >> > > >
> >> > > > On Fri, Jun 5, 2009 at 3:19 AM, KK <dioxide.software@gmail.com>
> wrote:
> >> > > >
> >> > > > > Hi Robert,
> >> > > > > This is what I copied from ThaiAnalyzer @ lucene contrib
> >> > > > >
> >> > > > > public class ThaiAnalyzer extends Analyzer {
> >> > > > >  public TokenStream tokenStream(String fieldName, Reader
reader)
> {
> >> > > > >      TokenStream ts = new StandardTokenizer(reader);
> >> > > > >    ts = new StandardFilter(ts);
> >> > > > >    ts = new ThaiWordFilter(ts);
> >> > > > >    ts = new StopFilter(ts, StopAnalyzer.ENGLISH_STOP_WORDS);
> >> > > > >    return ts;
> >> > > > >  }
> >> > > > > }
> >> > > > >
> >> > > > > Now as you said, I've to use whitespacetokenizer
> >> > > > > withworddelimitefilter[solr
> >> > > > > nightly.jar] stop wordremoval, porter stemmer etc , so it
is
> >> > something
> >> > > > like
> >> > > > > this,
> >> > > > > public class IndicAnalyzer extends Analyzer {
> >> > > > >  public TokenStream tokenStream(String fieldName, Reader
reader)
> {
> >> > > > >   TokenStream ts = new WhiteSpaceTokenizer(reader);
> >> > > > >   ts = new WordDelimiterFilter(ts);
> >> > > > >   ts = new LowerCaseFilter(ts);
> >> > > > >   ts = new StopFilter(ts, StopAnalyzer.ENGLISH_STOP_WORDS)
  //
> >> > english
> >> > > > > stop filter, is this the default one?
> >> > > > >   ts = new PorterFilter(ts);
> >> > > > >   return ts;
> >> > > > >  }
> >> > > > > }
> >> > > > >
> >> > > > > Does this sound OK? I think it will do the job...let me
try it
> out..
> >> > > > > I dont need custom filter as per my requirement, at least
not
> for
> >> > these
> >> > > > > basic things I'm doing? I think so...
> >> > > > >
> >> > > > > Thanks,
> >> > > > > KK.
> >> > > > >
> >> > > > >
> >> > > > > On Thu, Jun 4, 2009 at 6:36 PM, Robert Muir <rcmuir@gmail.com>
> >> > wrote:
> >> > > > >
> >> > > > > > KK well you can always get some good examples from
the lucene
> >> > contrib
> >> > > > > > codebase.
> >> > > > > > For example, look at the DutchAnalyzer, especially:
> >> > > > > >
> >> > > > > > TokenStream tokenStream(String fieldName, Reader reader)
> >> > > > > >
> >> > > > > > See how it combines a specified tokenizer with various
> filters?
> >> > this
> >> > > is
> >> > > > > > what
> >> > > > > > you want to do, except of course you want to use different
> >> > tokenizer
> >> > > > and
> >> > > > > > filters.
> >> > > > > >
> >> > > > > > On Thu, Jun 4, 2009 at 8:53 AM, KK <
> dioxide.software@gmail.com>
> >> > > wrote:
> >> > > > > >
> >> > > > > > > Thanks Muir.
> >> > > > > > > Thanks for letting me know that I dont need language
> identifiers.
> >> > > > > > >  I'll have a look and will try to write the analyzer.
For my
> case
> >> > I
> >> > > > > think
> >> > > > > > > it
> >> > > > > > > wont be that difficult.
> >> > > > > > > BTW, can you point me to some sample codes/tutorials
writing
> >> > custom
> >> > > > > > > analyzers. I could not find something in LIA2ndEdn.
Is
> something
> >> > > > htere?
> >> > > > > > do
> >> > > > > > > let me know.
> >> > > > > > >
> >> > > > > > > Thanks,
> >> > > > > > > KK.
> >> > > > > > >
> >> > > > > > >
> >> > > > > > >
> >> > > > > > > On Thu, Jun 4, 2009 at 6:19 PM, Robert Muir <
> rcmuir@gmail.com>
> >> > > > wrote:
> >> > > > > > >
> >> > > > > > > > KK, for your case, you don't really need
to go to the
> effort of
> >> > > > > > detecting
> >> > > > > > > > whether fragments are english or not.
> >> > > > > > > > Because the English stemmers in lucene will
not modify
> your
> >> > Indic
> >> > > > > text,
> >> > > > > > > and
> >> > > > > > > > neither will the LowerCaseFilter.
> >> > > > > > > >
> >> > > > > > > > what you want to do is create a custom analyzer
that works
> like
> >> > > > this
> >> > > > > > > >
> >> > > > > > > > -WhitespaceTokenizer with WordDelimiterFilter
[from Solr
> >> > nightly
> >> > > > > jar],
> >> > > > > > > > LowerCaseFilter, StopFilter, and PorterStemFilter-
> >> > > > > > > >
> >> > > > > > > > Thanks,
> >> > > > > > > > Robert
> >> > > > > > > >
> >> > > > > > > > On Thu, Jun 4, 2009 at 8:28 AM, KK <
> dioxide.software@gmail.com
> >> > >
> >> > > > > wrote:
> >> > > > > > > >
> >> > > > > > > > > Thank you all.
> >> > > > > > > > > To be frank I was using Solr in the
begining half a
> month
> >> > ago.
> >> > > > The
> >> > > > > > > > > problem[rather bug] with solr was creation
of new index
> on
> >> > the
> >> > > > fly.
> >> > > > > > > > Though
> >> > > > > > > > > they have a restful method for teh same,
but it was not
> >> > > working.
> >> > > > If
> >> > > > > I
> >> > > > > > > > > remember properly one of Solr commiter
"Noble Paul"[I
> dont
> >> > know
> >> > > > his
> >> > > > > > > real
> >> > > > > > > > > name] was trying to help me. I tried
many nightly builds
> and
> >> > > > > spending
> >> > > > > > a
> >> > > > > > > > > couple of days stuck at that made me
think of lucene and
> I
> >> > > > switched
> >> > > > > > to
> >> > > > > > > > it.
> >> > > > > > > > > Now after working with lucene which
gives you full
> control of
> >> > > > > > > everything
> >> > > > > > > > I
> >> > > > > > > > > don't want to switch to Solr.[LOL, to
me Solr:Lucene is
> >> > similar
> >> > > > to
> >> > > > > > > > > Window$:Linux, its my view only, though].
Coming back to
> the
> >> > > > point
> >> > > > > as
> >> > > > > > > Uwe
> >> > > > > > > > > mentioned that we can do the same thing
in lucene as
> well,
> >> > what
> >> > > > is
> >> > > > > > > > > available
> >> > > > > > > > > in Solr, Solr is based on Lucene only,
right?
> >> > > > > > > > > I request Uwe to give me some more ideas
on using the
> >> > analyzers
> >> > > > > from
> >> > > > > > > solr
> >> > > > > > > > > that will do the job for me, handling
a mix of both
> english
> >> > and
> >> > > > > > > > non-english
> >> > > > > > > > > content.
> >> > > > > > > > > Muir, can you give me a bit detail description
of how to
> use
> >> > > the
> >> > > > > > > > > WordDelimiteFilter to do my job.
> >> > > > > > > > > On a side note, I was thingking of writing
a simple
> analyzer
> >> > > that
> >> > > > > > will
> >> > > > > > > do
> >> > > > > > > > > the following,
> >> > > > > > > > > #. If the webpage fragment is non-english[for
me its
> some
> >> > > indian
> >> > > > > > > > language]
> >> > > > > > > > > then index them as such, no stemming/
stop word removal
> to
> >> > > begin
> >> > > > > > with.
> >> > > > > > > As
> >> > > > > > > > I
> >> > > > > > > > > know its in UCN unicode something like
> >> > > > > \u0021\u0012\u34ae\u0031[just
> >> > > > > > a
> >> > > > > > > > > sample]
> >> > > > > > > > > # If the fragment is english then apply
standard
> anlyzing
> >> > > process
> >> > > > > for
> >> > > > > > > > > english content. I've not thought of
quering in the same
> way
> >> > as
> >> > > > of
> >> > > > > > now
> >> > > > > > > > i.e
> >> > > > > > > > > mix of non-english and engish words.
> >> > > > > > > > > Now to get all this,
> >> > > > > > > > >  #1. I need some sort of way which will
let me know if
> the
> >> > > > content
> >> > > > > is
> >> > > > > > > > > english or not. If not english just
add the tokens to
> the
> >> > > > document.
> >> > > > > > Do
> >> > > > > > > we
> >> > > > > > > > > really need language identifiers, as
i dont have any
> other
> >> > > > content
> >> > > > > > that
> >> > > > > > > > > uses
> >> > > > > > > > > the same script as english other than
those \u1234
> things for
> >> > > my
> >> > > > > > indian
> >> > > > > > > > > language content. Any smart hack/trick
for the same?
> >> > > > > > > > >  #2. If the its english apply all normal
process and add
> the
> >> > > > > stemmed
> >> > > > > > > > token
> >> > > > > > > > > to document.
> >> > > > > > > > > For all this I was thinking of iterating
earch word of
> the
> >> > web
> >> > > > page
> >> > > > > > and
> >> > > > > > > > > apply the above procedure. And finallyadd
 the newly
> created
> >> > > > > document
> >> > > > > > > to
> >> > > > > > > > > the
> >> > > > > > > > > index.
> >> > > > > > > > >
> >> > > > > > > > > I would like some one to guide me in
this direction. I'm
> >> > pretty
> >> > > > > > people
> >> > > > > > > > must
> >> > > > > > > > > have done similar/same thing earlier,
I request them to
> guide
> >> > > me/
> >> > > > > > point
> >> > > > > > > > me
> >> > > > > > > > > to some tutorials for the same.
> >> > > > > > > > > Else help me out writing a custom analyzer
only if thats
> not
> >> > > > going
> >> > > > > to
> >> > > > > > > be
> >> > > > > > > > > too
> >> > > > > > > > > complex. LOL, I'm a new user to lucene
and know basics
> of
> >> > Java
> >> > > > > > coding.
> >> > > > > > > > > Thank you very much.
> >> > > > > > > > >
> >> > > > > > > > > --KK.
> >> > > > > > > > >
> >> > > > > > > > >
> >> > > > > > > > >
> >> > > > > > > > > On Thu, Jun 4, 2009 at 5:30 PM, Robert
Muir <
> >> > rcmuir@gmail.com>
> >> > > > > > wrote:
> >> > > > > > > > >
> >> > > > > > > > > > yes this is true. for starters
KK, might be good to
> startup
> >> > > > solr
> >> > > > > > and
> >> > > > > > > > look
> >> > > > > > > > > > at
> >> > > > > > > > > >
> http://localhost:8983/solr/admin/analysis.jsp?highlight=on
> >> > > > > > > > > >
> >> > > > > > > > > > if you want to stick with lucene,
the
> WordDelimiterFilter
> >> > is
> >> > > > the
> >> > > > > > > piece
> >> > > > > > > > > you
> >> > > > > > > > > > will want for your text, mainly
for punctuation but
> also
> >> > for
> >> > > > > format
> >> > > > > > > > > > characters such as ZWJ/ZWNJ.
> >> > > > > > > > > >
> >> > > > > > > > > > On Thu, Jun 4, 2009 at 7:51 AM,
Uwe Schindler <
> >> > > uwe@thetaphi.de
> >> > > > >
> >> > > > > > > wrote:
> >> > > > > > > > > >
> >> > > > > > > > > > > You can also re-use the solr
analyzers, as far as I
> found
> >> > > > out.
> >> > > > > > > There
> >> > > > > > > > is
> >> > > > > > > > > > an
> >> > > > > > > > > > > issue in jIRA/discussion on
java-dev to merge them.
> >> > > > > > > > > > >
> >> > > > > > > > > > > -----
> >> > > > > > > > > > > Uwe Schindler
> >> > > > > > > > > > > H.-H.-Meier-Allee 63, D-28213
Bremen
> >> > > > > > > > > > > http://www.thetaphi.de
> >> > > > > > > > > > > eMail: uwe@thetaphi.de
> >> > > > > > > > > > >
> >> > > > > > > > > > >
> >> > > > > > > > > > > > -----Original Message-----
> >> > > > > > > > > > > > From: Robert Muir [mailto:rcmuir@gmail.com]
> >> > > > > > > > > > > > Sent: Thursday, June
04, 2009 1:18 PM
> >> > > > > > > > > > > > To: java-user@lucene.apache.org
> >> > > > > > > > > > > > Subject: Re: How to support
stemming and case
> folding
> >> > for
> >> > > > > > english
> >> > > > > > > > > > content
> >> > > > > > > > > > > > mixed with non-english
content?
> >> > > > > > > > > > > >
> >> > > > > > > > > > > > KK, ok, so you only really
want to stem the
> english.
> >> > This
> >> > > > is
> >> > > > > > > good.
> >> > > > > > > > > > > >
> >> > > > > > > > > > > > Is it possible for you
to consider using solr?
> solr's
> >> > > > default
> >> > > > > > > > > analyzer
> >> > > > > > > > > > > for
> >> > > > > > > > > > > > type 'text' will be good
for your case. it will do
> the
> >> > > > > > following
> >> > > > > > > > > > > > 1. tokenize on whitespace
> >> > > > > > > > > > > > 2. handle both indian
language and english
> punctuation
> >> > > > > > > > > > > > 3. lowercase the english.
> >> > > > > > > > > > > > 4. stem the english.
> >> > > > > > > > > > > >
> >> > > > > > > > > > > > try a nightly build,
> >> > > > > > > > > > >
> http://people.apache.org/builds/lucene/solr/nightly/
> >> > > > > > > > > > > >
> >> > > > > > > > > > > > On Thu, Jun 4, 2009 at
1:12 AM, KK <
> >> > > > > dioxide.software@gmail.com
> >> > > > > > >
> >> > > > > > > > > wrote:
> >> > > > > > > > > > > >
> >> > > > > > > > > > > > > Muir, thanks for
your response.
> >> > > > > > > > > > > > > I'm indexing indian
language web pages which has
> got
> >> > > > > descent
> >> > > > > > > > amount
> >> > > > > > > > > > of
> >> > > > > > > > > > > > > english content
mixed with therein. For the time
> >> > being
> >> > > > I'm
> >> > > > > > not
> >> > > > > > > > > going
> >> > > > > > > > > > to
> >> > > > > > > > > > > > use
> >> > > > > > > > > > > > > any stemmers as
we don't have standard stemmers
> for
> >> > > > indian
> >> > > > > > > > > languages
> >> > > > > > > > > > .
> >> > > > > > > > > > > > So
> >> > > > > > > > > > > > > what I want to do
is like this,
> >> > > > > > > > > > > > > Say I've a web page
having hindi content with 5%
> >> > > english
> >> > > > > > > content.
> >> > > > > > > > > > Then
> >> > > > > > > > > > > > for
> >> > > > > > > > > > > > > hindi I want to
use the basic white space
> analyzer as
> >> > > we
> >> > > > > dont
> >> > > > > > > > have
> >> > > > > > > > > > > > stemmers
> >> > > > > > > > > > > > > for this as I mentioned
earlier and whereever
> english
> >> > > > > appears
> >> > > > > > I
> >> > > > > > > > > want
> >> > > > > > > > > > > > them
> >> > > > > > > > > > > > > to
> >> > > > > > > > > > > > > be stemmed tokenized
etc[the standard process
> used
> >> > for
> >> > > > > > english
> >> > > > > > > > > > > content].
> >> > > > > > > > > > > > As
> >> > > > > > > > > > > > > of now I'm using
whitespace analyzer for the
> full
> >> > > content
> >> > > > > > which
> >> > > > > > > > > > doesnot
> >> > > > > > > > > > > > > support case folding,
stemming etc for teh
> content.
> >> > So
> >> > > if
> >> > > > > > there
> >> > > > > > > > is
> >> > > > > > > > > an
> >> > > > > > > > > > > > > english word say
"Detection" indexed as such
> then
> >> > > > searching
> >> > > > > > for
> >> > > > > > > > > > > > detection
> >> > > > > > > > > > > > > or
> >> > > > > > > > > > > > > detect is not giving
any results, which is the
> >> > expected
> >> > > > > > > behavior,
> >> > > > > > > > > but
> >> > > > > > > > > > I
> >> > > > > > > > > > > > > want
> >> > > > > > > > > > > > > this kind of queries
to give results.
> >> > > > > > > > > > > > > I hope I made it
clear. Let me know any ideas on
> >> > doing
> >> > > > the
> >> > > > > > > same.
> >> > > > > > > > > And
> >> > > > > > > > > > > one
> >> > > > > > > > > > > > > more thing, I'm
storing the full webpage content
> >> > under
> >> > > a
> >> > > > > > single
> >> > > > > > > > > > field,
> >> > > > > > > > > > > I
> >> > > > > > > > > > > > > hope this will not
make any difference, right?
> >> > > > > > > > > > > > > It seems I've to
use language identifiers, but
> do we
> >> > > > really
> >> > > > > > > need
> >> > > > > > > > > > that?
> >> > > > > > > > > > > > > Because we've only
non-english content mixed
> with
> >> > > > > english[and
> >> > > > > > > not
> >> > > > > > > > > > > french
> >> > > > > > > > > > > > or
> >> > > > > > > > > > > > > russian etc].
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > > > What is the best
way of approaching the problem?
> Any
> >> > > > > > thoughts!
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > > > Thanks,
> >> > > > > > > > > > > > > KK.
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > > > On Wed, Jun 3, 2009
at 9:42 PM, Robert Muir <
> >> > > > > > rcmuir@gmail.com>
> >> > > > > > > > > > wrote:
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > > > > KK, is all
of your latin script text actually
> >> > > english?
> >> > > > Is
> >> > > > > > > there
> >> > > > > > > > > > stuff
> >> > > > > > > > > > > > > like
> >> > > > > > > > > > > > > > german or french
mixed in?
> >> > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > And for your
non-english content (your
> examples
> >> > have
> >> > > > been
> >> > > > > > > > indian
> >> > > > > > > > > > > > writing
> >> > > > > > > > > > > > > > systems), is
it generally true that if you had
> >> > > > > devanagari,
> >> > > > > > > you
> >> > > > > > > > > can
> >> > > > > > > > > > > > assume
> >> > > > > > > > > > > > > > its hindi?
or is there stuff like marathi
> mixed in?
> >> > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > Reason I say
this is to invoke the right
> stemmers,
> >> > > you
> >> > > > > > really
> >> > > > > > > > > need
> >> > > > > > > > > > > > some
> >> > > > > > > > > > > > > > language detection,
but perhaps in your case
> you
> >> > can
> >> > > > > cheat
> >> > > > > > > and
> >> > > > > > > > > > detect
> >> > > > > > > > > > > > > this
> >> > > > > > > > > > > > > > based on scripts...
> >> > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > Thanks,
> >> > > > > > > > > > > > > > Robert
> >> > > > > > > > > > > > > >
> >> > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > On Wed, Jun
3, 2009 at 10:15 AM, KK <
> >> > > > > > > > dioxide.software@gmail.com>
> >> > > > > > > > > > > > wrote:
> >> > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > Hi All,
> >> > > > > > > > > > > > > > > I'm indexing
some non-english content. But
> the
> >> > page
> >> > > > > also
> >> > > > > > > > > contains
> >> > > > > > > > > > > > > english
> >> > > > > > > > > > > > > > > content.
As of now I'm using
> WhitespaceAnalyzer
> >> > for
> >> > > > all
> >> > > > > > > > content
> >> > > > > > > > > > and
> >> > > > > > > > > > > > I'm
> >> > > > > > > > > > > > > > > storing
the full webpage content under a
> single
> >> > > > filed.
> >> > > > > > Now
> >> > > > > > > we
> >> > > > > > > > > > > > require
> >> > > > > > > > > > > > > to
> >> > > > > > > > > > > > > > > support
case folding and stemmming for the
> >> > english
> >> > > > > > content
> >> > > > > > > > > > > > intermingled
> >> > > > > > > > > > > > > > > with
> >> > > > > > > > > > > > > > > non-english
content. I must metion that we
> dont
> >> > > have
> >> > > > > > > stemming
> >> > > > > > > > > and
> >> > > > > > > > > > > > case
> >> > > > > > > > > > > > > > > folding
for these non-english content. I'm
> stuck
> >> > > with
> >> > > > > > this.
> >> > > > > > > > > Some
> >> > > > > > > > > > > one
> >> > > > > > > > > > > > do
> >> > > > > > > > > > > > > > let
> >> > > > > > > > > > > > > > > me know
how to proceed for fixing this
> issue.
> >> > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > Thanks,
> >> > > > > > > > > > > > > > > KK.
> >> > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > >
> >> > > > > > > > > > > > > >
> >> > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > --
> >> > > > > > > > > > > > > > Robert Muir
> >> > > > > > > > > > > > > > rcmuir@gmail.com
> >> > > > > > > > > > > > > >
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > >
> >> > > > > > > > > > > >
> >> > > > > > > > > > > >
> >> > > > > > > > > > > > --
> >> > > > > > > > > > > > Robert Muir
> >> > > > > > > > > > > > rcmuir@gmail.com
> >> > > > > > > > > > >
> >> > > > > > > > > > >
> >> > > > > > > > > > >
> >> > > > > > >
> >> > >
> ---------------------------------------------------------------------
> >> > > > > > > > > > > To unsubscribe, e-mail:
> >> > > > > java-user-unsubscribe@lucene.apache.org
> >> > > > > > > > > > > For additional commands, e-mail:
> >> > > > > > java-user-help@lucene.apache.org
> >> > > > > > > > > > >
> >> > > > > > > > > > >
> >> > > > > > > > > >
> >> > > > > > > > > >
> >> > > > > > > > > > --
> >> > > > > > > > > > Robert Muir
> >> > > > > > > > > > rcmuir@gmail.com
> >> > > > > > > > > >
> >> > > > > > > > >
> >> > > > > > > >
> >> > > > > > > >
> >> > > > > > > >
> >> > > > > > > > --
> >> > > > > > > > Robert Muir
> >> > > > > > > > rcmuir@gmail.com
> >> > > > > > > >
> >> > > > > > >
> >> > > > > >
> >> > > > > >
> >> > > > > >
> >> > > > > > --
> >> > > > > > Robert Muir
> >> > > > > > rcmuir@gmail.com
> >> > > > > >
> >> > > > >
> >> > > >
> >> > > >
> >> > > >
> >> > > > --
> >> > > > Robert Muir
> >> > > > rcmuir@gmail.com
> >> > > >
> >> > >
> >> >
> >> >
> >> >
> >> > --
> >> > Robert Muir
> >> > rcmuir@gmail.com
> >> >
> >
> >
> >
> > --
> > Robert Muir
> > rcmuir@gmail.com
> >
>
>
>
> --
> Robert Muir
> rcmuir@gmail.com
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message