lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From KK <dioxide.softw...@gmail.com>
Subject Re: How to support stemming and case folding for english content mixed with non-english content?
Date Fri, 05 Jun 2009 06:24:32 GMT
Hello Robert,
I was thinking of kind of chaining analyzers, does this sound logical?
Currently I'm using the whitespace analyzer which tokenizes on whitespaces
only. As you mentioned earlier I dont need to use language identifiers,
which means I've to pass the full content through say first via
whitespaceanalyzer, then via analyzer for lowercasing, then analyzer for
stemming[do we have such analyzers or are they filters?I think they are
filters, then I've to go for custom analyzer]. As you said whitespace one
will act on the full content, lowercasing will aplly only on the english
part leaving my indic content untouched, right? and finally the stemming
will also apply to english words only and leaving the rest as such, right?
If passing the content through multiple analyzer is allowed then I thing I
can do else I have to have a custom analyzer that does exactly the same
thing. What you say?

Thanks,
KK.



On Thu, Jun 4, 2009 at 6:36 PM, Robert Muir <rcmuir@gmail.com> wrote:

> KK well you can always get some good examples from the lucene contrib
> codebase.
> For example, look at the DutchAnalyzer, especially:
>
> TokenStream tokenStream(String fieldName, Reader reader)
>
> See how it combines a specified tokenizer with various filters? this is
> what
> you want to do, except of course you want to use different tokenizer and
> filters.
>
> On Thu, Jun 4, 2009 at 8:53 AM, KK <dioxide.software@gmail.com> wrote:
>
> > Thanks Muir.
> > Thanks for letting me know that I dont need language identifiers.
> >  I'll have a look and will try to write the analyzer. For my case I think
> > it
> > wont be that difficult.
> > BTW, can you point me to some sample codes/tutorials writing custom
> > analyzers. I could not find something in LIA2ndEdn. Is something htere?
> do
> > let me know.
> >
> > Thanks,
> > KK.
> >
> >
> >
> > On Thu, Jun 4, 2009 at 6:19 PM, Robert Muir <rcmuir@gmail.com> wrote:
> >
> > > KK, for your case, you don't really need to go to the effort of
> detecting
> > > whether fragments are english or not.
> > > Because the English stemmers in lucene will not modify your Indic text,
> > and
> > > neither will the LowerCaseFilter.
> > >
> > > what you want to do is create a custom analyzer that works like this
> > >
> > > -WhitespaceTokenizer with WordDelimiterFilter [from Solr nightly jar],
> > > LowerCaseFilter, StopFilter, and PorterStemFilter-
> > >
> > > Thanks,
> > > Robert
> > >
> > > On Thu, Jun 4, 2009 at 8:28 AM, KK <dioxide.software@gmail.com> wrote:
> > >
> > > > Thank you all.
> > > > To be frank I was using Solr in the begining half a month ago. The
> > > > problem[rather bug] with solr was creation of new index on the fly.
> > > Though
> > > > they have a restful method for teh same, but it was not working. If I
> > > > remember properly one of Solr commiter "Noble Paul"[I dont know his
> > real
> > > > name] was trying to help me. I tried many nightly builds and spending
> a
> > > > couple of days stuck at that made me think of lucene and I switched
> to
> > > it.
> > > > Now after working with lucene which gives you full control of
> > everything
> > > I
> > > > don't want to switch to Solr.[LOL, to me Solr:Lucene is similar to
> > > > Window$:Linux, its my view only, though]. Coming back to the point as
> > Uwe
> > > > mentioned that we can do the same thing in lucene as well, what is
> > > > available
> > > > in Solr, Solr is based on Lucene only, right?
> > > > I request Uwe to give me some more ideas on using the analyzers from
> > solr
> > > > that will do the job for me, handling a mix of both english and
> > > non-english
> > > > content.
> > > > Muir, can you give me a bit detail description of how to use the
> > > > WordDelimiteFilter to do my job.
> > > > On a side note, I was thingking of writing a simple analyzer that
> will
> > do
> > > > the following,
> > > > #. If the webpage fragment is non-english[for me its some indian
> > > language]
> > > > then index them as such, no stemming/ stop word removal to begin
> with.
> > As
> > > I
> > > > know its in UCN unicode something like \u0021\u0012\u34ae\u0031[just
> a
> > > > sample]
> > > > # If the fragment is english then apply standard anlyzing process for
> > > > english content. I've not thought of quering in the same way as of
> now
> > > i.e
> > > > mix of non-english and engish words.
> > > > Now to get all this,
> > > >  #1. I need some sort of way which will let me know if the content is
> > > > english or not. If not english just add the tokens to the document.
> Do
> > we
> > > > really need language identifiers, as i dont have any other content
> that
> > > > uses
> > > > the same script as english other than those \u1234 things for my
> indian
> > > > language content. Any smart hack/trick for the same?
> > > >  #2. If the its english apply all normal process and add the stemmed
> > > token
> > > > to document.
> > > > For all this I was thinking of iterating earch word of the web page
> and
> > > > apply the above procedure. And finallyadd  the newly created document
> > to
> > > > the
> > > > index.
> > > >
> > > > I would like some one to guide me in this direction. I'm pretty
> people
> > > must
> > > > have done similar/same thing earlier, I request them to guide me/
> point
> > > me
> > > > to some tutorials for the same.
> > > > Else help me out writing a custom analyzer only if thats not going to
> > be
> > > > too
> > > > complex. LOL, I'm a new user to lucene and know basics of Java
> coding.
> > > > Thank you very much.
> > > >
> > > > --KK.
> > > >
> > > >
> > > >
> > > > On Thu, Jun 4, 2009 at 5:30 PM, Robert Muir <rcmuir@gmail.com>
> wrote:
> > > >
> > > > > yes this is true. for starters KK, might be good to startup solr
> and
> > > look
> > > > > at
> > > > > http://localhost:8983/solr/admin/analysis.jsp?highlight=on
> > > > >
> > > > > if you want to stick with lucene, the WordDelimiterFilter is the
> > piece
> > > > you
> > > > > will want for your text, mainly for punctuation but also for format
> > > > > characters such as ZWJ/ZWNJ.
> > > > >
> > > > > On Thu, Jun 4, 2009 at 7:51 AM, Uwe Schindler <uwe@thetaphi.de>
> > wrote:
> > > > >
> > > > > > You can also re-use the solr analyzers, as far as I found out.
> > There
> > > is
> > > > > an
> > > > > > issue in jIRA/discussion on java-dev to merge them.
> > > > > >
> > > > > > -----
> > > > > > Uwe Schindler
> > > > > > H.-H.-Meier-Allee 63, D-28213 Bremen
> > > > > > http://www.thetaphi.de
> > > > > > eMail: uwe@thetaphi.de
> > > > > >
> > > > > >
> > > > > > > -----Original Message-----
> > > > > > > From: Robert Muir [mailto:rcmuir@gmail.com]
> > > > > > > Sent: Thursday, June 04, 2009 1:18 PM
> > > > > > > To: java-user@lucene.apache.org
> > > > > > > Subject: Re: How to support stemming and case folding for
> english
> > > > > content
> > > > > > > mixed with non-english content?
> > > > > > >
> > > > > > > KK, ok, so you only really want to stem the english. This
is
> > good.
> > > > > > >
> > > > > > > Is it possible for you to consider using solr? solr's default
> > > > analyzer
> > > > > > for
> > > > > > > type 'text' will be good for your case. it will do the
> following
> > > > > > > 1. tokenize on whitespace
> > > > > > > 2. handle both indian language and english punctuation
> > > > > > > 3. lowercase the english.
> > > > > > > 4. stem the english.
> > > > > > >
> > > > > > > try a nightly build,
> > > > > > http://people.apache.org/builds/lucene/solr/nightly/
> > > > > > >
> > > > > > > On Thu, Jun 4, 2009 at 1:12 AM, KK <dioxide.software@gmail.com
> >
> > > > wrote:
> > > > > > >
> > > > > > > > Muir, thanks for your response.
> > > > > > > > I'm indexing indian language web pages which has got
descent
> > > amount
> > > > > of
> > > > > > > > english content mixed with therein. For the time being
I'm
> not
> > > > going
> > > > > to
> > > > > > > use
> > > > > > > > any stemmers as we don't have standard stemmers for
indian
> > > > languages
> > > > > .
> > > > > > > So
> > > > > > > > what I want to do is like this,
> > > > > > > > Say I've a web page having hindi content with 5% english
> > content.
> > > > > Then
> > > > > > > for
> > > > > > > > hindi I want to use the basic white space analyzer
as we dont
> > > have
> > > > > > > stemmers
> > > > > > > > for this as I mentioned earlier and whereever english
appears
> I
> > > > want
> > > > > > > them
> > > > > > > > to
> > > > > > > > be stemmed tokenized etc[the standard process used
for
> english
> > > > > > content].
> > > > > > > As
> > > > > > > > of now I'm using whitespace analyzer for the full
content
> which
> > > > > doesnot
> > > > > > > > support case folding, stemming etc for teh content.
So if
> there
> > > is
> > > > an
> > > > > > > > english word say "Detection" indexed as such then
searching
> for
> > > > > > > detection
> > > > > > > > or
> > > > > > > > detect is not giving any results, which is the expected
> > behavior,
> > > > but
> > > > > I
> > > > > > > > want
> > > > > > > > this kind of queries to give results.
> > > > > > > > I hope I made it clear. Let me know any ideas on doing
the
> > same.
> > > > And
> > > > > > one
> > > > > > > > more thing, I'm storing the full webpage content under
a
> single
> > > > > field,
> > > > > > I
> > > > > > > > hope this will not make any difference, right?
> > > > > > > > It seems I've to use language identifiers, but do
we really
> > need
> > > > > that?
> > > > > > > > Because we've only non-english content mixed with
english[and
> > not
> > > > > > french
> > > > > > > or
> > > > > > > > russian etc].
> > > > > > > >
> > > > > > > > What is the best way of approaching the problem? Any
> thoughts!
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > KK.
> > > > > > > >
> > > > > > > > On Wed, Jun 3, 2009 at 9:42 PM, Robert Muir <
> rcmuir@gmail.com>
> > > > > wrote:
> > > > > > > >
> > > > > > > > > KK, is all of your latin script text actually
english? Is
> > there
> > > > > stuff
> > > > > > > > like
> > > > > > > > > german or french mixed in?
> > > > > > > > >
> > > > > > > > > And for your non-english content (your examples
have been
> > > indian
> > > > > > > writing
> > > > > > > > > systems), is it generally true that if you had
devanagari,
> > you
> > > > can
> > > > > > > assume
> > > > > > > > > its hindi? or is there stuff like marathi mixed
in?
> > > > > > > > >
> > > > > > > > > Reason I say this is to invoke the right stemmers,
you
> really
> > > > need
> > > > > > > some
> > > > > > > > > language detection, but perhaps in your case
you can cheat
> > and
> > > > > detect
> > > > > > > > this
> > > > > > > > > based on scripts...
> > > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > > Robert
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > On Wed, Jun 3, 2009 at 10:15 AM, KK <
> > > dioxide.software@gmail.com>
> > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Hi All,
> > > > > > > > > > I'm indexing some non-english content. But
the page also
> > > > contains
> > > > > > > > english
> > > > > > > > > > content. As of now I'm using WhitespaceAnalyzer
for all
> > > content
> > > > > and
> > > > > > > I'm
> > > > > > > > > > storing the full webpage content under a
single filed.
> Now
> > we
> > > > > > > require
> > > > > > > > to
> > > > > > > > > > support case folding and stemmming for the
english
> content
> > > > > > > intermingled
> > > > > > > > > > with
> > > > > > > > > > non-english content. I must metion that
we dont have
> > stemming
> > > > and
> > > > > > > case
> > > > > > > > > > folding for these non-english content. I'm
stuck with
> this.
> > > > Some
> > > > > > one
> > > > > > > do
> > > > > > > > > let
> > > > > > > > > > me know how to proceed for fixing this issue.
> > > > > > > > > >
> > > > > > > > > > Thanks,
> > > > > > > > > > KK.
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > --
> > > > > > > > > Robert Muir
> > > > > > > > > rcmuir@gmail.com
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > > Robert Muir
> > > > > > > rcmuir@gmail.com
> > > > > >
> > > > > >
> > > > > >
> > ---------------------------------------------------------------------
> > > > > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > > > > For additional commands, e-mail:
> java-user-help@lucene.apache.org
> > > > > >
> > > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Robert Muir
> > > > > rcmuir@gmail.com
> > > > >
> > > >
> > >
> > >
> > >
> > > --
> > > Robert Muir
> > > rcmuir@gmail.com
> > >
> >
>
>
>
> --
> Robert Muir
> rcmuir@gmail.com
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message