lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From KK <dioxide.softw...@gmail.com>
Subject Re: How to support stemming and case folding for english content mixed with non-english content?
Date Thu, 04 Jun 2009 12:53:56 GMT
Thanks Muir.
Thanks for letting me know that I dont need language identifiers.
 I'll have a look and will try to write the analyzer. For my case I think it
wont be that difficult.
BTW, can you point me to some sample codes/tutorials writing custom
analyzers. I could not find something in LIA2ndEdn. Is something htere? do
let me know.

Thanks,
KK.



On Thu, Jun 4, 2009 at 6:19 PM, Robert Muir <rcmuir@gmail.com> wrote:

> KK, for your case, you don't really need to go to the effort of detecting
> whether fragments are english or not.
> Because the English stemmers in lucene will not modify your Indic text, and
> neither will the LowerCaseFilter.
>
> what you want to do is create a custom analyzer that works like this
>
> -WhitespaceTokenizer with WordDelimiterFilter [from Solr nightly jar],
> LowerCaseFilter, StopFilter, and PorterStemFilter-
>
> Thanks,
> Robert
>
> On Thu, Jun 4, 2009 at 8:28 AM, KK <dioxide.software@gmail.com> wrote:
>
> > Thank you all.
> > To be frank I was using Solr in the begining half a month ago. The
> > problem[rather bug] with solr was creation of new index on the fly.
> Though
> > they have a restful method for teh same, but it was not working. If I
> > remember properly one of Solr commiter "Noble Paul"[I dont know his real
> > name] was trying to help me. I tried many nightly builds and spending a
> > couple of days stuck at that made me think of lucene and I switched to
> it.
> > Now after working with lucene which gives you full control of everything
> I
> > don't want to switch to Solr.[LOL, to me Solr:Lucene is similar to
> > Window$:Linux, its my view only, though]. Coming back to the point as Uwe
> > mentioned that we can do the same thing in lucene as well, what is
> > available
> > in Solr, Solr is based on Lucene only, right?
> > I request Uwe to give me some more ideas on using the analyzers from solr
> > that will do the job for me, handling a mix of both english and
> non-english
> > content.
> > Muir, can you give me a bit detail description of how to use the
> > WordDelimiteFilter to do my job.
> > On a side note, I was thingking of writing a simple analyzer that will do
> > the following,
> > #. If the webpage fragment is non-english[for me its some indian
> language]
> > then index them as such, no stemming/ stop word removal to begin with. As
> I
> > know its in UCN unicode something like \u0021\u0012\u34ae\u0031[just a
> > sample]
> > # If the fragment is english then apply standard anlyzing process for
> > english content. I've not thought of quering in the same way as of now
> i.e
> > mix of non-english and engish words.
> > Now to get all this,
> >  #1. I need some sort of way which will let me know if the content is
> > english or not. If not english just add the tokens to the document. Do we
> > really need language identifiers, as i dont have any other content that
> > uses
> > the same script as english other than those \u1234 things for my indian
> > language content. Any smart hack/trick for the same?
> >  #2. If the its english apply all normal process and add the stemmed
> token
> > to document.
> > For all this I was thinking of iterating earch word of the web page and
> > apply the above procedure. And finallyadd  the newly created document to
> > the
> > index.
> >
> > I would like some one to guide me in this direction. I'm pretty people
> must
> > have done similar/same thing earlier, I request them to guide me/ point
> me
> > to some tutorials for the same.
> > Else help me out writing a custom analyzer only if thats not going to be
> > too
> > complex. LOL, I'm a new user to lucene and know basics of Java coding.
> > Thank you very much.
> >
> > --KK.
> >
> >
> >
> > On Thu, Jun 4, 2009 at 5:30 PM, Robert Muir <rcmuir@gmail.com> wrote:
> >
> > > yes this is true. for starters KK, might be good to startup solr and
> look
> > > at
> > > http://localhost:8983/solr/admin/analysis.jsp?highlight=on
> > >
> > > if you want to stick with lucene, the WordDelimiterFilter is the piece
> > you
> > > will want for your text, mainly for punctuation but also for format
> > > characters such as ZWJ/ZWNJ.
> > >
> > > On Thu, Jun 4, 2009 at 7:51 AM, Uwe Schindler <uwe@thetaphi.de> wrote:
> > >
> > > > You can also re-use the solr analyzers, as far as I found out. There
> is
> > > an
> > > > issue in jIRA/discussion on java-dev to merge them.
> > > >
> > > > -----
> > > > Uwe Schindler
> > > > H.-H.-Meier-Allee 63, D-28213 Bremen
> > > > http://www.thetaphi.de
> > > > eMail: uwe@thetaphi.de
> > > >
> > > >
> > > > > -----Original Message-----
> > > > > From: Robert Muir [mailto:rcmuir@gmail.com]
> > > > > Sent: Thursday, June 04, 2009 1:18 PM
> > > > > To: java-user@lucene.apache.org
> > > > > Subject: Re: How to support stemming and case folding for english
> > > content
> > > > > mixed with non-english content?
> > > > >
> > > > > KK, ok, so you only really want to stem the english. This is good.
> > > > >
> > > > > Is it possible for you to consider using solr? solr's default
> > analyzer
> > > > for
> > > > > type 'text' will be good for your case. it will do the following
> > > > > 1. tokenize on whitespace
> > > > > 2. handle both indian language and english punctuation
> > > > > 3. lowercase the english.
> > > > > 4. stem the english.
> > > > >
> > > > > try a nightly build,
> > > > http://people.apache.org/builds/lucene/solr/nightly/
> > > > >
> > > > > On Thu, Jun 4, 2009 at 1:12 AM, KK <dioxide.software@gmail.com>
> > wrote:
> > > > >
> > > > > > Muir, thanks for your response.
> > > > > > I'm indexing indian language web pages which has got descent
> amount
> > > of
> > > > > > english content mixed with therein. For the time being I'm not
> > going
> > > to
> > > > > use
> > > > > > any stemmers as we don't have standard stemmers for indian
> > languages
> > > .
> > > > > So
> > > > > > what I want to do is like this,
> > > > > > Say I've a web page having hindi content with 5% english content.
> > > Then
> > > > > for
> > > > > > hindi I want to use the basic white space analyzer as we dont
> have
> > > > > stemmers
> > > > > > for this as I mentioned earlier and whereever english appears
I
> > want
> > > > > them
> > > > > > to
> > > > > > be stemmed tokenized etc[the standard process used for english
> > > > content].
> > > > > As
> > > > > > of now I'm using whitespace analyzer for the full content which
> > > doesnot
> > > > > > support case folding, stemming etc for teh content. So if there
> is
> > an
> > > > > > english word say "Detection" indexed as such then searching
for
> > > > > detection
> > > > > > or
> > > > > > detect is not giving any results, which is the expected behavior,
> > but
> > > I
> > > > > > want
> > > > > > this kind of queries to give results.
> > > > > > I hope I made it clear. Let me know any ideas on doing the same.
> > And
> > > > one
> > > > > > more thing, I'm storing the full webpage content under a single
> > > field,
> > > > I
> > > > > > hope this will not make any difference, right?
> > > > > > It seems I've to use language identifiers, but do we really
need
> > > that?
> > > > > > Because we've only non-english content mixed with english[and
not
> > > > french
> > > > > or
> > > > > > russian etc].
> > > > > >
> > > > > > What is the best way of approaching the problem? Any thoughts!
> > > > > >
> > > > > > Thanks,
> > > > > > KK.
> > > > > >
> > > > > > On Wed, Jun 3, 2009 at 9:42 PM, Robert Muir <rcmuir@gmail.com>
> > > wrote:
> > > > > >
> > > > > > > KK, is all of your latin script text actually english?
Is there
> > > stuff
> > > > > > like
> > > > > > > german or french mixed in?
> > > > > > >
> > > > > > > And for your non-english content (your examples have been
> indian
> > > > > writing
> > > > > > > systems), is it generally true that if you had devanagari,
you
> > can
> > > > > assume
> > > > > > > its hindi? or is there stuff like marathi mixed in?
> > > > > > >
> > > > > > > Reason I say this is to invoke the right stemmers, you
really
> > need
> > > > > some
> > > > > > > language detection, but perhaps in your case you can cheat
and
> > > detect
> > > > > > this
> > > > > > > based on scripts...
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Robert
> > > > > > >
> > > > > > >
> > > > > > > On Wed, Jun 3, 2009 at 10:15 AM, KK <
> dioxide.software@gmail.com>
> > > > > wrote:
> > > > > > >
> > > > > > > > Hi All,
> > > > > > > > I'm indexing some non-english content. But the page
also
> > contains
> > > > > > english
> > > > > > > > content. As of now I'm using WhitespaceAnalyzer for
all
> content
> > > and
> > > > > I'm
> > > > > > > > storing the full webpage content under a single filed.
Now we
> > > > > require
> > > > > > to
> > > > > > > > support case folding and stemmming for the english
content
> > > > > intermingled
> > > > > > > > with
> > > > > > > > non-english content. I must metion that we dont have
stemming
> > and
> > > > > case
> > > > > > > > folding for these non-english content. I'm stuck with
this.
> > Some
> > > > one
> > > > > do
> > > > > > > let
> > > > > > > > me know how to proceed for fixing this issue.
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > KK.
> > > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > > Robert Muir
> > > > > > > rcmuir@gmail.com
> > > > > > >
> > > > > >
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Robert Muir
> > > > > rcmuir@gmail.com
> > > >
> > > >
> > > > ---------------------------------------------------------------------
> > > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > > For additional commands, e-mail: java-user-help@lucene.apache.org
> > > >
> > > >
> > >
> > >
> > > --
> > > Robert Muir
> > > rcmuir@gmail.com
> > >
> >
>
>
>
> --
> Robert Muir
> rcmuir@gmail.com
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message