lucene-java-user mailing list archives

From "Uwe Schindler" <...@thetaphi.de>
Subject RE: How to support stemming and case folding for english content mixed with non-english content?
Date Thu, 04 Jun 2009 11:51:00 GMT
You can also reuse the Solr analyzers, as far as I found out. There is a
JIRA issue and a discussion on java-dev about merging them.

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de


> -----Original Message-----
> From: Robert Muir [mailto:rcmuir@gmail.com]
> Sent: Thursday, June 04, 2009 1:18 PM
> To: java-user@lucene.apache.org
> Subject: Re: How to support stemming and case folding for english content
> mixed with non-english content?
> 
> KK, OK, so you really only want to stem the English. This is good.
> 
> Is it possible for you to consider using Solr? Solr's default analyzer for
> type 'text' will be good for your case. It will do the following:
> 1. tokenize on whitespace
> 2. handle both Indian-language and English punctuation
> 3. lowercase the English
> 4. stem the English
> 
> Try a nightly build: http://people.apache.org/builds/lucene/solr/nightly/
> 
> On Thu, Jun 4, 2009 at 1:12 AM, KK <dioxide.software@gmail.com> wrote:
> 
> > Muir, thanks for your response.
> > I'm indexing Indian-language web pages that have a decent amount of
> > English content mixed in. For the time being I'm not going to use any
> > stemmers for the Indian languages, as we don't have standard stemmers
> > for them. So what I want to do is this:
> > Say I have a web page with Hindi content and 5% English content. For
> > Hindi I want to use the basic whitespace analyzer, since we don't have
> > stemmers for it as I mentioned earlier, and wherever English appears I
> > want it to be tokenized, stemmed, etc. [the standard process used for
> > English content]. As of now I'm using the whitespace analyzer for the
> > full content, which does not support case folding, stemming, etc. So if
> > an English word, say "Detection", is indexed as such, then searching for
> > "detection" or "detect" gives no results. That is the expected behavior,
> > but I want such queries to return results.
> > I hope I made it clear. Let me know any ideas on doing this. One more
> > thing: I'm storing the full web page content in a single field. I hope
> > this will not make any difference, right?
> > It seems I have to use language identifiers, but do we really need that?
> > We only have non-English content mixed with English [and not French,
> > Russian, etc.].
> >
> > What is the best way of approaching the problem? Any thoughts?
> >
> > Thanks,
> > KK.
> >
> > On Wed, Jun 3, 2009 at 9:42 PM, Robert Muir <rcmuir@gmail.com> wrote:
> >
> > > KK, is all of your Latin-script text actually English? Is there stuff
> > > like German or French mixed in?
> > >
> > > And for your non-English content (your examples have been Indian
> > > writing systems), is it generally true that if you have Devanagari,
> > > you can assume it's Hindi? Or is there stuff like Marathi mixed in?
> > >
> > > The reason I ask is that to invoke the right stemmers, you really need
> > > some language detection, but perhaps in your case you can cheat and
> > > detect this based on scripts...
> > >
> > > Thanks,
> > > Robert
> > >
> > >
> > > On Wed, Jun 3, 2009 at 10:15 AM, KK <dioxide.software@gmail.com> wrote:
> > >
> > > > Hi All,
> > > > I'm indexing some non-English content, but the pages also contain
> > > > English content. As of now I'm using WhitespaceAnalyzer for all
> > > > content, and I'm storing the full web page content in a single
> > > > field. Now we need to support case folding and stemming for the
> > > > English content intermingled with the non-English content. I must
> > > > mention that we don't have stemming and case folding for the
> > > > non-English content. I'm stuck with this. Could someone let me know
> > > > how to proceed in fixing this issue?
> > > >
> > > > Thanks,
> > > > KK.
> > > >
> > >
> > >
> > >
> > > --
> > > Robert Muir
> > > rcmuir@gmail.com
> > >
> >
> 
> 
> 
> --
> Robert Muir
> rcmuir@gmail.com
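
The approach discussed in this thread (tokenize on whitespace, then lowercase
and stem only the tokens written in Latin script, leaving Indian-script tokens
untouched) can be sketched in plain Java. This is only an illustration under
stated assumptions, not the Solr 'text' analyzer and not Lucene API code: the
class MixedScriptAnalyzer and its methods are hypothetical names, toyStem is a
crude suffix stripper standing in for a real English stemmer such as Porter,
and punctuation handling is left out. Script detection uses the JDK's
Character.UnicodeScript (Java 7 and later).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration class; not part of Lucene or Solr.
public class MixedScriptAnalyzer {

    // True if the token has at least one letter and all its letters are Latin script.
    static boolean isLatin(String token) {
        boolean sawLetter = false;
        for (int i = 0; i < token.length(); ) {
            int cp = token.codePointAt(i);
            if (Character.isLetter(cp)) {
                if (Character.UnicodeScript.of(cp) != Character.UnicodeScript.LATIN) {
                    return false;
                }
                sawLetter = true;
            }
            i += Character.charCount(cp);
        }
        return sawLetter;
    }

    // Toy suffix stripper: a stand-in for a real English stemmer, NOT Porter.
    static String toyStem(String word) {
        String[] suffixes = {"ion", "ing", "ed", "s"};
        for (String s : suffixes) {
            // Keep at least three characters of the word after stripping.
            if (word.endsWith(s) && word.length() - s.length() >= 3) {
                return word.substring(0, word.length() - s.length());
            }
        }
        return word;
    }

    // Whitespace tokenization; case folding and stemming for Latin-script tokens only.
    static List<String> analyze(String text) {
        List<String> out = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue;
            }
            if (isLatin(token)) {
                out.add(toyStem(token.toLowerCase(Locale.ROOT)));
            } else {
                out.add(token); // e.g. Devanagari tokens pass through unchanged
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // \u0915\u093E is a short Devanagari token, escaped to keep the source ASCII-only.
        System.out.println(analyze("Detection \u0915\u093E detecting"));
    }
}
```

With the same analysis applied at index time and at query time, "Detection",
"detection" and "detect" all reduce to the same term in this sketch, which is
the matching behavior KK asks for. Detecting by script alone assumes, as
discussed above, that all Latin-script text in the pages is English.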



