lucene-java-user mailing list archives

From Robert Muir <rcm...@gmail.com>
Subject Re: How to support stemming and case folding for english content mixed with non-english content?
Date Thu, 04 Jun 2009 11:18:20 GMT
KK, OK, so you only really want to stem the English. This is good.

Is it possible for you to consider using Solr? Solr's default analyzer for
type 'text' will be good for your case. It will do the following:
1. tokenize on whitespace
2. handle both Indian-language and English punctuation
3. lowercase the English
4. stem the English

Try a nightly build: http://people.apache.org/builds/lucene/solr/nightly/
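
If you'd rather stay on plain Lucene, the equivalent chain would look
roughly like the sketch below (untested, written against a recent Lucene
API; the analyzer name is made up, and Solr's punctuation handling via
WordDelimiterFilter is left out):

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.en.PorterStemFilter;

// Whitespace-tokenize everything, then lowercase and Porter-stem.
// Devanagari has no upper/lower case and Porter's suffix rules only
// match ASCII letters, so Hindi tokens pass through effectively
// unchanged while English tokens get case-folded and stemmed.
public final class MixedScriptAnalyzer extends Analyzer {
  @Override
  protected TokenStreamComponents createComponents(String fieldName) {
    Tokenizer source = new WhitespaceTokenizer();
    TokenStream result = new LowerCaseFilter(source);
    result = new PorterStemFilter(result);
    return new TokenStreamComponents(source, result);
  }
}

With an analyzer like this at both index and query time, "Detection" is
indexed as "detect", so searches for "detection" or "detect" will match.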

On Thu, Jun 4, 2009 at 1:12 AM, KK <dioxide.software@gmail.com> wrote:

> Muir, thanks for your response.
> I'm indexing Indian-language web pages that have a decent amount of
> English content mixed in. For the time being I'm not going to use any
> stemmers, as we don't have standard stemmers for Indian languages. So
> what I want to do is this:
> Say I have a web page with Hindi content plus 5% English content. For
> the Hindi I want to use the basic whitespace analyzer, since we don't
> have stemmers for it, as I mentioned earlier, and wherever English
> appears I want it to be stemmed, tokenized, etc. [the standard process
> used for English content]. As of now I'm using the whitespace analyzer
> for the full content, which does not support case folding, stemming,
> etc. So if there is an English word, say "Detection", indexed as such,
> then searching for "detection" or "detect" gives no results. That is
> the expected behavior, but I want these kinds of queries to give
> results.
> I hope I made it clear. Let me know any ideas on how to do this. And
> one more thing: I'm storing the full webpage content under a single
> field; I hope this will not make any difference, right?
> It seems I have to use language identifiers, but do we really need
> that? Because we have only non-English content mixed with English [and
> not French or Russian, etc.].
>
> What is the best way of approaching the problem? Any thoughts?
>
> Thanks,
> KK.
>
>
> On Wed, Jun 3, 2009 at 9:42 PM, Robert Muir <rcmuir@gmail.com> wrote:
>
> > KK, is all of your Latin-script text actually English? Is there
> > stuff like German or French mixed in?
> >
> > And for your non-English content (your examples have been Indian
> > writing systems), is it generally true that if you have Devanagari,
> > you can assume it's Hindi? Or is there stuff like Marathi mixed in?
> >
> > The reason I ask is that to invoke the right stemmers you really
> > need some language detection, but perhaps in your case you can cheat
> > and detect this based on scripts...
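> >
> > For example, a crude first cut could just check whether a token
> > contains any Devanagari code points (a hypothetical helper, not
> > tested):
> >
> >   // route a token to the Hindi or English side of the chain;
> >   // Devanagari is a single Unicode block, so one pass over the
> >   // chars is enough for this two-script case
> >   static boolean isDevanagari(String token) {
> >     for (int i = 0; i < token.length(); i++) {
> >       if (Character.UnicodeBlock.of(token.charAt(i))
> >           == Character.UnicodeBlock.DEVANAGARI) {
> >         return true;
> >       }
> >     }
> >     return false;
> >   }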
> >
> > Thanks,
> > Robert
> >
> >
> > On Wed, Jun 3, 2009 at 10:15 AM, KK <dioxide.software@gmail.com> wrote:
> >
> > > Hi All,
> > > I'm indexing some non-English content, but the pages also contain
> > > English content. As of now I'm using WhitespaceAnalyzer for all
> > > content, and I'm storing the full webpage content under a single
> > > field. Now we need to support case folding and stemming for the
> > > English content intermingled with the non-English content. I should
> > > mention that we don't have stemming and case folding for this
> > > non-English content. I'm stuck on this. Could someone let me know
> > > how to proceed to fix this issue?
> > >
> > > Thanks,
> > > KK.
> > >
> >
> >
> >
> > --
> > Robert Muir
> > rcmuir@gmail.com
> >
>



-- 
Robert Muir
rcmuir@gmail.com
