Date: Thu, 4 Jun 2009 08:00:21 -0400
Subject: Re: How to support stemming and case folding for english content mixed with non-english content?
From: Robert Muir <rcmuir@gmail.com>
To: java-user@lucene.apache.org

Yes, this is true. For starters, KK, it might be good to start up Solr and look at
http://localhost:8983/solr/admin/analysis.jsp?highlight=on

If you want to stick with Lucene, WordDelimiterFilter is the piece you will want for your text, mainly for punctuation but also for format characters such as ZWJ/ZWNJ.

On Thu, Jun 4, 2009 at 7:51 AM, Uwe Schindler wrote:

> You can also re-use the Solr analyzers, as far as I found out. There is an
> issue in JIRA/discussion on java-dev to merge them.
>
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: uwe@thetaphi.de
>
> > -----Original Message-----
> > From: Robert Muir [mailto:rcmuir@gmail.com]
> > Sent: Thursday, June 04, 2009 1:18 PM
> > To: java-user@lucene.apache.org
> > Subject: Re: How to support stemming and case folding for english content
> > mixed with non-english content?
> >
> > KK, ok, so you only really want to stem the English. This is good.
> >
> > Is it possible for you to consider using Solr? Solr's default analyzer for
> > type 'text' will be good for your case. It will do the following:
> > 1. tokenize on whitespace
> > 2. handle both Indian-language and English punctuation
> > 3. lowercase the English
> > 4. stem the English
> >
> > Try a nightly build: http://people.apache.org/builds/lucene/solr/nightly/
> >
> > On Thu, Jun 4, 2009 at 1:12 AM, KK wrote:
> >
> > > Muir, thanks for your response.
> > > I'm indexing Indian-language web pages which have a decent amount of
> > > English content mixed in. For the time being I'm not going to use any
> > > stemmers, as we don't have standard stemmers for Indian languages. So
> > > what I want to do is this: say I have a web page of Hindi content with
> > > 5% English content. For the Hindi I want to use the basic whitespace
> > > analyzer, since we don't have stemmers for it as I mentioned earlier,
> > > and wherever English appears I want it to be stemmed, tokenized, etc.
> > > [the standard process used for English content]. As of now I'm using
> > > the whitespace analyzer for the full content, which does not support
> > > case folding, stemming, etc. So if an English word such as "Detection"
> > > is indexed as-is, then searching for "detection" or "detect" gives no
> > > results, which is the expected behavior, but I want these kinds of
> > > queries to give results.
> > > I hope I made it clear. Let me know any ideas on doing the same. And one
> > > more thing: I'm storing the full webpage content under a single field; I
> > > hope this will not make any difference, right?
> > > It seems I have to use language identifiers, but do we really need that?
> > > We have only one non-English language mixed with the English [and not
> > > French or Russian etc.].
> > >
> > > What is the best way of approaching the problem? Any thoughts?
> > >
> > > Thanks,
> > > KK.
> > >
> > > On Wed, Jun 3, 2009 at 9:42 PM, Robert Muir wrote:
> > >
> > > > KK, is all of your Latin-script text actually English? Is there stuff
> > > > like German or French mixed in?
> > > >
> > > > And for your non-English content (your examples have been Indian
> > > > writing systems), is it generally true that if you have Devanagari,
> > > > you can assume it's Hindi? Or is there stuff like Marathi mixed in?
> > > >
> > > > The reason I ask is that to invoke the right stemmers you really need
> > > > some language detection, but perhaps in your case you can cheat and
> > > > detect this based on scripts...
> > > >
> > > > Thanks,
> > > > Robert
> > > >
> > > > On Wed, Jun 3, 2009 at 10:15 AM, KK wrote:
> > > >
> > > > > Hi All,
> > > > > I'm indexing some non-English content. But the pages also contain
> > > > > English content. As of now I'm using WhitespaceAnalyzer for all
> > > > > content, and I'm storing the full webpage content under a single
> > > > > field. Now we need to support case folding and stemming for the
> > > > > English content intermingled with the non-English content. I must
> > > > > mention that we don't have stemming and case folding for the
> > > > > non-English content. I'm stuck with this. Could someone let me
> > > > > know how to proceed to fix this issue?
> > > > >
> > > > > Thanks,
> > > > > KK.
> > > >
> > > > --
> > > > Robert Muir
> > > > rcmuir@gmail.com
> >
> > --
> > Robert Muir
> > rcmuir@gmail.com
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org

-- 
Robert Muir
rcmuir@gmail.com
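[Editor's note] The script-based routing discussed in this thread can be sketched without any Lucene dependencies. The following is a minimal illustration, not the actual Solr 'text' analyzer: the class and helper names are hypothetical, and the crude `normalizeEnglish` step stands in for a real analyzer chain (whitespace tokenizer + WordDelimiterFilter + lowercase filter + a Porter stemmer, which would further reduce "detection" to "detect"). Tokens whose first letter falls in the Devanagari Unicode block pass through untouched; everything else is lowercased with punctuation and format characters (ZWJ/ZWNJ are not letters or digits) stripped.

```java
import java.util.ArrayList;
import java.util.List;

public class ScriptRoutingAnalyzer {

    // Treat a token as Devanagari if its first letter falls in the
    // Devanagari Unicode block -- the cheap per-token "language
    // detection by script" trick suggested in the thread.
    static boolean isDevanagari(String token) {
        for (int i = 0; i < token.length(); i++) {
            char c = token.charAt(i);
            if (Character.isLetter(c)) {
                return Character.UnicodeBlock.of(c) == Character.UnicodeBlock.DEVANAGARI;
            }
        }
        return false;
    }

    // Very rough English normalization: keep only letters and digits
    // (dropping punctuation and format characters such as ZWJ/ZWNJ)
    // and lowercase them. A real pipeline would also stem here.
    static String normalizeEnglish(String token) {
        StringBuilder sb = new StringBuilder();
        for (char c : token.toCharArray()) {
            if (Character.isLetterOrDigit(c)) {
                sb.append(Character.toLowerCase(c));
            }
        }
        return sb.toString();
    }

    // Whitespace-tokenize, then route each token by script:
    // Devanagari passes through unchanged (ZWJ/ZWNJ are meaningful
    // there), Latin-script tokens get lowercased and stripped.
    static List<String> analyze(String text) {
        List<String> out = new ArrayList<>();
        for (String tok : text.split("\\s+")) {
            if (tok.isEmpty()) continue;
            out.add(isDevanagari(tok) ? tok : normalizeEnglish(tok));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(analyze("Detection, of \u0939\u093f\u0928\u094d\u0926\u0940 mixed text"));
    }
}
```

With this routing, "Detection," indexes as "detection", so a query for "detection" matches even though the Devanagari portion of the page was left untouched; per-field analyzers or Solr's field types would be the production-grade version of the same idea.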