Date: Thu, 4 Jun 2009 08:00:21 -0400
Subject: Re: How to support stemming and case folding for english content mixed with non-english content?
From: Robert Muir <rcmuir@gmail.com>
To: java-user@lucene.apache.org

Yes, this is true. For starters, KK, it might be good to start up Solr and look at
http://localhost:8983/solr/admin/analysis.jsp?highlight=on

If you want to stick with Lucene, WordDelimiterFilter is the piece you will want for your text, mainly for punctuation but also for format characters such as ZWJ/ZWNJ.

On Thu, Jun 4, 2009 at 7:51 AM, Uwe Schindler wrote:

> You can also re-use the Solr analyzers, as far as I found out. There is an
> issue in JIRA/discussion on java-dev to merge them.
>
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: uwe@thetaphi.de
>
> > -----Original Message-----
> > From: Robert Muir [mailto:rcmuir@gmail.com]
> > Sent: Thursday, June 04, 2009 1:18 PM
> > To: java-user@lucene.apache.org
> > Subject: Re: How to support stemming and case folding for english content
> > mixed with non-english content?
> >
> > KK, ok, so you only really want to stem the English. This is good.
> >
> > Is it possible for you to consider using Solr? Solr's default analyzer for
> > type 'text' will be good for your case. It will do the following:
> > 1. tokenize on whitespace
> > 2. handle both Indian-language and English punctuation
> > 3. lowercase the English
> > 4. stem the English
> >
> > Try a nightly build: http://people.apache.org/builds/lucene/solr/nightly/
> >
> > On Thu, Jun 4, 2009 at 1:12 AM, KK wrote:
> >
> > > Muir, thanks for your response.
> > > I'm indexing Indian-language web pages which have a decent amount of
> > > English content mixed in. For the time being I'm not going to use any
> > > stemmers, as we don't have standard stemmers for Indian languages. So
> > > what I want to do is this: say I have a web page of Hindi content with
> > > 5% English content. For the Hindi I want to use the basic whitespace
> > > analyzer, since we don't have stemmers for it as I mentioned earlier,
> > > and wherever English appears I want it to be stemmed, tokenized, etc.
> > > [the standard process used for English content]. As of now I'm using
> > > the whitespace analyzer for the full content, which does not support
> > > case folding, stemming, etc. So if an English word such as "Detection"
> > > is indexed as-is, then searching for "detection" or "detect" gives no
> > > results, which is the expected behavior, but I want these kinds of
> > > queries to give results.
> > > I hope I made it clear. Let me know any ideas on doing the same. And one
> > > more thing: I'm storing the full webpage content under a single field; I
> > > hope this will not make any difference, right?
> > > It seems I have to use language identifiers, but do we really need that?
> > > We have only one non-English language mixed with the English [and not
> > > French or Russian etc.].
> > >
> > > What is the best way of approaching the problem? Any thoughts?
> > >
> > > Thanks,
> > > KK.
> > >
> > > On Wed, Jun 3, 2009 at 9:42 PM, Robert Muir wrote:
> > >
> > > > KK, is all of your Latin-script text actually English? Is there stuff
> > > > like German or French mixed in?
> > > >
> > > > And for your non-English content (your examples have been Indian
> > > > writing systems), is it generally true that if you have Devanagari,
> > > > you can assume it's Hindi? Or is there stuff like Marathi mixed in?
> > > >
> > > > The reason I ask is that to invoke the right stemmers you really need
> > > > some language detection, but perhaps in your case you can cheat and
> > > > detect this based on scripts...
> > > >
> > > > Thanks,
> > > > Robert
> > > >
> > > > On Wed, Jun 3, 2009 at 10:15 AM, KK wrote:
> > > >
> > > > > Hi All,
> > > > > I'm indexing some non-English content. But the pages also contain
> > > > > English content. As of now I'm using WhitespaceAnalyzer for all
> > > > > content, and I'm storing the full webpage content under a single
> > > > > field. Now we need to support case folding and stemming for the
> > > > > English content intermingled with the non-English content. I must
> > > > > mention that we don't have stemming and case folding for the
> > > > > non-English content. I'm stuck with this. Could someone let me
> > > > > know how to proceed to fix this issue?
> > > > >
> > > > > Thanks,
> > > > > KK.
> > > >
> > > > --
> > > > Robert Muir
> > > > rcmuir@gmail.com
> >
> > --
> > Robert Muir
> > rcmuir@gmail.com
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org

-- 
Robert Muir
rcmuir@gmail.com
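[Editor's note] The script-based routing discussed in this thread can be sketched without any Lucene dependencies. The following is a minimal illustration, not the actual Solr 'text' analyzer: the class and helper names are hypothetical, and the crude `normalizeEnglish` step stands in for a real analyzer chain (whitespace tokenizer + WordDelimiterFilter + lowercase filter + a Porter stemmer, which would further reduce "detection" to "detect"). Tokens whose first letter falls in the Devanagari Unicode block pass through untouched; everything else is lowercased with punctuation and format characters (ZWJ/ZWNJ are not letters or digits) stripped.

```java
import java.util.ArrayList;
import java.util.List;

public class ScriptRoutingAnalyzer {

    // Treat a token as Devanagari if its first letter falls in the
    // Devanagari Unicode block -- the cheap per-token "language
    // detection by script" trick suggested in the thread.
    static boolean isDevanagari(String token) {
        for (int i = 0; i < token.length(); i++) {
            char c = token.charAt(i);
            if (Character.isLetter(c)) {
                return Character.UnicodeBlock.of(c) == Character.UnicodeBlock.DEVANAGARI;
            }
        }
        return false;
    }

    // Very rough English normalization: keep only letters and digits
    // (dropping punctuation and format characters such as ZWJ/ZWNJ)
    // and lowercase them. A real pipeline would also stem here.
    static String normalizeEnglish(String token) {
        StringBuilder sb = new StringBuilder();
        for (char c : token.toCharArray()) {
            if (Character.isLetterOrDigit(c)) {
                sb.append(Character.toLowerCase(c));
            }
        }
        return sb.toString();
    }

    // Whitespace-tokenize, then route each token by script:
    // Devanagari passes through unchanged (ZWJ/ZWNJ are meaningful
    // there), Latin-script tokens get lowercased and stripped.
    static List<String> analyze(String text) {
        List<String> out = new ArrayList<>();
        for (String tok : text.split("\\s+")) {
            if (tok.isEmpty()) continue;
            out.add(isDevanagari(tok) ? tok : normalizeEnglish(tok));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(analyze("Detection, of \u0939\u093f\u0928\u094d\u0926\u0940 mixed text"));
    }
}
```

With this routing, "Detection," indexes as "detection", so a query for "detection" matches even though the Devanagari portion of the page was left untouched; per-field analyzers or Solr's field types would be the production-grade version of the same idea.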