From: Robert Muir
To: java-user@lucene.apache.org
Date: Thu, 4 Jun 2009 08:49:23 -0400
Subject: Re: How to support stemming and case folding for english content mixed with non-english content?

KK,

For your case, you don't really need to go to the effort of detecting
whether fragments are English or not, because the English stemmers in
Lucene will not modify your Indic text, and neither will the
LowerCaseFilter.

What you want to do is create a custom analyzer that works like this:

WhitespaceTokenizer with WordDelimiterFilter [from the Solr nightly jar],
LowerCaseFilter, StopFilter, and PorterStemFilter

Thanks,
Robert

On Thu, Jun 4, 2009 at 8:28 AM, KK wrote:

> Thank you all.
> To be frank, I was using Solr in the beginning, half a month ago. The
> problem [rather, bug] with Solr was creating a new index on the fly. Though
> they have a RESTful method for the same, it was not working.
> If I remember properly, one of the Solr committers, "Noble Paul" [I don't
> know his real name], was trying to help me. I tried many nightly builds, and
> spending a couple of days stuck at that made me think of Lucene, so I
> switched to it.
> Now, after working with Lucene, which gives you full control of everything,
> I don't want to switch to Solr. [LOL, to me Solr:Lucene is similar to
> Window$:Linux; it's my view only, though.] Coming back to the point: as Uwe
> mentioned, we can do the same thing in Lucene that is available in Solr.
> Solr is based on Lucene only, right?
> I request Uwe to give me some more ideas on using the analyzers from Solr
> that will do the job for me, handling a mix of both English and non-English
> content.
> Muir, can you give me a more detailed description of how to use the
> WordDelimiterFilter to do my job?
> On a side note, I was thinking of writing a simple analyzer that will do
> the following:
> # If the webpage fragment is non-English [for me it's some Indian language],
> then index it as such, with no stemming or stop word removal to begin with.
> As far as I know it's in UCN Unicode, something like
> \u0021\u0012\u34ae\u0031 [just a sample].
> # If the fragment is English, then apply the standard analysis process for
> English content. I've not thought about querying in the same way as of now,
> i.e. with a mix of non-English and English words.
> Now to get all this,
> #1. I need some way to know whether the content is English or not. If it is
> not English, just add the tokens to the document. Do we really need language
> identifiers, as I don't have any other content that uses the same script as
> English, other than those \u1234 things for my Indian language content? Any
> smart hack/trick for the same?
> #2. If it's English, apply all the normal processing and add the stemmed
> tokens to the document.
> For all this I was thinking of iterating over each word of the web page and
> applying the above procedure.
> And finally, add the newly created document to the index.
>
> I would like someone to guide me in this direction. I'm pretty sure people
> must have done the same or a similar thing earlier; I request them to guide
> me or point me to some tutorials for the same.
> Else, help me out with writing a custom analyzer, but only if that's not
> going to be too complex. LOL, I'm a new user of Lucene and know the basics
> of Java coding.
> Thank you very much.
>
> --KK.
>
> On Thu, Jun 4, 2009 at 5:30 PM, Robert Muir wrote:
>
> > Yes, this is true. For starters, KK, it might be good to start up Solr
> > and look at
> > http://localhost:8983/solr/admin/analysis.jsp?highlight=on
> >
> > If you want to stick with Lucene, the WordDelimiterFilter is the piece
> > you will want for your text, mainly for punctuation but also for format
> > characters such as ZWJ/ZWNJ.
> >
> > On Thu, Jun 4, 2009 at 7:51 AM, Uwe Schindler wrote:
> >
> > > You can also re-use the Solr analyzers, as far as I found out. There
> > > is an issue in JIRA / a discussion on java-dev to merge them.
> > >
> > > -----
> > > Uwe Schindler
> > > H.-H.-Meier-Allee 63, D-28213 Bremen
> > > http://www.thetaphi.de
> > > eMail: uwe@thetaphi.de
> > >
> > > > -----Original Message-----
> > > > From: Robert Muir [mailto:rcmuir@gmail.com]
> > > > Sent: Thursday, June 04, 2009 1:18 PM
> > > > To: java-user@lucene.apache.org
> > > > Subject: Re: How to support stemming and case folding for english
> > > > content mixed with non-english content?
> > > >
> > > > KK, OK, so you only really want to stem the English. This is good.
> > > >
> > > > Is it possible for you to consider using Solr? Solr's default
> > > > analyzer for type 'text' will be good for your case. It will do the
> > > > following:
> > > > 1. tokenize on whitespace
> > > > 2. handle both Indian language and English punctuation
> > > > 3. lowercase the English
> > > > 4. stem the English
> > > >
> > > > Try a nightly build:
> > > > http://people.apache.org/builds/lucene/solr/nightly/
> > > >
> > > > On Thu, Jun 4, 2009 at 1:12 AM, KK wrote:
> > > >
> > > > > Muir, thanks for your response.
> > > > > I'm indexing Indian language web pages which have a decent amount
> > > > > of English content mixed in. For the time being I'm not going to
> > > > > use any stemmers, as we don't have standard stemmers for Indian
> > > > > languages. So what I want to do is this:
> > > > > Say I have a web page with Hindi content and 5% English content.
> > > > > For the Hindi I want to use the basic whitespace analyzer, as we
> > > > > don't have stemmers for it, as I mentioned earlier, and wherever
> > > > > English appears I want it to be stemmed, tokenized, etc. [the
> > > > > standard process used for English content]. As of now I'm using
> > > > > the whitespace analyzer for the full content, which does not
> > > > > support case folding, stemming, etc. So if there is an English
> > > > > word, say "Detection", indexed as such, then searching for
> > > > > detection or detect does not give any results. That is the
> > > > > expected behavior, but I want this kind of query to give results.
> > > > > I hope I made it clear. Let me know any ideas on doing the same.
> > > > > One more thing: I'm storing the full webpage content under a
> > > > > single field; I hope this will not make any difference, right?
> > > > > It seems I have to use language identifiers, but do we really need
> > > > > that? We only have non-English content mixed with English [and not
> > > > > French or Russian etc.].
> > > > >
> > > > > What is the best way of approaching the problem? Any thoughts?
> > > > >
> > > > > Thanks,
> > > > > KK.
> > > > >
> > > > > On Wed, Jun 3, 2009 at 9:42 PM, Robert Muir wrote:
> > > > >
> > > > > > KK, is all of your Latin script text actually English? Is there
> > > > > > stuff like German or French mixed in?
> > > > > >
> > > > > > And for your non-English content (your examples have been Indian
> > > > > > writing systems), is it generally true that if you have
> > > > > > Devanagari, you can assume it's Hindi? Or is there stuff like
> > > > > > Marathi mixed in?
> > > > > >
> > > > > > The reason I ask is that to invoke the right stemmers, you really
> > > > > > need some language detection, but perhaps in your case you can
> > > > > > cheat and detect this based on scripts...
> > > > > >
> > > > > > Thanks,
> > > > > > Robert
> > > > > >
> > > > > > On Wed, Jun 3, 2009 at 10:15 AM, KK wrote:
> > > > > >
> > > > > > > Hi All,
> > > > > > > I'm indexing some non-English content, but the pages also
> > > > > > > contain English content. As of now I'm using WhitespaceAnalyzer
> > > > > > > for all content and I'm storing the full webpage content under
> > > > > > > a single field. Now we are required to support case folding and
> > > > > > > stemming for the English content intermingled with the
> > > > > > > non-English content. I must mention that we don't have stemming
> > > > > > > and case folding for the non-English content. I'm stuck with
> > > > > > > this. Could someone let me know how to proceed in fixing this
> > > > > > > issue?
> > > > > > >
> > > > > > > Thanks,
> > > > > > > KK.

--
Robert Muir
rcmuir@gmail.com
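
For illustration, here is a rough, dependency-free Java sketch of why the
chain suggested at the top of this message (whitespace tokenizing,
lowercasing, stop word removal, English stemming) can be applied to mixed
content without any language detection. This is not Lucene or Solr code.
The class name MixedScriptChain, the tiny stop list, and the toyStem method
are all hypothetical. toyStem is a crude stand-in for PorterStemFilter and is
not the Porter algorithm. The punctuation handling done by
WordDelimiterFilter is not modeled at all.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/**
 * Hypothetical, simplified model of the analysis chain. It only shows that
 * lowercasing and English suffix stripping leave non-Latin tokens untouched.
 */
class MixedScriptChain {

    // Tiny illustrative stop list (an assumption, not Lucene's default set).
    private static final Set<String> STOP_WORDS =
            Set.of("a", "an", "the", "is", "of", "and");

    // Toy stand-in for an English stemmer: strips a few common suffixes.
    // Tokens that are not purely a-z (Indic text, digits) are returned as-is.
    static String toyStem(String token) {
        if (!token.matches("[a-z]+")) {
            return token;
        }
        for (String suffix : new String[] {"ion", "ing", "ed", "s"}) {
            int stemLength = token.length() - suffix.length();
            if (token.endsWith(suffix) && stemLength >= 3) {
                return token.substring(0, stemLength);
            }
        }
        return token;
    }

    // Whitespace split, then lowercase, then stop word removal, then stemming.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.trim().split("\\s+")) {
            if (raw.isEmpty()) {
                continue;
            }
            // Scripts without case, such as Devanagari, are unchanged here.
            String lower = raw.toLowerCase(Locale.ROOT);
            if (STOP_WORDS.contains(lower)) {
                continue;
            }
            tokens.add(toyStem(lower));
        }
        return tokens;
    }

    public static void main(String[] args) {
        // The escaped token is a Devanagari word, written with \\u escapes
        // so the source file stays ASCII.
        String hindi = "\u0939\u093f\u0928\u094d\u0926\u0940";
        System.out.println(analyze("The Detection of " + hindi));
    }
}
```

In this sketch "Detection" and "detection" both reduce to the same token,
which is the behavior KK asked for, while the Devanagari token passes
through every step unchanged. With the real Lucene classes the same idea
would be expressed by wrapping the tokenizer in the filters named above
inside a custom Analyzer; the exact constructor signatures depend on the
Lucene and Solr versions in use.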