Return-Path: Delivered-To: apmail-lucene-java-user-archive@www.apache.org Received: (qmail 88967 invoked from network); 23 May 2007 17:07:46 -0000 Received: from hermes.apache.org (HELO mail.apache.org) (140.211.11.2) by minotaur.apache.org with SMTP; 23 May 2007 17:07:46 -0000 Received: (qmail 89203 invoked by uid 500); 23 May 2007 17:07:44 -0000 Delivered-To: apmail-lucene-java-user-archive@lucene.apache.org Received: (qmail 88979 invoked by uid 500); 23 May 2007 17:07:44 -0000 Mailing-List: contact java-user-help@lucene.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: java-user@lucene.apache.org Delivered-To: mailing list java-user@lucene.apache.org Received: (qmail 88949 invoked by uid 99); 23 May 2007 17:07:44 -0000 Received: from herse.apache.org (HELO herse.apache.org) (140.211.11.133) by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 23 May 2007 10:07:43 -0700 X-ASF-Spam-Status: No, hits=0.0 required=10.0 tests= X-Spam-Check-By: apache.org Received-SPF: pass (herse.apache.org: local policy) Received: from [128.230.18.29] (HELO mailer.syr.edu) (128.230.18.29) by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 23 May 2007 10:07:36 -0700 Received: from [128.230.84.100] (ist-h335-d03.syr.edu) by mailer.syr.edu (LSMTP for Windows NT v1.1b) with SMTP id <0.163BEFC1@mailer.syr.edu>; Wed, 23 May 2007 13:07:15 -0400 Message-ID: <4654747C.80005@syr.edu> Date: Wed, 23 May 2007 13:06:04 -0400 From: Steven Rowe User-Agent: Mail/News 1.5.0.4-GroupWise-IMAP-fix (Windows/20060619) MIME-Version: 1.0 To: java-user@lucene.apache.org Subject: WhitespaceAnalyzer [was: Re: regaridng Reader.terms()] References: <34b8543c0705220229v362aa368je4a1ce8633a60ff5@mail.gmail.com> <4652C220.7090505@ecomware.it> <34b8543c0705220318i2acd86f3y210dede1960ea5a5@mail.gmail.com> <465303E2.6020802@syr.edu> <34b8543c0705222238x3536464enec89a65969fdc7a8@mail.gmail.com> In-Reply-To: <34b8543c0705222238x3536464enec89a65969fdc7a8@mail.gmail.com> X-Enigmail-Version: 0.94.3.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit X-Virus-Checked: Checked by ClamAV on apache.org Hi Mohammad, WhitespaceAnalyzer uses Java's Character.isWhitespace(char) method to determine whether or not a character should be part of a token. As far as I know, this method is problematic only for characters outside of the Basic Multilingual Plane (BMP). I think Lucene should switch to using the int-based character methods, to support characters outside of the BMP. Arabic characters are all in the BMP[1][2][3][4], so this method should work properly. The only Arabic or Persian characters I can find outside of the BMP are in the Old Persian block[5], [ U+103A0 - U+103DF ], but this is ancient cuneiform - are you really dealing with digital cuneiform documents? I suspect that you are using WhitespaceAnalyzer as the basis for a more sophisticated tokenizer - if this is true, you may want to check out the tokenizer in the AraMorph project[6][7]. (The AraMorph stemmer probably will not serve your needs, though, since Persian and Arabic have different lexicons and grammars.) Hope it helps, Steve [1] Arabic: http://www.unicode.org/charts/PDF/U0600.pdf [2] Arabic Supplement: http://www.unicode.org/charts/PDF/U0750.pdf [3] Arabic Presentation Forms A: http://www.unicode.org/charts/PDF/UFB50.pdf [4] Arabic Presentation Forms B: http://www.unicode.org/charts/PDF/UFE70.pdf [5] Old Persian: http://www.unicode.org/charts/PDF/U103A0.pdf [6] AraMorph: http://www.nongnu.org/aramorph/ [7] ArabicTokenizer for Lucene: http://www.nongnu.org/aramorph/javadoc/gpl/pierrick/brihaye/aramorph/lucene/ArabicTokenizer.html Mohammad Norouzi wrote: > Hi Steve, > No I didn't make any change on WhiteSpaceAnalyzer I just extends my classes > from the original classes and then override my new changes. so I dont think > I should to contribute my classes. > > and my language is Persian, and only change I've made is not to ignoring > unicode characters in Persian and arabic language, because with original > WhitespaceAnalyzer it didnt work fine whether it ignore or something > else, I > dont know but I extends my classes and now I am using my analyzer to index. > > On 5/22/07, Steven Rowe wrote: >> >> Hi Mohammad, >> >> May I ask what your language is? And what kind of changes to >> WhitespaceAnalyzer were required to make it work with your language? >> >> If you have made modifications to WhitespaceAnalyzer that are generally >> useful, please consider contributing your changes back to the Lucene >> project. There is some info here on how to get started: >> >> http://wiki.apache.org/jakarta-lucene/HowToContribute >> >> Thanks, >> Steve >> >> Mohammad Norouzi wrote: >> > Walter, >> > Yes I am using a customized WhiteSpaceAnalyzer while indexing. >> > I said customized because I realized that standard WhiteSpaceAnalyzer >> dont >> > accept unicode terms in my language so I make some change to support >> that. >> > >> > but for reading no Analyzer is used >> > >> > if I want to get that result, which analyzer should I use? >> > >> > in my case, I dont need any boost factor or any other feature of >> lucene, >> I >> > need just searching through the index. >> > >> > >> > On 5/22/07, Walter Ferrara wrote: >> >> >> >> If Reader.terms() gives you: >> >> text3 >> >> text4 >> >> while you expect >> >> text3 text4 >> >> >> >> you should change, I presume, the Analyzer, maybe writing your own >> one. >> >> >> >> Mohammad Norouzi wrote: >> >> > Hi all >> >> > >> >> > consider following index >> >> > >> >> > field1 field2 field3 >> >> > text1 text1 text2 text3 text4 >> >> > text4 text2 text2 text3 text5 >> >> > >> >> > I want to get all terms in filed3 >> >> > if I use Reader.terms() it will returns: (however i have to put >> an if >> >> > statement to filter result of the field3 only) >> >> > text3 >> >> > text4 >> >> > text2 >> >> > text5 >> >> > >> >> > but I need following result: >> >> > "text3 text4" >> >> > "text2 text3 text5" >> >> > >> >> > >> >> > is this possible? if yes, how? and if no, is there any tricky way to >> >> get >> >> > this result? >> >> > >> >> > thank you so much. --------------------------------------------------------------------- To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org For additional commands, e-mail: java-user-help@lucene.apache.org