From: tareque@controldocs.com
To: java-user@lucene.apache.org
Subject: Re: Changing the Punctuation definition for StandardAnalyzer
Date: Thu, 20 Dec 2007 15:32:16 -0600 (CST)

Karl,

I should have mentioned earlier that I am on Lucene 1.9.1. I had in fact
already located the grammar in StandardTokenizer.jj (I just wasn't sure it
was the one you were talking about) and had commented out the EMAIL entries
in all of the following files:

StandardTokenizer.java
StandardTokenizer.jj
StandardTokenizerConstants.java

But evidently the tokenizer still expected the email addresses to match one
of the other TOKEN types, and since they matched none of them, it threw a
ParseException.

What puzzles me now is this: I don't see the '@' sign (Unicode value 0040)
included in "LETTER" or in any other definition, so why is it not splitting
the words? It certainly isn't, which is why the tokenizer expects the email
address to be defined as a token type.

My understanding, from reading the code, is that any character not defined
in the grammar acts as a splitter, since it does not contribute to any TOKEN
definition. Please let me know what I am missing.

Thanks,
Tareque

> On 20 Dec 2007, at 20:21, tareque@controldocs.com wrote:
>
>> I would rather like to modify the lexer grammar. But exactly where it is
>> defined. After having a quick look, seems like
>> StandardTokenizerTokenManager.java may be where it is being done.
>
> http://svn.apache.org/repos/asf/lucene/java/trunk/src/java/org/apache/lucene/analysis/standard/StandardTokenizerImpl.jflex
>
> It can be generated with the Ant build.
>
> --
> karl
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
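
The mental model in the message above (characters that belong to no TOKEN definition act as splitters) can be illustrated with a toy longest-match lexer. This is only a sketch under stated assumptions: it is not Lucene code, the class `ToyLexer` and its two regex rules are hypothetical, and the real StandardTokenizer.jj grammar has more token types than the two shown here. It shows that a rule set containing an EMAIL-style pattern keeps an address together, while a rule set without it lets '@' and '.' separate the words.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Toy longest-match lexer for illustration only; NOT the Lucene grammar.
class ToyLexer {
    // Hypothetical, simplified patterns (assumptions, not copied from Lucene).
    static final String ALPHANUM = "[A-Za-z0-9]+";
    static final String EMAIL =
        ALPHANUM + "(?:[._-]" + ALPHANUM + ")*@" + ALPHANUM + "(?:[.-]" + ALPHANUM + ")+";

    // Build the rule set, optionally including the EMAIL rule.
    static Map<String, Pattern> rules(boolean withEmail) {
        Map<String, Pattern> rules = new LinkedHashMap<>();
        rules.put("ALPHANUM", Pattern.compile(ALPHANUM));
        if (withEmail) {
            rules.put("EMAIL", Pattern.compile(EMAIL));
        }
        return rules;
    }

    // At each position, pick the rule with the longest match starting there.
    // A character that starts no match is skipped, so it acts as a separator.
    static List<String> tokenize(String text, Map<String, Pattern> rules) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < text.length()) {
            String bestType = null;
            int bestEnd = pos;
            for (Map.Entry<String, Pattern> rule : rules.entrySet()) {
                Matcher m = rule.getValue().matcher(text);
                m.region(pos, text.length());
                if (m.lookingAt() && m.end() > bestEnd) {
                    bestType = rule.getKey();
                    bestEnd = m.end();
                }
            }
            if (bestType == null) {
                pos++; // no rule matches here: skip the character
            } else {
                tokens.add(bestType + ":" + text.substring(pos, bestEnd));
                pos = bestEnd;
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "mail tareque@controldocs.com now";
        System.out.println(tokenize(text, rules(true)));
        System.out.println(tokenize(text, rules(false)));
    }
}
```

In this toy model, '@' only splits once every rule that mentions it is gone from the lexer itself. If the same holds for the generated token manager, one thing worth checking is whether the generated sources were actually regenerated from the edited grammar, rather than edited by hand.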