From: "Is, Studcio"
Subject: RE: Inconsistent tokenizing of words containing underscores.
Date: Tue, 30 Aug 2005 12:52:29 +0200

Hello,

first of all, thanks to everyone for the replies and suggestions. I solved my problem by adapting StandardTokenizer.jj and recompiling it with javacc. I replaced line 90:

  <ALPHANUM: (<LETTER>|<DIGIT>)+ >

with

  <ALPHANUM: (<LETTER>|<DIGIT>|"_")+ >

so that the underscore is treated like an alphanumeric character. In my first tests it seems to work perfectly.

Still, I can't understand how the described behaviour could be the expected behaviour. I couldn't find the relevant documentation in the javacc source of the tokenizer either. I suppose the source of the problem with underscores lies in the definition of NUM (floating point, serial, model numbers, ip addresses, etc.).

Either way, I guess my problem is solved.

Thanks again and regards
Sebastian

-----Original Message-----
From: Aigner, Thomas [mailto:TAigner@WescoDist.com]
Sent: Tuesday, August 30, 2005 12:12 AM
To: java-user@lucene.apache.org
Subject: RE: Inconsistent tokenizing of words containing underscores.

What seems to be working for me is a punctuation filter that removes characters such as / - _ and builds the token without them. Then "most" of the time the word XYZZZY_DE_SA0001 will be tokenized as XYZZZYDESA0001. For this to work, you will have to apply the same punctuation filter to the strings before you search for them.

Tom

-----Original Message-----
From: Daniel Naber [mailto:lucenelist@danielnaber.de]
Sent: Monday, August 29, 2005 3:15 PM
To: java-user@lucene.apache.org
Subject: Re: Inconsistent tokenizing of words containing underscores.

On Monday 29 August 2005 19:21, Jeremy Meyer wrote:

> The expected behavior is to sometimes treat a character as indicating a
> new token and other times to ignore the same character?

It depends on whether there are digits in the token. It's documented in the javacc source for the tokenizer(?).

Regards
 Daniel

-- 
http://www.danielnaber.de

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
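
For readers of this thread, the two workarounds discussed above (adding "_" to the ALPHANUM rule, and stripping punctuation at both index and query time) can be roughly illustrated in plain Java. This is only a sketch under stated assumptions: it uses java.util.regex rather than any Lucene class, the class name UnderscoreTokenSketch and its methods are hypothetical, and the regular expressions merely approximate the LETTER and DIGIT definitions of the grammar. It does not model the tokenizer's other rules (such as NUM), so it will not reproduce the inconsistent behaviour itself.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: plain Java, no Lucene classes involved.
final class UnderscoreTokenSketch {

    // Rough stand-in for the original ALPHANUM rule: runs of letters and digits.
    static final Pattern ALPHANUM = Pattern.compile("[\\p{L}\\p{Nd}]+");

    // Rough stand-in for the patched rule: "_" is added to the character class.
    static final Pattern ALPHANUM_WITH_UNDERSCORE = Pattern.compile("[\\p{L}\\p{Nd}_]+");

    // Characters dropped by the (hypothetical) punctuation filter: / - _
    static final Pattern PUNCTUATION = Pattern.compile("[/_-]");

    // Collect every maximal match of the given rule as one token.
    static List<String> tokens(Pattern rule, String text) {
        List<String> out = new ArrayList<>();
        Matcher m = rule.matcher(text);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    // Apply this to indexed text AND to query strings so both sides agree.
    static String stripPunctuation(String text) {
        return PUNCTUATION.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        String word = "XYZZZY_DE_SA0001";
        System.out.println(tokens(ALPHANUM, word));                 // [XYZZZY, DE, SA0001]
        System.out.println(tokens(ALPHANUM_WITH_UNDERSCORE, word)); // [XYZZZY_DE_SA0001]
        System.out.println(stripPunctuation(word));                 // XYZZZYDESA0001
    }
}
```

Whichever approach is chosen, the same normalization has to be used when indexing and when searching, otherwise a query for XYZZZY_DE_SA0001 would not match the stored token.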