From: Michael Böckling
To: java-user@lucene.apache.org
Subject: Re: Modifying StandardAnalyzer so that it also splits words after punctuation characters that are not followed by whitespace
Date: Wed, 30 May 2007 11:57:49 +0200

Ok, I've followed your advice and commented out some lines in the NUM
section. It now works as expected and does what I wanted. It looks
scary, but isn't that bad.

Thanks!

Regards,
Michael

> -----Original Message-----
> From: Steven Rowe [mailto:sarowe@syr.edu]
> Sent: Tuesday, 29 May 2007 19:54
> To: java-user@lucene.apache.org
> Subject: Re: Modifying StandardAnalyzer so that it also splits words
> after punctuation characters that are not followed by whitespace
>
>
> Hi Michael,
>
> Michael Böckling wrote:
> > Hi folks!
> >
> > The topic says it all: I want to modify the StandardAnalyzer so that it also
> > splits words after punctuation characters (.,: etc.) that are NOT followed
> > by a whitespace character, in addition to punctuation characters that ARE
> > followed by whitespace.
> >
> > Of course I've looked at StandardTokenizer.jj, but I don't quite get it. The
> > recursive nature of the grammar bends my mind.
> >
> > Can someone smarter than me help here?
>
> Um, that probably disqualifies me, but anyway...
>
> There are several regexes in StandardTokenizer.jj that generate tokens
> containing punctuation. You should be able to selectively comment them
> out to achieve what you want:
>
> 1. Acronyms:
>
> | <ACRONYM: <ALPHA> "." (<ALPHA> ".")+ >
>
> 2. Company names:
>
> | <COMPANY: <ALPHA> ("&"|"@") <ALPHA> >
>
> 3. Email addresses:
>
> | <EMAIL: <ALPHANUM> (("."|"-"|"_") <ALPHANUM>)* "@" <ALPHANUM>
> (("."|"-") <ALPHANUM>)+ >
>
> 4. Hostnames:
>
> | <HOST: <ALPHANUM> ("." <ALPHANUM>)+ >
>
> 5. The <NUM>, <P> and <HAS_DIGIT> regexes, for IP addresses, etc.:
>
> | <NUM: (<ALPHANUM> <P> <HAS_DIGIT>
>        | <HAS_DIGIT> <P> <ALPHANUM>
>        | <ALPHANUM> (<P> <HAS_DIGIT> <P> <ALPHANUM>)+
>        | <HAS_DIGIT> (<P> <ALPHANUM> <P> <HAS_DIGIT>)+
>        | <ALPHANUM> <P> <HAS_DIGIT> (<P> <ALPHANUM> <P> <HAS_DIGIT>)+
>        | <HAS_DIGIT> <P> <ALPHANUM> (<P> <HAS_DIGIT> <P> <ALPHANUM>)+
>         )
>   >
> | <#P: ("_"|"-"|"/"|"."|",") >
> | <#HAS_DIGIT:                                      // at least one digit
>     (<LETTER>|<DIGIT>)*
>     <DIGIT>
>     (<LETTER>|<DIGIT>)*
>   >
>
>
> Steve
>
> --
> Steve Rowe
> Center for Natural Language Processing
> http://www.cnlp.org/tech/lucene.asp
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
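
For anyone reading this thread later, the behaviour asked for above
(punctuation always ends a token, whether or not whitespace follows) can
be sketched in plain Java without touching the grammar. This is only an
illustration of the target behaviour, not the code StandardTokenizer
generates: the class name PunctuationSplitDemo and the method tokenize
are made up for this example, nothing from Lucene is used, and unlike
StandardTokenizer it has no special handling for apostrophes and does no
lowercasing.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-alone demo; not part of Lucene.
class PunctuationSplitDemo {

    // Collects maximal runs of letters and digits. Every other character
    // (whitespace, '.', ',', ':', '@', '-', ...) ends the current token,
    // so "foo.bar" yields two tokens just like "foo. bar" does.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetterOrDigit(c)) {
                current.append(c);
            } else if (current.length() > 0) {
                // A non-alphanumeric character closes the pending token.
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        // Flush the last token if the text ends in a letter or digit.
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("See 3.2,then mail foo.bar@example.com"));
    }
}
```

With the NUM, HOST and EMAIL style rules out of the picture, an input
such as "3.2,then" comes out as the three tokens 3, 2 and then, which is
the kind of split the original question was after.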