From: Paul Taylor
Reply-To: paul_t100@fastmail.fm
Date: Tue, 18 Oct 2011 10:57:15 +0200
CC: "java-user@lucene.apache.org"
Subject: Re: How do you see if a tokenstream has tokens without consuming the tokens ?
Message-ID: <4E9D3F6B.6080009@fastmail.fm>
References: <4E9C1BCB.7080900@fastmail.fm> <6C78E97C707B5B4C8CC61D44F8754586032615@SUEX10-mbx-03.ad.syr.edu>
In-Reply-To: <6C78E97C707B5B4C8CC61D44F8754586032615@SUEX10-mbx-03.ad.syr.edu>

On 18/10/2011 06:19, Steven A Rowe wrote:
> Hi Paul,
>
> You could add a rule to the StandardTokenizer JFlex grammar to handle
> this case, bypassing its other rules.

Hmm, I don't really understand JFlex, but that is a possibility. I would prefer to do this in Java code unless JFlex is easy to use.

> Another option is to create a char filter that substitutes
> PUNCT-EXCLAMATION for exclamation points, PUNCT-PERIOD for periods, etc.,

Yes, that is how I first did it,

> but only when the entire input consists exclusively of whitespace and
> punctuation.

but I couldn't work out how to do it only when the input is exclusively whitespace and punctuation. Any ideas on how to solve that?

> These symbols would then be left intact by StandardTokenizer.
>
> Steve

Paul
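
One possible way to approach the "only when exclusively whitespace and punctuation" condition, as a rough sketch rather than a tested Lucene solution: read the whole input into a String first, check it, and only then substitute. The class name, method names, and the replacement table below are hypothetical; only PUNCT-EXCLAMATION and PUNCT-PERIOD come from the thread. The sketch also assumes that "punctuation" means any character that is neither a letter, a digit, nor whitespace. It uses plain Java only; wiring it into a char filter (which would have to buffer its Reader before deciding) is not shown.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Lucene.
class PunctuationOnlySubstituter {

    // Illustrative table; extend with whatever symbols need to survive tokenization.
    private static final Map<Character, String> NAMES = new LinkedHashMap<>();
    static {
        NAMES.put('!', "PUNCT-EXCLAMATION");
        NAMES.put('.', "PUNCT-PERIOD");
    }

    // True when the input contains no letters or digits and at least one
    // non-whitespace character (assumption: that is what "punctuation only" means).
    static boolean isPunctuationOnly(String input) {
        boolean sawPunctuation = false;
        for (int i = 0; i < input.length(); ) {
            int cp = input.codePointAt(i);
            if (Character.isLetterOrDigit(cp)) {
                return false;
            }
            if (!Character.isWhitespace(cp)) {
                sawPunctuation = true;
            }
            i += Character.charCount(cp);
        }
        return sawPunctuation;
    }

    // Replaces mapped punctuation with its name, but only for punctuation-only input;
    // any other input is returned unchanged.
    static String substitute(String input) {
        if (!isPunctuationOnly(input)) {
            return input;
        }
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            String name = NAMES.get(c);
            if (name != null) {
                // Pad with spaces so adjacent symbols do not run together.
                out.append(' ').append(name).append(' ');
            } else {
                out.append(c);
            }
        }
        return out.toString().trim();
    }
}
```

With this sketch, substitute("Hello!") returns the input untouched because it contains letters, while substitute("!") returns the placeholder name. The cost of the approach is that the full input has to be held in memory before any character is emitted, which is probably acceptable for short fields but is a design trade-off for large ones.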