Message-ID: <4E9E97C4.3010503@fastmail.fm>
Date: Wed, 19 Oct 2011 10:26:28 +0100
From: Paul Taylor
Reply-To: paul_t100@fastmail.fm
To: java-user@lucene.apache.org
CC: Steven A Rowe
Subject: Re: How do you see if a tokenstream has tokens without consuming the tokens ?
In-Reply-To: <6C78E97C707B5B4C8CC61D44F875458603292C@SUEX10-mbx-03.ad.syr.edu>

On 18/10/2011 15:25, Steven A Rowe wrote:
> Hi Paul,
>
> On 10/18/2011 at 4:57 AM, Paul Taylor wrote:
>> On 18/10/2011 06:19, Steven A Rowe wrote:
>>> Another option is to create a char filter that substitutes
>>> PUNCT-EXCLAMATION for exclamation points, PUNCT-PERIOD for periods,
>>> etc.,
>> Yes that is how I first did it
> No, I don't think you did.
> When I say "char filter" I'm referring to CharFilter - this is a
> different kind of thing from the token filter approach you described
> taking previously.

If you look at the code you can see I do use a CharFilter:

NormalizeCharMap specialcharConvertMap = new NormalizeCharMap();
specialcharConvertMap.add("!", "Exclamation");
specialcharConvertMap.add("?", "QuestionMark");
...............

public TokenStream tokenStream(String fieldName, Reader reader) {
    CharFilter specialCharFilter = new MappingCharFilter(specialcharConvertMap, reader);
    // First try tokenizing the input without the character mapping.
    StandardTokenizer tokenStream = new StandardTokenizer(LuceneVersion.LUCENE_VERSION, reader);
    try {
        if (!tokenStream.incrementToken()) {
            // No tokens were produced, so tokenize through the char filter instead.
            tokenStream = new StandardTokenizer(LuceneVersion.LUCENE_VERSION, specialCharFilter);
        } else {
            // TODO **************** set tokenstream back as it was before incrementToken
        }
    } catch (IOException ioe) {
    }
    TokenStream result = new LowerCaseFilter(LuceneVersion.LUCENE_VERSION, tokenStream);
    return result;
}

>
> If you go with a CharFilter, you can give it access to the entire input
> at once, and use a regular expression (or something like it) to assess
> the input and then behave accordingly.
>
> Steve
>
Well, this is the problem: you can't use a regular expression, or even if
you did, that would really slow things down, wouldn't it, seeing as 99%
don't need the transformation.

Paul

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
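
One way to read Steve's suggestion of assessing the whole input before choosing a behaviour is sketched below in plain Java, with no Lucene classes. It buffers the input once, does a single linear scan (no regular expression) for any letter or digit, and applies the substitutions only when none is found. Everything here is an assumption for illustration: the class name PunctuationFallback and its methods are hypothetical, the two substitutions mirror the mappings in the code above, and the letter-or-digit test is only a rough stand-in for "StandardTokenizer would emit at least one token". The Reader returned by prepare() could then be handed to the tokenizer, so nothing has to be un-consumed.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: decides up front whether the input needs the
// punctuation substitutions, instead of probing a TokenStream.
final class PunctuationFallback {

    // Assumed substitutions, mirroring the two mappings quoted in the mail.
    private static final Map<Character, String> SUBSTITUTIONS = new LinkedHashMap<>();
    static {
        SUBSTITUTIONS.put('!', "Exclamation");
        SUBSTITUTIONS.put('?', "QuestionMark");
    }

    // Rough approximation (an assumption): input containing any letter or
    // digit is treated as something the tokenizer would produce tokens for.
    static boolean looksTokenizable(String text) {
        for (int i = 0; i < text.length(); ) {
            int codePoint = text.codePointAt(i);
            if (Character.isLetterOrDigit(codePoint)) {
                return true;
            }
            i += Character.charCount(codePoint);
        }
        return false;
    }

    // Replaces each mapped character with its word; other characters are kept.
    static String substitute(String text) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            String replacement = SUBSTITUTIONS.get(c);
            if (replacement != null) {
                out.append(replacement);
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    // Returns the text unchanged in the common case; substitutes only when
    // the single scan finds nothing tokenizable.
    static String effectiveText(String text) {
        return looksTokenizable(text) ? text : substitute(text);
    }

    // Buffers the whole Reader once, then returns a fresh Reader over the
    // text that should actually be tokenized.
    static Reader prepare(Reader reader) throws IOException {
        StringBuilder buffered = new StringBuilder();
        char[] chunk = new char[1024];
        int read;
        while ((read = reader.read(chunk)) != -1) {
            buffered.append(chunk, 0, read);
        }
        return new StringReader(effectiveText(buffered.toString()));
    }
}
```

With this sketch, ordinary input pays only for one pass over its characters, and the scan stops at the first letter or digit. The trade-off is that the whole field value is held in memory, which is an assumption that suits short fields but not large documents.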