Message-ID: <4E9E97C4.3010503@fastmail.fm>
Date: Wed, 19 Oct 2011 10:26:28 +0100
From: Paul Taylor <paul_t100@fastmail.fm>
Reply-To: paul_t100@fastmail.fm
To: java-user@lucene.apache.org
CC: Steven A Rowe <sarowe@syr.edu>
Subject: Re: How do you see if a tokenstream has tokens without consuming
 the tokens ?

On 18/10/2011 15:25, Steven A Rowe wrote:
> Hi Paul,
>
> On 10/18/2011 at 4:57 AM, Paul Taylor wrote:
>> On 18/10/2011 06:19, Steven A Rowe wrote:
>>> Another option is to create a char filter that substitutes
>>> PUNCT-EXCLAMATION for exclamation points, PUNCT-PERIOD for periods,
>>> etc.,
>> Yes that is how I first did it
> No, I don't think you did.  When I say "char filter" I'm referring to CharFilter<http://lucene.apache.org/java/3_4_0/api/core/org/apache/lucene/analysis/CharFilter.html>  - this is a different kind of thing from the token filter approach you described taking previously.
If you look at the code you can see I do use a CharFilter:

NormalizeCharMap specialcharConvertMap = new NormalizeCharMap();
     specialcharConvertMap.add("!", "Exclamation");
     specialcharConvertMap.add("?","QuestionMark");
     ...............

     public  TokenStream tokenStream(String fieldName, Reader reader) {
         CharFilter specialCharFilter = new MappingCharFilter(specialcharConvertMap, reader);

         // First try tokenizing the raw input, without the char mapping
         StandardTokenizer tokenStream = new StandardTokenizer(LuceneVersion.LUCENE_VERSION, reader);
         try
         {
             if(tokenStream.incrementToken()==false)
             {
                 // No tokens: tokenize again with the special characters mapped to words
                 tokenStream = new StandardTokenizer(LuceneVersion.LUCENE_VERSION, specialCharFilter);
             }
             else
             {
                 //TODO **************** set tokenstream back as it was before increment token
             }
         }
         catch(IOException ioe)
         {
             // exception is swallowed here
         }
         TokenStream result = new LowerCaseFilter(LuceneVersion.LUCENE_VERSION, tokenStream);
         return result;
     }


>
> If you go with a CharFilter, you can give it access to the entire input at once, and use a regular expression (or something like it) to assess the input and then behave accordingly.
>
> Steve
>
Well, this is the problem: you can't use a regular expression, and even
if you could, wouldn't that really slow things down, seeing as 99% of
inputs don't need the transformation?
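
One way to avoid both the regular expression and the "un-consume the token" problem might be to buffer the input once, scan it with a plain character loop, and then hand a fresh Reader to whichever tokenizer setup is needed. The sketch below uses only the JDK. The class InputPeek and its methods are hypothetical names, not part of Lucene, and it assumes that "input with no letter or digit" is a good enough stand-in for "StandardTokenizer would produce no tokens", which is only an approximation of the real tokenizer rules.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;

// Hypothetical helper, not part of Lucene: buffers the input once so it can be
// inspected cheaply and then replayed from the start.
final class InputPeek {

    private InputPeek() {}

    /** Reads the whole Reader into a String so it can be examined and re-read. */
    static String readAll(Reader reader) {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[1024];
        try {
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            // Rethrown unchecked so callers without a throws clause can use this.
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    /**
     * Assumption: text containing at least one letter or digit will yield a
     * token, and text without any will not. The loop stops at the first hit,
     * so ordinary input is rejected for mapping after a few characters.
     */
    static boolean hasLetterOrDigit(String text) {
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (Character.isLetterOrDigit(cp)) {
                return true;
            }
            i += Character.charCount(cp);
        }
        return false;
    }
}
```

Inside tokenStream() the idea would be to call readAll(reader) once, wrap the resulting String in a new StringReader, and wrap that in the MappingCharFilter only when hasLetterOrDigit() returns false. Since the scan stops at the first letter or digit, the common case should cost very little, at the price of holding the field text in memory.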

Paul
