lucene-dev mailing list archives

From Robert Muir <rcm...@gmail.com>
Subject Re: pieces missing in reusable analyzers?
Date Mon, 10 Aug 2009 22:23:25 GMT
Also, FYI, if you are testing this with Solr or whatever, I want to
warn you that LUCENE-1794 also includes implementations of
reset(Reader) and reset() for tokenizers and filters that did not have
them before (e.g. CJK).

So it is not enough to reuse at the analyzer level: streams that keep
state really need to implement reset() and zero their offsets (or do
whatever else they should), or you will get strange results.
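To make that concrete, here is a self-contained sketch using a stand-in
class (MiniTokenizer is a made-up name, not the real
org.apache.lucene.analysis API): a tokenizer that tracks a character
offset keeps state, so its reset(Reader) must zero that offset or the
reused instance reports stale values.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Stand-in for a Lucene Tokenizer (hypothetical class; the real API differs):
// splits on whitespace and tracks a character offset, i.e. it keeps state.
class MiniTokenizer {
    private Reader input;
    private int offset;               // state that MUST be zeroed on reuse

    MiniTokenizer(Reader input) { this.input = input; }

    /** Re-point at a new Reader AND clear internal state. */
    void reset(Reader input) {
        this.input = input;
        this.offset = 0;              // omit this and reuse gives strange results
    }

    int currentOffset() { return offset; }

    /** Next whitespace-delimited token, or null at end of input. */
    String next() {
        try {
            StringBuilder sb = new StringBuilder();
            int c;
            while ((c = input.read()) != -1 && !Character.isWhitespace((char) c)) {
                sb.append((char) c);
                offset++;
            }
            if (c != -1) offset++;    // account for the separator character
            return sb.length() == 0 ? null : sb.toString();
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }
}
```

Reusing the instance is then just tokenizer.reset(new StringReader(nextDoc));
a filter that buffers tokens (an n-gram filter, say) would analogously clear
its buffer in a no-arg reset().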

On Mon, Aug 10, 2009 at 6:18 PM, Uwe Schindler<uwe@thetaphi.de> wrote:
> You have to reuse the TokenStream and also its root Tokenizer to get access
> to the Reader. This is what Robert's latest patch does with a helper
> class.
>
> Implementing reset(Reader) in TokenStream is somehow wrong: there may be
> TokenStreams that have no Readers at all (NumericTokenStream). Readers are
> only known to Tokenizers.
>
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: uwe@thetaphi.de
>
>> -----Original Message-----
>> From: yseeley@gmail.com [mailto:yseeley@gmail.com] On Behalf Of Yonik
>> Seeley
>> Sent: Tuesday, August 11, 2009 12:10 AM
>> To: java-dev@lucene.apache.org
>> Subject: pieces missing in reusable analyzers?
>>
>> I had thought that implementing reusable analyzers in solr was going
>> to be cake... but either I'm missing something, or Lucene is missing
>> something.
>>
>> Here's the way that one used to create custom analyzers:
>>
>> class CustomAnalyzer extends Analyzer {
>>   public TokenStream tokenStream(String fieldName, Reader reader) {
>>     return new LowerCaseFilter(
>>         new NGramTokenFilter(
>>             new StandardTokenizer(reader)));
>>   }
>> }
>>
>>
>> Now let's try to make this reusable:
>>
>> class CustomAnalyzer2 extends Analyzer {
>>   public TokenStream tokenStream(String fieldName, Reader reader) {
>>     return new LowerCaseFilter(
>>         new NGramTokenFilter(
>>             new StandardTokenizer(reader)));
>>   }
>>
>>   @Override
>>   public TokenStream reusableTokenStream(String fieldName, Reader reader)
>>       throws IOException {
>>     TokenStream ts = getPreviousTokenStream();
>>     if (ts == null) {
>>       ts = tokenStream(fieldName, reader);
>>       setPreviousTokenStream(ts);
>>       return ts;
>>     } else {
>>       // uh... how do I reset a token stream?
>>       return ts;
>>     }
>>   }
>> }
>>
>>
>> See the missing piece?  Seems like TokenStream needs a reset(Reader r)
>> method or something?
>>
>> -Yonik
>> http://www.lucidimagination.com
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-dev-help@lucene.apache.org
>
>
>
>
>
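For the archive, the helper-class idea Uwe describes above can be sketched
the same way (again with stand-in names, not Lucene's real classes): the
analyzer saves both the root tokenizer and the outermost filter, so reuse
hands the new Reader to the tokenizer and lets the filter chain reset its
own state.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Hypothetical stand-ins (not org.apache.lucene.analysis) sketching the
// "saved streams" helper pattern.
class MiniWhitespaceTokenizer {
    private Reader input;
    MiniWhitespaceTokenizer(Reader input) { this.input = input; }
    void reset(Reader input) { this.input = input; }   // only tokenizers see Readers
    String next() {
        try {
            StringBuilder sb = new StringBuilder();
            int c;
            while ((c = input.read()) != -1 && !Character.isWhitespace((char) c)) {
                sb.append((char) c);
            }
            return sb.length() == 0 ? null : sb.toString();
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }
}

class MiniLowerCaseFilter {
    private final MiniWhitespaceTokenizer in;
    MiniLowerCaseFilter(MiniWhitespaceTokenizer in) { this.in = in; }
    void reset() {}                                    // no buffered state here
    String next() {
        String t = in.next();
        return t == null ? null : t.toLowerCase();
    }
}

class MiniAnalyzer {
    // Helper holding the pieces reuse needs: the root tokenizer (which gets
    // the new Reader) and the outer stream (which callers consume).
    private static class SavedStreams {
        MiniWhitespaceTokenizer source;
        MiniLowerCaseFilter result;
    }
    private SavedStreams saved;    // per-thread in real Lucene; one field here

    MiniLowerCaseFilter reusableTokenStream(Reader reader) {
        if (saved == null) {
            saved = new SavedStreams();
            saved.source = new MiniWhitespaceTokenizer(reader);
            saved.result = new MiniLowerCaseFilter(saved.source);
        } else {
            saved.source.reset(reader);  // re-point the root tokenizer
            saved.result.reset();        // let the filter chain clear its state
        }
        return saved.result;
    }
}
```

The second call returns the same filter instance, now reading from the new
Reader, which is exactly the piece missing from reusableTokenStream above.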



-- 
Robert Muir
rcmuir@gmail.com


