lucene-java-user mailing list archives

From Cool Coder <techcool.ku...@yahoo.com>
Subject Re: Best way to count tokens
Date Fri, 02 Nov 2007 15:13:46 GMT
This works, and I can reuse the token streams. But why did TokenStream.reset() not work in
my earlier case? Is it a marker method in TokenStream with no implementation, which
CachingTokenFilter then implements?
   
  - BR


Mark Miller <markrmiller@gmail.com> wrote:
  reset() is optional. StandardAnalyzer does not implement it. Check out 
CachingTokenFilter and wrap the token stream from StandardAnalyzer in it.
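
To make the point above concrete, here is a minimal plain-Java sketch of why an optional, do-nothing reset() leaves a consumed stream empty, and how a caching wrapper makes a second pass possible. Everything in it (Tokens, OnePassTokens, CachingTokens, drain) is a hypothetical stand-in written for illustration; it is not the Lucene API and does not reproduce how CachingTokenFilter is actually implemented.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

class ReplayableTokensSketch {

    /** Hypothetical stand-in for a token stream; not the Lucene API. */
    interface Tokens {
        String next();   // next token, or null when exhausted
        void reset();    // optional operation: may or may not rewind
    }

    /** One-pass source whose reset() is a no-op, so a second pass yields nothing. */
    static class OnePassTokens implements Tokens {
        private final Iterator<String> source;

        OnePassTokens(List<String> tokens) {
            this.source = tokens.iterator();
        }

        public String next() {
            return source.hasNext() ? source.next() : null;
        }

        public void reset() {
            // intentionally empty: the underlying input cannot be rewound
        }
    }

    /** Wrapper that records tokens on the first pass and replays them after reset(). */
    static class CachingTokens implements Tokens {
        private final Tokens input;
        private List<String> cache;   // filled lazily on the first call to next()
        private int position;

        CachingTokens(Tokens input) {
            this.input = input;
        }

        public String next() {
            if (cache == null) {
                cache = new ArrayList<String>();
                for (String t = input.next(); t != null; t = input.next()) {
                    cache.add(t);
                }
            }
            return position < cache.size() ? cache.get(position++) : null;
        }

        public void reset() {
            position = 0;   // rewind to the start of the cached tokens
        }
    }

    /** Reads a stream to exhaustion and returns the tokens it produced. */
    static List<String> drain(Tokens stream) {
        List<String> out = new ArrayList<String>();
        for (String t = stream.next(); t != null; t = stream.next()) {
            out.add(t);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("best", "way", "to", "count", "tokens");

        Tokens plain = new OnePassTokens(words);
        drain(plain);                       // first pass, e.g. for counting
        plain.reset();                      // no-op
        System.out.println(drain(plain));   // prints []

        Tokens cached = new CachingTokens(new OnePassTokens(words));
        drain(cached);                      // first pass fills the cache
        cached.reset();                     // rewinds the cache
        System.out.println(drain(cached));  // prints [best, way, to, count, tokens]
    }
}
```

The trade-off in this sketch is memory for CPU: the wrapper holds every token of the field in memory so that the input only has to be analyzed once.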

Cool Coder wrote:
> Currently I have extended StandardAnalyzer and am counting tokens in the following way.
> But the index is not getting created, even though I call tokenStream.reset(). I am not sure
> whether reset() on a token stream works or not. I am debugging now.
> 
> public TokenStream tokenStream(String fieldName, Reader reader) {
>     TokenStream result = super.tokenStream(fieldName, new HTMLStripReader(reader));
>     // To count tokens and put them in a Map
>     analyzeTokens(result);
>     try {
>         result.reset();
>     } catch (IOException e) {
>         // TODO Auto-generated catch block
>         e.printStackTrace();
>     }
>     return result;
> }
> 
> public void analyzeTokens(TokenStream result)
> {
>     try {
>         Token token = result.next();
>         while (token != null)
>         {
>             String tokenStr = token.termText();
>             if (TokenHolder.tokenMap.get(tokenStr) == null)
>             {
>                 TokenHolder.tokenMap.put(tokenStr, 1);
>             }
>             else
>             {
>                 TokenHolder.tokenMap.put(tokenStr, Integer.parseInt(TokenHolder.tokenMap.get(tokenStr).toString()) + 1);
>             }
>             token = result.next();
>         }
>         // extra reset
>         result.reset();
>     } catch (IOException e) {
>         e.printStackTrace();
>     }
> }
> 
>
> Karl Wettin wrote:
> 
> On 1 Nov 2007, at 18:09, Cool Coder wrote:
>
> 
>> prior to adding into index
>> 
>
> Easiest way out would be to add the document to a temporary index and 
> extract the term frequency vector. I would recommend using MemoryIndex.
>
> You could also tokenize the document and pass the data to a 
> TermVectorMapper. You could consider replacing the fields of the 
> document with CachedTokenStreams if you have the RAM to spare and 
> don't want to waste CPU analyzing the document twice. I welcome 
> TermVectorMappingCachedTokenStreamFactory. Even cooler would be to 
> pass code down the IndexWriter.addDocument using a command pattern or 
> something, allowing one to extend the document at the time of the 
> analysis.
>
>
> 
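
As a side note on the counting loop quoted earlier in this thread, the per-token tally can be written more compactly with a typed map. The following is only a sketch under stated assumptions: the class TokenCountSketch and its countTokens method are hypothetical names, and the token list is a stand-in for whatever the analyzer produces, so no Lucene classes are involved.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class TokenCountSketch {

    /**
     * Counts how often each token occurs. The Iterable stands in for
     * analyzer output; it is an assumption, not a Lucene type.
     */
    static Map<String, Integer> countTokens(Iterable<String> tokens) {
        // LinkedHashMap keeps first-seen order, which makes the output easy to read
        Map<String, Integer> counts = new LinkedHashMap<String, Integer>();
        for (String token : tokens) {
            // insert 1 for a new token, otherwise add 1 to the existing count
            counts.merge(token, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("to", "be", "or", "not", "to", "be");
        System.out.println(countTokens(tokens)); // prints {to=2, be=2, or=1, not=1}
    }
}
```

Keeping the counts as Integer values avoids the round trip through toString() and Integer.parseInt() in the quoted code. Map.merge requires Java 8 or later.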
