lucene-java-user mailing list archives

From: Mark Miller <markrmil...@gmail.com>
Subject: Re: Best way to count tokens
Date: Fri, 02 Nov 2007 00:04:00 GMT
reset() is optional; the TokenStream that StandardAnalyzer produces does not implement it. Check out
CachingTokenFilter and wrap the StandardAnalyzer stream in it.
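
For example, something roughly like this (a rough sketch only, untested; CountingAnalyzer and the counts map are made-up names, and it assumes the Token next()/termText() API used in the code below):

import java.io.IOException;
import java.io.Reader;
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.analysis.CachingTokenFilter;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

// Wrap the analyzer's stream in CachingTokenFilter so it can be consumed
// twice: once here for counting, and again by the IndexWriter after reset().
public class CountingAnalyzer extends StandardAnalyzer {

    public final Map<String, Integer> counts = new HashMap<String, Integer>();

    public TokenStream tokenStream(String fieldName, Reader reader) {
        TokenStream cached = new CachingTokenFilter(super.tokenStream(fieldName, reader));
        try {
            for (Token t = cached.next(); t != null; t = cached.next()) {
                String term = t.termText();
                Integer old = counts.get(term);
                counts.put(term, old == null ? Integer.valueOf(1) : Integer.valueOf(old.intValue() + 1));
            }
            // CachingTokenFilter does implement reset(), so the cached tokens
            // are replayed when the IndexWriter consumes the returned stream.
            cached.reset();
        } catch (IOException e) {
            e.printStackTrace();
        }
        return cached;
    }
}

The important part is returning the same CachingTokenFilter instance you counted from, so the second pass comes out of the cache instead of the already-exhausted underlying tokenizer.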

Cool Coder wrote:
> Currently I have extended StandardAnalyzer and am counting tokens in the following way,
> but the index is not getting created even though I call tokenStream.reset(). I am not sure
> whether reset() on the token stream works or not; I am debugging it now.
>    
>   public TokenStream tokenStream(String fieldName, Reader reader) {
>       TokenStream result = super.tokenStream(fieldName, new HTMLStripReader(reader));
>       // To count tokens and put in a Map
>       analyzeTokens(result);
>       try {
>           result.reset();
>       } catch (IOException e) {
>           // TODO Auto-generated catch block
>           e.printStackTrace();
>       }
>       return result;
>   }
>
>   public void analyzeTokens(TokenStream result) {
>       try {
>           Token token = result.next();
>           while (token != null) {
>               String tokenStr = token.termText();
>               if (TokenHolder.tokenMap.get(tokenStr) == null) {
>                   TokenHolder.tokenMap.put(tokenStr, 1);
>               } else {
>                   TokenHolder.tokenMap.put(tokenStr,
>                           Integer.parseInt(TokenHolder.tokenMap.get(tokenStr).toString()) + 1);
>               }
>               token = result.next();
>           }
>           // extra reset
>           result.reset();
>       } catch (IOException e) {
>           e.printStackTrace();
>       }
>   }
>   
>
> Karl Wettin <karl.wettin@gmail.com> wrote:
>   
> On 1 Nov 2007, at 18.09, Cool Coder wrote:
>
>   
>> prior to adding into index
>>     
>
> The easiest way out would be to add the document to a temporary index and
> extract the term frequency vector. I would recommend using MemoryIndex.
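
For illustration, a rough sketch of that approach (untested; it assumes the contrib MemoryIndex exposes term vectors through its reader, and the "content" field name is arbitrary):

import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.TermFreqVector;
import org.apache.lucene.index.memory.MemoryIndex;

// Analyze the text into a throwaway in-memory index, then read the
// term frequencies back out of it before doing the real indexing.
public class MemoryIndexCounter {

    public static Map<String, Integer> count(String text) throws Exception {
        MemoryIndex index = new MemoryIndex();
        index.addField("content", text, new StandardAnalyzer());

        IndexReader reader = index.createSearcher().getIndexReader();
        TermFreqVector vector = reader.getTermFreqVector(0, "content");

        Map<String, Integer> counts = new HashMap<String, Integer>();
        String[] terms = vector.getTerms();
        int[] freqs = vector.getTermFrequencies();
        for (int i = 0; i < terms.length; i++) {
            counts.put(terms[i], Integer.valueOf(freqs[i]));
        }
        return counts;
    }
}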
>
> You could also tokenize the document and pass the data to a
> TermVectorMapper. You could consider replacing the fields of the
> document with CachedTokenStreams if you have the RAM to spare and
> don't want to waste CPU analyzing the document twice. I would welcome
> a TermVectorMappingCachedTokenStreamFactory. Even cooler would be to
> pass code down through IndexWriter.addDocument using a command pattern or
> something, allowing one to extend the document at analysis time.
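
A counting mapper along those lines might look roughly like this (a sketch only; CountingTermVectorMapper is a made-up name, and the setExpectations/map signatures are taken from the TermVectorMapper class as I remember it):

import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.index.TermVectorMapper;
import org.apache.lucene.index.TermVectorOffsetInfo;

// Accumulates term -> frequency pairs as term vector data is pushed
// through the mapper callbacks.
public class CountingTermVectorMapper extends TermVectorMapper {

    private final Map<String, Integer> counts = new HashMap<String, Integer>();

    public void setExpectations(String field, int numTerms,
                                boolean storeOffsets, boolean storePositions) {
        // nothing to pre-allocate for this simple counter
    }

    public void map(String term, int frequency,
                    TermVectorOffsetInfo[] offsets, int[] positions) {
        counts.put(term, Integer.valueOf(frequency));
    }

    public Map<String, Integer> getCounts() {
        return counts;
    }
}

You could then either hand it to IndexReader.getTermFreqVector(docNumber, field, mapper) on a stored term vector, or drive map() yourself from your own tokenization as Karl suggests.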
>
>
>   

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

