lucene-java-user mailing list archives

From Cool Coder <techcool.ku...@yahoo.com>
Subject Re: Best way to count tokens
Date Thu, 01 Nov 2007 21:02:42 GMT
Currently I have extended StandardAnalyzer and am counting tokens in the following way. However,
the index is not getting created, even though I call tokenStream.reset(). I am not sure whether
reset() on a token stream actually rewinds it. I am debugging now.
   
  public TokenStream tokenStream(String fieldName, Reader reader) {
      TokenStream result = super.tokenStream(fieldName, new HTMLStripReader(reader));
      // Count tokens and put them in a Map
      analyzeTokens(result);
      try {
          result.reset();
      } catch (IOException e) {
          // TODO Auto-generated catch block
          e.printStackTrace();
      }
      return result;
  }

  public void analyzeTokens(TokenStream result) {
      try {
          // Consume the whole stream, tallying each term in TokenHolder.tokenMap
          Token token = result.next();
          while (token != null) {
              String tokenStr = token.termText();
              if (TokenHolder.tokenMap.get(tokenStr) == null) {
                  TokenHolder.tokenMap.put(tokenStr, 1);
              } else {
                  TokenHolder.tokenMap.put(tokenStr,
                      Integer.parseInt(TokenHolder.tokenMap.get(tokenStr).toString()) + 1);
              }
              token = result.next();
          }
          // extra reset
          result.reset();
      } catch (IOException e) {
          e.printStackTrace();
      }
  }
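
Editor's note: one likely explanation for the empty index is that reset() is an optional operation on a token stream, so a reader-backed stream that has been read to the end stays exhausted and the IndexWriter receives no tokens. The sketch below models this in plain Java without Lucene. SimpleTokenStream, OnePassStream, CachingStream and ReplayDemo are hypothetical names made up for illustration, not Lucene classes. Lucene's own CachingTokenFilter is built on the same caching idea, but whether it is available in the version used here is an assumption to check.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for a token stream; NOT a Lucene interface.
interface SimpleTokenStream {
    String next();   // returns null once the stream is exhausted
    void reset();    // optional operation: an implementation may do nothing
}

// Models a reader-backed stream: it can be read once and reset() is a no-op.
final class OnePassStream implements SimpleTokenStream {
    private final Iterator<String> tokens;

    OnePassStream(List<String> tokens) {
        this.tokens = tokens.iterator();
    }

    public String next() {
        return tokens.hasNext() ? tokens.next() : null;
    }

    public void reset() {
        // The underlying source cannot be rewound, so nothing happens here.
    }
}

// Buffers every token on the first pass so that reset() can replay them.
final class CachingStream implements SimpleTokenStream {
    private final SimpleTokenStream input;
    private final List<String> cache = new ArrayList<>();
    private boolean filled = false;
    private int pos = 0;

    CachingStream(SimpleTokenStream input) {
        this.input = input;
    }

    public String next() {
        if (!filled) {
            // Drain the wrapped stream into the cache exactly once.
            for (String t = input.next(); t != null; t = input.next()) {
                cache.add(t);
            }
            filled = true;
        }
        return pos < cache.size() ? cache.get(pos++) : null;
    }

    public void reset() {
        pos = 0; // rewind over the cached tokens
    }
}

final class ReplayDemo {
    // Tally term frequencies; this consumes the stream completely.
    static Map<String, Integer> count(SimpleTokenStream stream) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String t = stream.next(); t != null; t = stream.next()) {
            counts.merge(t, 1, Integer::sum);
        }
        return counts;
    }

    // Stand-in for the indexer: collects whatever tokens are still available.
    static List<String> drain(SimpleTokenStream stream) {
        List<String> seen = new ArrayList<>();
        for (String t = stream.next(); t != null; t = stream.next()) {
            seen.add(t);
        }
        return seen;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("the", "cat", "the");

        SimpleTokenStream plain = new OnePassStream(words);
        System.out.println(count(plain));   // {cat=1, the=2}
        plain.reset();
        System.out.println(drain(plain));   // [] : nothing left to index

        SimpleTokenStream cached = new CachingStream(new OnePassStream(words));
        count(cached);
        cached.reset();
        System.out.println(drain(cached));  // [the, cat, the]
    }
}
```

The trade-off is memory: the caching wrapper holds every token of the field until the second pass has finished.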
  

Karl Wettin <karl.wettin@gmail.com> wrote:
  
On 1 Nov 2007, at 18:09, Cool Coder wrote:

> prior to adding into index

The easiest way out would be to add the document to a temporary index and
extract the term frequency vector. I would recommend using MemoryIndex.

You could also tokenize the document and pass the data to a
TermVectorMapper. You could consider replacing the fields of the
document with cached token streams if you have the RAM to spare and
don't want to waste CPU analyzing the document twice. I would welcome a
TermVectorMappingCachedTokenStreamFactory. Even cooler would be to
pass code down to IndexWriter.addDocument using a command pattern or
something similar, allowing one to extend the document at the time of
the analysis.
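
Editor's note: the idea of hooking code into the analysis pass can be approximated without any change to IndexWriter by wrapping the token stream in a pass-through that tallies each token as the consumer pulls it. The document is then analyzed only once and nothing needs to be rewound. Below is a minimal plain-Java sketch of that pattern. CountingPassThrough and SinglePassDemo are hypothetical names, and Iterator<String> stands in for a Lucene TokenStream, so this shows the shape of the approach and is not Lucene code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Hypothetical pass-through: counts each token at the moment it is handed on.
final class CountingPassThrough implements Iterator<String> {
    private final Iterator<String> input;
    private final Map<String, Integer> counts;

    CountingPassThrough(Iterator<String> input, Map<String, Integer> counts) {
        this.input = input;
        this.counts = counts;
    }

    public boolean hasNext() {
        return input.hasNext();
    }

    public String next() {
        String token = input.next();
        counts.merge(token, 1, Integer::sum); // tally, then pass the token along unchanged
        return token;
    }
}

final class SinglePassDemo {
    // Stand-in for the indexer: it simply pulls every token from the stream.
    static List<String> index(Iterator<String> tokens) {
        List<String> indexed = new ArrayList<>();
        while (tokens.hasNext()) {
            indexed.add(tokens.next());
        }
        return indexed;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new HashMap<>();
        Iterator<String> analyzed = Arrays.asList("to", "be", "or", "not", "to", "be").iterator();

        // The indexer sees every token; the counts fill in as a side effect.
        List<String> indexed = index(new CountingPassThrough(analyzed, counts));

        System.out.println(indexed.size());   // 6
        System.out.println(counts.get("to")); // 2
    }
}
```

With this shape the counts are only complete after the consumer has finished pulling tokens, so they cannot be used to decide anything before the document is added.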


-- 
karl
