lucene-java-user mailing list archives

From Karl Wettin <karl.wet...@gmail.com>
Subject Re: Best way to count tokens
Date Thu, 01 Nov 2007 18:55:54 GMT

On 1 Nov 2007, at 18:09, Cool Coder wrote:

> prior to adding into index

The easiest way out would be to add the document to a temporary index and
extract the term frequency vector. I would recommend using MemoryIndex.

You could also tokenize the document yourself and pass the data to a
TermVectorMapper. If you have the RAM to spare and don't want to waste
CPU analyzing the document twice, you could consider replacing the
fields of the document with CachedTokenStreams. I would welcome a
TermVectorMappingCachedTokenStreamFactory. Even cooler would be to
pass code down through IndexWriter.addDocument using a command pattern
or something similar, allowing one to extend the document at the time
of analysis.
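
To make the "tokenize it yourself and count" idea concrete, here is a
minimal sketch in plain Java. It deliberately uses no Lucene classes: the
regex split below is only a stand-in for whatever Analyzer you index with,
and the names TokenCounter, countTokens and totalTokens are made up for
illustration. In real code you would count the tokens your analyzer
emits, otherwise the numbers will not match what ends up in the index.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of Lucene.
final class TokenCounter {

    // Counts how often each token occurs in the text.
    // Assumption: lower-casing and splitting on runs of non-letter,
    // non-digit characters approximates the analyzer used for indexing.
    static Map<String, Integer> countTokens(String text) {
        Map<String, Integer> frequencies = new LinkedHashMap<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (token.isEmpty()) {
                continue; // split() can yield an empty first element
            }
            frequencies.merge(token, 1, Integer::sum);
        }
        return frequencies;
    }

    // Total number of tokens, i.e. the sum of all term frequencies.
    static int totalTokens(Map<String, Integer> frequencies) {
        int total = 0;
        for (int count : frequencies.values()) {
            total += count;
        }
        return total;
    }

    public static void main(String[] args) {
        Map<String, Integer> frequencies = countTokens("The quick fox. The lazy dog!");
        System.out.println(frequencies);
        System.out.println(totalTokens(frequencies));
    }
}
```

The map plays the role of the term frequency vector: once you have it,
you can store the counts in extra fields before handing the document to
the writer.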


-- 
karl


