lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From David Causse <dcau...@spotter.com>
Subject Re: How to get the tokens for a given document
Date Mon, 12 Apr 2010 18:26:07 GMT
Hi,

you are walking from indexReader.terms() then on indexReader.termDocs(Term t) 
for each term and then match your docID on the termsDocs enum? So you walk
the whole index?

You need a forward index and lucene is inverted but you have IMHO 2
solutions with lucene (sadly, they both require re-indexing):
 - Store the text you indexed, when you have to walk terms inside a doc,
   just, load that field and analyze it again.
 - Use a TermVector, when you create your content field use the
   constructor which accept the TermVector enum. You can then walk on it
   at search time : indexReader.getTermFreqVector(ID, fieldName)

Hope it helps.

On Mon, Apr 12, 2010 at 11:15:13AM -0700, Herbert Roitblat wrote:
> Hi, folks.
> I appreciate the help people have been offering.
> Here is my problem.  My immediate need is to get the tokens for a document from the Lucene
index.  I have a list of documents that I walk, one at a time.  Right now, I am getting the
tokens and their frequencies and the problem is that these stay in the heap as I move from
document to document.
> 
> Is there another way to get the tokens given a document ID?
> 
> Thanks,
> I'm looking for alternative ways to skin this cat.
> 
> Herb

-- 
David Causse
Spotter
http://www.spotter.com/

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message