lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From David Causse <>
Subject Re: How to get the tokens for a given document
Date Mon, 12 Apr 2010 18:26:07 GMT

you are walking from indexReader.terms() then on indexReader.termDocs(Term t) 
for each term and then match your docID on the termsDocs enum? So you walk
the whole index?

You need a forward index and lucene is inverted but you have IMHO 2
solutions with lucene (sadly, they both require re-indexing):
 - Store the text you indexed, when you have to walk terms inside a doc,
   just, load that field and analyze it again.
 - Use a TermVector, when you create your content field use the
   constructor which accept the TermVector enum. You can then walk on it
   at search time : indexReader.getTermFreqVector(ID, fieldName)

Hope it helps.

On Mon, Apr 12, 2010 at 11:15:13AM -0700, Herbert Roitblat wrote:
> Hi, folks.
> I appreciate the help people have been offering.
> Here is my problem.  My immediate need is to get the tokens for a document from the Lucene
index.  I have a list of documents that I walk, one at a time.  Right now, I am getting the
tokens and their frequencies and the problem is that these stay in the heap as I move from
document to document.
> Is there another way to get the tokens given a document ID?
> Thanks,
> I'm looking for alternative ways to skin this cat.
> Herb

David Causse

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message