lucene-java-user mailing list archives

From "John Paul Sondag" <jsond...@uiuc.edu>
Subject Does Index have a Tokenizer Built into it
Date Thu, 12 Jul 2007 17:53:38 GMT
Hi,

When Lucene's standard indexer is used to store documents, does it store the
information about the tokens in any way?  I'm playing around with making a
snippet generator (like the Highlighter class), and it is going to involve a
very large number of documents.  For my test cases I have only used one
document and simply passed the document into the StandardTokenizer.  But now
I am ready to start working with a large number of documents.  I know one
option is to store the text of each document as a field, then open the index
and pass that stored text into a tokenizer, but storing the text of each
document costs me far too much space.  I'm wondering whether, after opening
the index, I can retrieve the tokens (not the terms) of a document, something
akin to IndexReader.document(n).getTokenizer().

In summary:

My current (too wasteful) implementation is this:

new StandardTokenizer(new StringReader(indexReader.document(n).get("text")))

I'm wondering if Lucene has a more efficient way to retrieve the tokens
of a document from an index, because it seems to have information about
every "term" already, since you can retrieve a TermPositions object.
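
To make the idea concrete, here is a rough sketch in plain Java of rebuilding
a document's token order from per-term position lists, which is the kind of
data a TermPositions enumeration exposes.  It makes no Lucene calls: the
class TokenRebuilder, the method rebuild, and the Map-based postings are all
made up for illustration, and the result would be the indexed terms (after
analysis), not the original text or its character offsets.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class TokenRebuilder {

    // Hypothetical helper, not part of Lucene. "postings" maps each term to
    // the positions at which it occurs in ONE document.
    public static List<String> rebuild(Map<String, int[]> postings) {
        // A TreeMap keeps the entries sorted by position.
        TreeMap<Integer, List<String>> byPosition = new TreeMap<>();
        for (Map.Entry<String, int[]> entry : postings.entrySet()) {
            for (int pos : entry.getValue()) {
                byPosition.computeIfAbsent(pos, k -> new ArrayList<>())
                          .add(entry.getKey());
            }
        }
        // Flatten into a single sequence ordered by position.
        List<String> tokens = new ArrayList<>();
        for (List<String> termsAtPosition : byPosition.values()) {
            tokens.addAll(termsAtPosition);
        }
        return tokens;
    }

    public static void main(String[] args) {
        Map<String, int[]> postings = new HashMap<>();
        postings.put("lucene", new int[] {0, 3});
        postings.put("stores", new int[] {1});
        postings.put("terms", new int[] {2});
        System.out.println(rebuild(postings)); // [lucene, stores, terms, lucene]
    }
}
```

Doing this against a real index would presumably mean walking every term to
find the ones that occur in the document, which may be too slow for a large
index.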

Thanks,


--JP
