Mailing-List: contact java-user-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: java-user@lucene.apache.org
Message-ID: <bd67edef0707121053m3d692818r7ccdc238d07fdd74@mail.gmail.com>
Date: Thu, 12 Jul 2007 12:53:38 -0500
From: "John Paul Sondag" <jsondag2@uiuc.edu>
Sender: jpsondag@gmail.com
To: java-user@lucene.apache.org
Subject: Does Index have a Tokenizer Built into it
MIME-Version: 1.0
Content-Type: multipart/alternative;
	boundary="----=_Part_22126_22818566.1184262818505"

------=_Part_22126_22818566.1184262818505
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Content-Disposition: inline

Hi,

When Lucene's standard indexer is used to store documents, does it store
information about the tokens in any way?  I'm playing around with making a
snippet generator (like the Highlighter class), and it is going to involve a
very large number of documents.  For my test cases I have used only one
document and simply passed it into the StandardTokenizer, but now I am
ready to start working with a large number of documents.  I know one
option is to store the text of each document as a field, then open the
index and pass that text into a tokenizer, but storing the text of each
document costs me far too much.  I'm wondering whether, after opening the
index, I can retrieve the Tokens (not the terms) of a document, something
akin to IndexReader.document(n).getTokenizer().

In summary:

My current (too wasteful) implementation is this:

new StandardTokenizer(new StringReader(
    indexReader.document(n).get("text")))

I'm wondering if Lucene has a more efficient way to retrieve the tokens
of a document from an index, because it seems to have information about
every "term" already, since you can retrieve a TermPositions object.
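
To make the idea concrete, here is a rough sketch of the kind of
reconstruction I have in mind.  It does not use Lucene at all: the map of
term to positions is a stand-in I made up for whatever TermPositions would
give me for a single document, and the class and method names are
hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of Lucene: rebuilds the token order of ONE
// document from a map of term -> positions at which that term occurs.
class TokenRebuilder {

    static List<String> rebuild(Map<String, int[]> postings) {
        // A TreeMap keyed by position keeps the terms in document order.
        TreeMap<Integer, String> byPosition = new TreeMap<>();
        for (Map.Entry<String, int[]> entry : postings.entrySet()) {
            for (int position : entry.getValue()) {
                byPosition.put(position, entry.getKey());
            }
        }
        // values() iterates in ascending key (position) order.
        return new ArrayList<>(byPosition.values());
    }
}
```

For example, with "lucene" at positions 0 and 3, "stores" at 1 and "terms"
at 2, rebuild(...) returns [lucene, stores, terms, lucene].  As far as I
can tell this would only give back the analyzed terms, not the original
text, which is part of why I am asking about Tokens.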

Thanks,


--JP

------=_Part_22126_22818566.1184262818505--