Date: Thu, 12 Jul 2007 12:53:38 -0500
From: "John Paul Sondag"
Sender: jpsondag@gmail.com
To: java-user@lucene.apache.org
Subject: Does Index have a Tokenizer Built into it

Hi,

When Lucene's standard indexer is used to store documents, does it store information about the tokens in any way? I'm playing around with making a snippet generator (like the Highlighter class), and it is going to involve a very large number of documents. For my test cases I have only used one document and simply passed it into the StandardTokenizer, but now I am ready to start working with a large number of documents.

I know one option is to store the text of each document as a field, then open the index and pass the stored text into a tokenizer, but storing the text of every document costs me way too much. I'm wondering whether, after opening the index, I can retrieve the tokens (not the terms) of a document, something akin to IndexReader.document(n).getTokenizer().

In summary, my current (too wasteful) implementation is this:

    new StandardTokenizer(new StringReader(indexReader.document(n).get("text")))

I'm wondering if Lucene has a more efficient way to retrieve the tokens of a document from an index.
It seems the index already holds information about every "term", since you can retrieve a TermPositions object.

Thanks,
--JP
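
To make the last point concrete, here is a minimal sketch of how per-term position information could, in principle, be inverted back into an ordered token sequence for a single document. It does not use Lucene's API: the class name TokensFromPositions, the rebuild method, and the plain Map standing in for the index are all hypothetical, chosen only for illustration. It also assumes each position holds exactly one term.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class TokensFromPositions {

    // Rebuild one document's token order from a term -> positions map.
    // The map is a hypothetical stand-in for what an index knows about
    // a single document; it is not a Lucene data structure.
    public static List<String> rebuild(Map<String, int[]> positionsByTerm) {
        // A TreeMap keyed by position keeps the terms sorted by where they occur.
        TreeMap<Integer, String> byPosition = new TreeMap<>();
        for (Map.Entry<String, int[]> entry : positionsByTerm.entrySet()) {
            for (int position : entry.getValue()) {
                byPosition.put(position, entry.getKey());
            }
        }
        return new ArrayList<>(byPosition.values());
    }

    public static void main(String[] args) {
        Map<String, int[]> postings = new TreeMap<>();
        postings.put("the", new int[] {0, 3});
        postings.put("quick", new int[] {1});
        postings.put("fox", new int[] {2});
        postings.put("dog", new int[] {4});

        // prints [the, quick, fox, the, dog]
        System.out.println(rebuild(postings));
    }
}
```

Note that this only recovers the analyzed terms in position order, not the original text, so whatever the analyzer dropped or rewrote stays lost. It also presumes the positions for one document are cheap to collect, which may not hold for a real index.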