From: Herbert L Roitblat
Date: Mon, 12 Apr 2010 13:27:26 -0700
To: David Causse, java-user@lucene.apache.org
Subject: Re: How to get the tokens for a given document

Thanks David. I think that I neglected to say that I am using pyLucene 2.4.0.

Your suggestion is almost what we're doing:

>indexReader.getTermFreqVector(ID, fieldName)

    self.hits = list(self.lSearcher.search(self.query))
    if self.hits:
        self.hit = lucene.Hit.cast_(self.hits[0])
        self.tfvs = self.lReader.getTermFreqVectors(self.hit.id)

At the very least, I may be able to reduce overhead by just adding the fieldName to the indexReader call.

The problem I'm facing is that all of the tokens in all of the fields in all of the documents get added to the heap, and it runs out of space. I'm looking for other ways of getting the information I need that might not fill up the heap.

Thanks again.
Herb

Your other suggestion may also be what we end up doing. Since our documents can be in any language, I will have to make sure that I use the right analyzer.

>load that field and analyze it again.

David Causse wrote:
> Hi,
>
> You are walking from indexReader.terms(), then indexReader.termDocs(Term t)
> for each term, and then matching your docID on the termDocs enum? So you walk
> the whole index?
>
> You need a forward index and Lucene is inverted, but you have IMHO 2
> solutions with Lucene (sadly, they both require re-indexing):
> - Store the text you indexed. When you have to walk terms inside a doc,
>   just load that field and analyze it again.
> - Use a TermVector. When you create your content field, use the
>   constructor which accepts the TermVector enum. You can then walk on it
>   at search time: indexReader.getTermFreqVector(ID, fieldName)
>
> Hope it helps.
>
> On Mon, Apr 12, 2010 at 11:15:13AM -0700, Herbert Roitblat wrote:
>
>> Hi, folks.
>> I appreciate the help people have been offering.
>> Here is my problem. My immediate need is to get the tokens for a document from the Lucene index. I have a list of documents that I walk, one at a time. Right now, I am getting the tokens and their frequencies and the problem is that these stay in the heap as I move from document to document.
>>
>> Is there another way to get the tokens given a document ID?
>>
>> Thanks,
>> I'm looking for alternative ways to skin this cat.
>>
>> Herb
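
For readers of the archive, the forward-versus-inverted distinction discussed in this thread can be sketched in plain Python. This is only a toy illustration and makes no Lucene or pyLucene calls: the corpus `DOCS`, the whitespace `analyze` function, and every helper name below are hypothetical stand-ins, not anything from the original posts. It contrasts recovering one document's terms by scanning every posting list of an inverted index with a direct per-document lookup (roughly what a stored term vector provides), and shows yielding one document's frequencies at a time so nothing accumulates across documents.

```python
from collections import Counter, defaultdict

# Toy corpus standing in for indexed documents (hypothetical data).
DOCS = {
    0: "the quick brown fox",
    1: "the lazy dog",
    2: "quick quick dog",
}


def analyze(text):
    """Stand-in for an analyzer: lowercase and split on whitespace."""
    return text.lower().split()


def build_inverted(docs):
    """Build term -> {doc_id: frequency}, the shape of an inverted index."""
    inverted = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, freq in Counter(analyze(text)).items():
            inverted[term][doc_id] = freq
    return inverted


def tokens_via_inverted(inverted, doc_id):
    """Recover one document's terms by scanning every term's postings."""
    return {
        term: postings[doc_id]
        for term, postings in inverted.items()
        if doc_id in postings
    }


def build_forward(docs):
    """Build doc_id -> {term: frequency}, analogous to a per-document term vector."""
    return {doc_id: dict(Counter(analyze(text))) for doc_id, text in docs.items()}


def iter_doc_tokens(forward, doc_ids):
    """Yield (doc_id, frequencies) one document at a time; nothing is accumulated here."""
    for doc_id in doc_ids:
        yield doc_id, forward[doc_id]


if __name__ == "__main__":
    forward = build_forward(DOCS)
    for doc_id, freqs in iter_doc_tokens(forward, [0, 1, 2]):
        print(doc_id, freqs)
```

The inverted lookup touches every term in the index for each document, while the forward lookup touches only that document's own terms. Whether a real index behaves this way in memory depends on the Lucene version and on how references to the returned vectors are held, which this sketch does not model.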