From: "Erick Erickson"
To: java-user@lucene.apache.org
Date: Wed, 11 Jun 2008 10:05:47 -0400
Subject: Re: retrieve all docs efficiently - just one field
Message-ID: <359a92830806110705w362342b3wf70d67e1cd3da5c7@mail.gmail.com>
In-Reply-To: <17766268.post@talk.nabble.com>

I infer from your message that you're using a Hits object to get your IDs to
insert in your temporary table. Here's the problem with Hits: it re-executes
the query every 100 (200?) hits. So you can think of it as

    while (more hits) {
        if ((count % 100) == 0) {
            execute the search and skip the items already seen
        }
        work with the document
    }

Re-executing the query every 100 hits you look at can be a major bottleneck.
HitCollector avoids this re-execution, and can result in very significant
speedups when iterating through many documents.

FieldSelector allows lazy fetching. That is, when you do something like
Reader.document(idx, selector), you load only those fields from the document
that you specify with the selector. In your case, you would load only the ID
you care about and insert that in your temporary table. This can also result
in very significant savings, especially if you only want to load a very small
field from a document that has very large fields.
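To put rough numbers on the re-execution cost, here is a minimal sketch in plain Java. It does not use Lucene at all: the class HitsCostSketch and its methods are hypothetical stand-ins, and the batch size of 100 is just the figure guessed at above, so treat it as a toy cost model rather than a description of what Hits does internally. It counts how many results get walked when the "query" restarts at every batch boundary versus when each match is visited once, collector-style.

```java
import java.util.ArrayList;
import java.util.List;

// Toy cost model only; none of these names come from Lucene.
class HitsCostSketch {

    // Hypothetical stand-in for the index: master-table IDs of the matching docs.
    static long[] matchingIds(int n) {
        long[] ids = new long[n];
        for (int i = 0; i < n; i++) {
            ids[i] = 1000L + i;
        }
        return ids;
    }

    // Hits-style iteration as described above: at every batch boundary the
    // query is re-run from the start, up to the end of the new batch.
    // Returns the total number of results walked across all re-executions.
    static long hitsStyleVisits(long[] matches, int batch, List<Long> out) {
        long visits = 0;
        for (int count = 0; count < matches.length; count++) {
            if (count % batch == 0) {
                int end = Math.min(count + batch, matches.length);
                visits += end; // re-walk everything from hit 0 to 'end'
            }
            out.add(matches[count]); // "work with the document"
        }
        return visits;
    }

    // Collector-style iteration: every match is handed over exactly once.
    static long collectorStyleVisits(long[] matches, List<Long> out) {
        long visits = 0;
        for (long id : matches) {
            visits++;
            out.add(id);
        }
        return visits;
    }

    public static void main(String[] args) {
        long[] matches = matchingIds(1000);
        List<Long> viaHits = new ArrayList<Long>();
        List<Long> viaCollector = new ArrayList<Long>();
        System.out.println("Hits-style visits: "
                + hitsStyleVisits(matches, 100, viaHits));           // 5500
        System.out.println("Collector-style visits: "
                + collectorStyleVisits(matches, viaCollector));      // 1000
    }
}
```

Both loops produce the same list of IDs; only the amount of repeated work differs. Under this model the Hits-style cost grows roughly with the square of the number of matches, which is why it hurts so much when you need the full result set.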
See a writeup I did for one of my projects on the Lucene Wiki:
http://wiki.apache.org/lucene-java/FieldSelectorPerformance?highlight=(FieldSelector)

Hope this helps
Erick

On Tue, Jun 10, 2008 at 6:35 PM, 1world1love wrote:

> Greetings all. I have read many posts concerning similar use cases, but I am
> still a little hazy on the best way to achieve what I need to do. Here is
> the background:
>
> 2 million documents with multiple sections, some sections contain structured
> data, some unstructured.
>
> We parse the docs and place the structured stuff in oracle where each
> section is a table and one master table to relate them all.
>
> We index the unstructured sections with lucene where each section is a
> document (meaning a total of about ~30 million documents) with extra fields
> including one for the primary key of the master table and then some meta
> fields to describe the section - type, date, etc.
>
> For a common use case, say we have a table called demographics with a number
> field that represents age (overly simplistic but gets the point across).
>
> So say we want all people over the age of 50 who may have visited Panama:
>
> --
> We have our lucene index and we want to search the section text for the word
> "panama"
>
> AND
>
> We want to select from the demographics table where age > 50.
> --
>
> Now I need to intersect the master table IDs from my lucene hits and my
> table results.
>
> I have a java stored procedure that runs the lucene query and creates a
> temporary table with a single column where I insert the master id from the
> hits of my lucene query. I then can do a join with my structured query
> results.
>
> The problem here is obviously the speed of iterating through the hits to
> extract the single field that I need.
>
> Notes:
> - I must be able to get a full set of results, though I only need the one id
>   field
> - We originally went with Oracle text which was simple, but limited and
>   quite slow for most queries
>
> I have read a little about the hitcollector class and the fieldselector api,
> but I am still not sure how they may help me or even if they can.
>
> I have also tooled around with the idea of using termdocs, but the queries
> may get a little complex with various ors/ands/nots, though probably not
> spans and so forth.
>
> Any suggestions will be greatly appreciated.
>
> Thanks,
>
> J
>
> --
> View this message in context:
> http://www.nabble.com/retrieve-all-docs-efficiently---just-one-field-tp17766268p17766268.html
> Sent from the Lucene - Java Users mailing list archive at Nabble.com.