From: "Eric Scott"
Date: Fri, 8 Jun 2007 13:54:26 -0500
Subject: Question about querying for files in a zip file

This isn't a "How do I index a zip file?" question. It's a bit more complicated than that. We have an index where zip files are broken apart and the contained files are indexed. The index also contains a doc for the zip file itself.

The user has the option of (A) querying for the contained files that match the query (a vanilla query), or (B) querying for the unique set of zip files that have contained files that match the query. My question is how to *efficiently* accomplish option (B) in Lucene.

In case it helps, here's another way to explain the requirement in a relational model. If you had a table of docs with these columns:

MyDocs table
=========
Docid
ZipfileName
Filename
Other columns to match on...

then option (B) can be returned with a simple join:

    select distinct zip.docid, zip.other-columns, ...
    from mydocs zip, mydocs contained
    where contained.zipfilename = zip.filename
    and contained.docid matches lucene query...

In Lucene, the conceptual, straightforward solution is something like this:

    Do a Lucene query to get the matching contained docs.
    For each matching doc:
        Look up the zip filename via a field on the doc.
        If the zip file is not part of our zipfile result set yet, then
            Save the zip filename in the result set.
    Run another Lucene query to look up the zipfile docids in the zipfile result set.
    Read any required fields for each zipfile doc.
    Return the zipfile result set with the required fields.

The trouble with this solution is that it is very slow and a memory hog. Does anyone have any nifty ideas that beat this straightforward approach?

We would also entertain alternative indexing approaches. We even considered concatenating all the text of the contained docs into a doc indexed as the zipfile, but Lucene only indexes part of a large file, and even if that were resolved, proximity searches could return false positives.

And FYI, scoring is not an issue on the zip file. It's purely match or no-match semantics.

Thanks,
- Eric Scott
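
P.S. To make the straightforward approach above concrete, here is a minimal Java sketch of it. It uses plain in-memory collections as a stand-in for the index and does not call any Lucene API. The class `ZipMatchSketch`, the `Doc` fields, and the substring "query" are all hypothetical names chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;

/** In-memory stand-in for the index; nothing here is Lucene API. */
final class ZipMatchSketch {

    /** One indexed doc. zipfileName is null for the zip file's own doc (an assumption). */
    static final class Doc {
        final int docid;
        final String zipfileName;
        final String filename;
        final String text;

        Doc(int docid, String zipfileName, String filename, String text) {
            this.docid = docid;
            this.zipfileName = zipfileName;
            this.filename = filename;
            this.text = text;
        }
    }

    /** Option (B): the distinct zip file docs with at least one matching contained doc. */
    static List<Doc> matchingZips(List<Doc> index, Predicate<Doc> query) {
        // Pass 1: "query" the contained docs and save each zip filename once.
        Set<String> zipNames = new LinkedHashSet<>();
        for (Doc d : index) {
            if (d.zipfileName != null && query.test(d)) {
                zipNames.add(d.zipfileName);
            }
        }
        // Pass 2: look up the zip file's own doc for every saved filename.
        List<Doc> result = new ArrayList<>();
        for (Doc d : index) {
            if (d.zipfileName == null && zipNames.contains(d.filename)) {
                result.add(d);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Doc> index = Arrays.asList(
            new Doc(1, null, "a.zip", ""),
            new Doc(2, "a.zip", "readme.txt", "lucene notes"),
            new Doc(3, "a.zip", "todo.txt", "more lucene"),
            new Doc(4, null, "b.zip", ""),
            new Doc(5, "b.zip", "other.txt", "nothing here"));

        // Two contained docs match, but a.zip is reported only once.
        for (Doc zip : matchingZips(index, d -> d.text.contains("lucene"))) {
            System.out.println(zip.docid + " " + zip.filename);
        }
    }
}
```

In this toy version the set of zip filenames built in the first pass grows with the number of distinct matching zip files, and every matching contained doc has to be visited to fill it. That is roughly where I'd expect the time and memory to go on a large index.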