Subject: Question about querying for files in a zip file
Date: Fri, 8 Jun 2007 13:54:26 -0500
Message-ID: <E387E2E9622FDD408359F98BF183879ED2BDAA@dc1.storediq.com>
From: "Eric Scott" <escott@storediq.com>
To: <java-user@lucene.apache.org>

This isn't a "How do I index a zip file?" question.  It's a bit more
complicated than that.

We have an index where zip files are broken apart and the contained
files are indexed.  The index also contains a doc for the zip file
itself.  The user has the option of (A) querying for the contained files
that match the query (a vanilla query), or (B) querying for the unique
set of zip files that have contained files that match the query.  My
question is how to *efficiently* accomplish option (B) in Lucene.

In case it helps, here's another way to explain the requirement in a
relational model.  If you had a table of docs with these columns:

    MyDocs table
    =========
    Docid
    ZipfileName
    Filename
    Other columns to match on...

then option (B) can be returned with a simple join:

    select distinct zip.docid, zip.other-columns, ...
    from mydocs zip, mydocs contained
    where
        contained.zipfilename = zip.filename
        and contained.docid matches lucene query...

In Lucene, the conceptual, straightforward solution is something like
this:

    Do a lucene query to get the matching contained docs.
    For each matching doc:
        Look up the zip filename via a field on the doc.
        If the zip file is not part of our zipfile result set yet, then
            Save the zip filename in the result set.
    Run another lucene query to look up the zipfile docids in the
        zipfile result set.
    Read any required fields for each zipfile doc.
    Return the zipfile result set with the required fields.
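
To make the steps above concrete, here is a minimal in-memory sketch in
Java.  It does not use the Lucene API at all: the Doc class, its field
names, the sampleIndex() data, and the Predicate standing in for the
query are all hypothetical, chosen only to mirror the MyDocs columns.
It shows the shape of the two passes, not how they would perform on a
real index.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;

final class ZipMatchSketch {

    /** Hypothetical stand-in for an indexed document; not a Lucene class. */
    static final class Doc {
        final int docId;
        final String zipfileName; // null when this doc is the zip file itself
        final String filename;
        final String text;

        Doc(int docId, String zipfileName, String filename, String text) {
            this.docId = docId;
            this.zipfileName = zipfileName;
            this.filename = filename;
            this.text = text;
        }
    }

    /** Pass 1: collect the distinct zip filenames of matching contained docs. */
    static Set<String> matchingZipNames(List<Doc> index, Predicate<Doc> query) {
        Set<String> names = new LinkedHashSet<>();
        for (Doc d : index) {
            if (d.zipfileName != null && query.test(d)) {
                names.add(d.zipfileName); // the set drops duplicates
            }
        }
        return names;
    }

    /** Pass 2: look up the zip file docs whose filename is in the set. */
    static List<Doc> zipDocs(List<Doc> index, Set<String> zipNames) {
        List<Doc> result = new ArrayList<>();
        for (Doc d : index) {
            if (d.zipfileName == null && zipNames.contains(d.filename)) {
                result.add(d);
            }
        }
        return result;
    }

    /** Made-up sample data: two zip files with contained files. */
    static List<Doc> sampleIndex() {
        List<Doc> index = new ArrayList<>();
        index.add(new Doc(1, null, "a.zip", ""));
        index.add(new Doc(2, "a.zip", "readme.txt", "lucene notes"));
        index.add(new Doc(3, "a.zip", "faq.txt", "lucene faq"));
        index.add(new Doc(4, null, "b.zip", ""));
        index.add(new Doc(5, "b.zip", "todo.txt", "shopping list"));
        return index;
    }

    public static void main(String[] args) {
        List<Doc> index = sampleIndex();
        Set<String> names = matchingZipNames(index, d -> d.text.contains("lucene"));
        for (Doc z : zipDocs(index, names)) {
            System.out.println(z.docId + " " + z.filename);
        }
    }
}
```

With the sample data, two contained files match but only one zip file
doc (a.zip) comes back.  The intermediate set of zip filenames is the
part that grows with the number of matching contained docs.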

The trouble with this solution is that it is very slow and a memory hog.
Does anyone have any nifty ideas that beat this straightforward
approach?

We would also entertain alternative indexing approaches.  We even
considered concatenating all the text of the contained docs into a doc
indexed as the zipfile, but Lucene only indexes part of a large file,
and even if that were resolved, proximity searches can return false
positives when the matching terms come from different contained files.

And FYI, scoring is not an issue on the zip file.  It's purely match or
no-match semantics.

Thanks,

- Eric Scott

