Mailing-List: contact general-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: general@lucene.apache.org
Received-SPF: pass (athena.apache.org: domain of ted.dunning@gmail.com
 designates 74.125.46.28 as permitted sender)
MIME-Version: 1.0
In-Reply-To: <23914028.post@talk.nabble.com>
References: <23902784.post@talk.nabble.com>
 <c7d45fc70906062339s6f274905y4ba1982e85c5da59@mail.gmail.com>
	<23911598.post@talk.nabble.com>
 <f18c9dde0906070940t2458ff3q2603b47144684ee9@mail.gmail.com>
	<23914028.post@talk.nabble.com>
From: Ted Dunning <ted.dunning@gmail.com>
Date: Sun, 7 Jun 2009 15:28:17 -0700
Message-ID: <c7d45fc70906071528h24e969bdt7a7096749926fdef@mail.gmail.com>
Subject: Re: How to structure lucene query?
To: general@lucene.apache.org
Content-Type: multipart/alternative; boundary=000e0cd6a9828f3bcf046bc9a569

--000e0cd6a9828f3bcf046bc9a569
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 7bit

You can have more than one kind of document in your index.

If you have users and reports in your index then you can search users to
find those who have touched the subjects you need and then you can search
for reports authored by one of the authors qualified in the first query.
This will be pretty efficient, especially if you paginate the list of
authors so you only retrieve the documents for a few authors at a time.

To be specific, reports would have five fields: author_name, author_id,
report_create_dt, report_type, and text, just as you mentioned above.  I
would add a sixth, a globally unique report id called report_id.

Then authors would have a few fields: user_id, user_name,
report_types_written and report_ids_written.
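
As a rough sketch of what those two kinds of document could look like, here
they are as plain Python dicts rather than real Lucene documents.  The field
names come from the description above; every value is made up for
illustration.

```python
# Hypothetical report document (values are illustrative only).
report_doc = {
    "report_id": "3",
    "author_id": "52",
    "author_name": "Alice",
    "report_create_dt": "20080114",
    "report_type": "ABC",
    "text": "full text of the report",
}

# Hypothetical author document (values are illustrative only).
author_doc = {
    "user_id": "52",
    "user_name": "Alice",
    "report_types_written": ["ABC", "XYZ"],
    "report_ids_written": ["1", "3", "5", "7"],
}

# The two kinds of document share no field names, so a query on a single
# field can only ever match one kind of document.
shared_fields = set(report_doc) & set(author_doc)
print(shared_fields)
```

Keeping the field names disjoint is what lets both kinds of document live in
one index without the two queries interfering with each other.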

Query one would be

    report_types_written:(+"XYZ" +"ABC")

From each hit you would retain the user_name and report_ids_written fields.
Note that reports don't have the report_types_written field and thus will
never be retrieved here.

Suppose you find the following three authors:

*user_id: 52, user_name: Alice, report_ids_written: 1, 3, 5, 7*
*user_id: 1327, user_name: Bob, report_ids_written: 22, 11, 55, 77, 3*
*user_id: 52, user_name: Alice, report_ids_written: 4, 6, 12*

The second query would be

    report_id:(1 3 4 5 6 7 11 12 22 55 77)

If you find thousands of reports that you want to retrieve, you should
retrieve them in batches.  If you present the authors in pages of 10 or 20,
then you are unlikely to have more than dozens of reports to retrieve per
page.
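
Here is a minimal sketch of how the second query could be assembled from a
page of author hits.  It only builds the query string; it does not call
Lucene.  The helper names build_report_query and pages are made up for this
example, and the hits are the three listed above.

```python
def build_report_query(author_hits):
    """Union the report ids from a page of author hits into one report_id query."""
    ids = set()
    for hit in author_hits:
        ids.update(hit["report_ids_written"])
    return "report_id:(" + " ".join(str(i) for i in sorted(ids)) + ")"


def pages(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


author_hits = [
    {"user_id": 52, "user_name": "Alice", "report_ids_written": [1, 3, 5, 7]},
    {"user_id": 1327, "user_name": "Bob", "report_ids_written": [22, 11, 55, 77, 3]},
    {"user_id": 52, "user_name": "Alice", "report_ids_written": [4, 6, 12]},
]

# One query covering every hit at once.
query = build_report_query(author_hits)

# One smaller query per page of authors (two authors per page here).
paged_queries = [build_report_query(page) for page in pages(author_hits, 2)]
```

Collecting the ids into a set means a report id that shows up under more
than one author (3 in this case) appears only once in the query.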

Note that the second query will only retrieve reports because authors don't
have that field.

If you want to limit the dates for the second query, you could use this:

    +report_id:(1 3 4 5 6 7 11 12 22 55 77) +report_create_dt:[20051009 TO 20090605]

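Note how there is now a + on the first clause.  It could have been used on
the first version of the second query but would have had no effect, since a
query with a single clause has to match that clause anyway.  Once you add
further clauses, however, you need to include the pluses to make sure you
strictly apply all the conditions you need: with the query parser's default
OR operator, clauses without a + are optional, so a report matching only the
date range would also be returned.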
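
To make the effect of the + concrete, here is a small sketch.  The first
function just builds the date-restricted query string.  The second is a toy
model of required versus optional clauses, not Lucene's actual matching or
scoring code.  Both function names and the "stray" report are hypothetical.

```python
def build_dated_report_query(report_ids, start_dt, end_dt):
    """Build a query that requires both the id clause and the date-range clause."""
    ids = " ".join(str(i) for i in report_ids)
    return "+report_id:(%s) +report_create_dt:[%s TO %s]" % (ids, start_dt, end_dt)


def toy_match(report, wanted_ids, start_dt, end_dt, required):
    """Toy model: required=True means every clause must hold (the + form);
    required=False means any one clause is enough (the form without +)."""
    clauses = [
        report["report_id"] in wanted_ids,
        # yyyymmdd strings compare correctly as plain strings
        start_dt <= report["report_create_dt"] <= end_dt,
    ]
    return all(clauses) if required else any(clauses)


wanted = [1, 3, 4, 5, 6, 7, 11, 12, 22, 55, 77]
query = build_dated_report_query(wanted, "20051009", "20090605")

# Hypothetical report: inside the date range, but not in the wanted id list.
stray = {"report_id": 99, "report_create_dt": "20070301"}
with_plus = toy_match(stray, wanted, "20051009", "20090605", required=True)
without_plus = toy_match(stray, wanted, "20051009", "20090605", required=False)
```

In this model the stray report is rejected when both clauses are required
and accepted when they are optional, which is exactly the leak the pluses
are there to prevent.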

Better?


On Sun, Jun 7, 2009 at 11:48 AM, ywlee522 <ywlee522@gmail.com> wrote:

> The query is "of the users who has one or more reports containing "ABC",
> find users who also has one or more reports containing "XYZ".
>
> ..
>
> If I put all reports of a user into a single Lucene document, then it is
> equal to find all documents containing  both "ABC" and "XYZ".  But, then, i
> will lose the report_dt field, which is another parameter in the query.
>

--000e0cd6a9828f3bcf046bc9a569--