From: Ted Dunning
Date: Sun, 7 Jun 2009 15:28:17 -0700
Subject: Re: How to structure lucene query?
To: general@lucene.apache.org

You can have more than one kind of document in your index. If you have
both users and reports in your index, you can first search the users to
find those who have touched the subjects you need, and then search for
reports written by any of the authors who qualified in the first query.
This will be pretty efficient, especially if you paginate the list of
authors so that you only retrieve the documents for a few authors at a
time.

To be specific, reports would have five fields: author_name, author_id,
report_create_dt, report_type, and text, just as you mentioned above. I
would add a globally unique report id called report_id.

Authors would then have a few fields: user_id, user_name,
report_types_written and report_ids_written.

Query one would be

    report_types_written:(+"XYZ" +"ABC")

From each hit you would retain user_name and report_ids_written. Note
that reports don't have the report_types_written field and thus will
never be retrieved here.
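
The two document kinds and the first query can be sketched roughly as
follows. This is only an illustration in plain Python, not Lucene API
code: documents are modelled as dicts, build_author_query and
has_all_terms are hypothetical helper names, the field values shown are
made up, and the terms are assumed to contain no quote characters that
would need escaping.

```python
# Two kinds of document living in the same index, modelled as plain dicts.
# All field values here are hypothetical examples.
report_doc = {
    "report_id": 1,
    "author_id": 52,
    "author_name": "Alice",
    "report_create_dt": "20080115",
    "report_type": "ABC",
    "text": "body of the report",
}

author_doc = {
    "user_id": 52,
    "user_name": "Alice",
    "report_types_written": ["ABC", "XYZ"],
    "report_ids_written": [1, 3, 5, 7],
}


def build_author_query(required_terms, field="report_types_written"):
    """Build the query-parser string for the first query.

    Every term is prefixed with + so that all of them are required.
    """
    if not required_terms:
        raise ValueError("need at least one required term")
    clauses = " ".join('+"{}"'.format(term) for term in required_terms)
    return "{}:({})".format(field, clauses)


def has_all_terms(doc, field, required_terms):
    """In-memory stand-in for the first query's matching rule.

    A document that lacks the field can never match, which is why
    report documents are not returned by the author query.
    """
    values = doc.get(field)
    if values is None:
        return False
    return all(term in values for term in required_terms)


print(build_author_query(["XYZ", "ABC"]))
```

The point of has_all_terms is only to show why the first query is
restricted to author documents: the field it searches does not exist on
reports.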

Suppose you find the following three authors:

    user_id: 52, user_name: Alice, report_ids_written: 1, 3, 5, 7
    user_id: 1327, user_name: Bob, report_ids_written: 22, 11, 55, 77, 3
    user_id: 52, user_name: Alice, report_ids_written: 4, 6, 12

The second query would be

    report_id:(1 3 4 5 6 7 11 12 22 55 77)

If you find thousands of reports that you want to retrieve, you should
retrieve them in batches. If you present the authors in pages of 10 or
20, then you are unlikely to have more than a few dozen reports to
retrieve per page.

Note that the second query will only retrieve reports, because authors
don't have the report_id field.

If you want to limit the dates for the second query, you could use this:

    +report_id:(1 3 4 5 6 7 11 12 22 55 77) +report_create_dt:[20051009 TO 20090605]

Note how there is now a + on the first clause. It could have been used
on the first version of the second query, but it would have had no
effect there. Once you add additional clauses, however, you need to
include the plus signs to make sure that all of the conditions are
strictly applied.

Better?

On Sun, Jun 7, 2009 at 11:48 AM, ywlee522 wrote:

> The query is "of the users who has one or more reports containing "ABC",
> find users who also has one or more reports containing "XYZ".
>
> ..
>
> If I put all reports of a user into a single Lucene document, then it is
> equal to find all documents containing both "ABC" and "XYZ". But, then, i
> will lose the report_dt field, which is another parameter in the query.
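
The second stage described in the reply above could be assembled along
these lines. Again this is a plain-Python sketch under assumptions, not
Lucene API code: collect_report_ids, build_report_query and batches are
hypothetical helper names, and the author hits are assumed to arrive as
dicts carrying the report_ids_written list retained from the first
query.

```python
def collect_report_ids(author_hits):
    """Merge the report ids of all author hits, dropping duplicates."""
    ids = set()
    for hit in author_hits:
        ids.update(hit["report_ids_written"])
    return sorted(ids)


def build_report_query(report_ids, date_from=None, date_to=None):
    """Build the query-parser string for the second query.

    Without a date range the id clause stands alone. With a date range
    both clauses get a + so that each one is required.
    """
    if not report_ids:
        raise ValueError("need at least one report id")
    id_clause = "report_id:({})".format(" ".join(str(i) for i in report_ids))
    if date_from is None and date_to is None:
        return id_clause
    if date_from is None or date_to is None:
        raise ValueError("give both ends of the date range or neither")
    return "+{} +report_create_dt:[{} TO {}]".format(id_clause, date_from, date_to)


def batches(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# The report id lists from the three author hits in the message.
author_hits = [
    {"report_ids_written": [1, 3, 5, 7]},
    {"report_ids_written": [22, 11, 55, 77, 3]},
    {"report_ids_written": [4, 6, 12]},
]

report_ids = collect_report_ids(author_hits)
print(build_report_query(report_ids))
print(build_report_query(report_ids, "20051009", "20090605"))
for chunk in batches(report_ids, 5):
    print(build_report_query(chunk))
```

Report 3 appears under two authors, so the ids are merged through a set
before the query is built. Batching also keeps each query small, which
is worth checking against the clause limit that Lucene's BooleanQuery
enforces when a page of authors has written many reports.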