Mailing-List: contact java-user-help@lucene.apache.org; run by ezmlm
Reply-To: java-user@lucene.apache.org
MIME-Version: 1.0
In-Reply-To: <22066404.post@talk.nabble.com>
References: <22044596.post@talk.nabble.com>
 <359a92830902161253m1bfc4ad0sa86b17ed0027aa17@mail.gmail.com>
 <22055571.post@talk.nabble.com>
 <359a92830902170530o77b34a13sc9e225980a5a99bf@mail.gmail.com>
 <22066404.post@talk.nabble.com>
Date: Tue, 17 Feb 2009 16:42:25 -0500
Message-ID: <359a92830902171342g2b45953bm45283dfae2f1e43d@mail.gmail.com>
Subject: Re: Querying for a catagory
From: Erick Erickson
To: java-user@lucene.apache.org
Content-Type: multipart/alternative; boundary=000e0cd15534cf58950463242d2a

--000e0cd15534cf58950463242d2a
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit

OK, I think I'm getting it, but I'm slow sometimes.

The first thing I'd try is to make sure you index the user with each
document. Then, in your HitCollector.collect method, use a FieldSelector to
load ONLY the user ID from each document and add the score for that doc to
that user's total (you'll have to keep some sort of map, the usual Java
variety, to record this for different users). Run some timings on this
process to ensure your performance is adequate. That keeps your extra work
to a minimum.

If that doesn't work, you could create a map of doc IDs to users that you
access in your HitCollector.collect method to see which user to add the
current score to. This could be created by using TermDocs/TermEnum at, say,
index open time. Since you're not talking about a huge index here, this
shouldn't be too costly.

Best
Erick

On Tue, Feb 17, 2009 at 4:09 PM, AmigoProgrammer wrote:

>
> In previous posts I have used "document" for both a file (e.g. Word or PDF)
> and a Lucene document.
> Let me try again:
>
> A client can have many files, but a file has only one client.
>
> For some queries I am not interested in the individual files that match the
> query, but rather in the sum of the scores of the matching files, grouped
> by client. Hence the reference to 'group by'.
>
> Suppose the index contains three matching documents A, B and C with scores
> of 0.2, 0.1 and 0.5 respectively, where A and B are associated with client X
> and C is associated with client Y.
>
> The query should ideally return
> Y: 0.5
> X: 0.3 (sum of 0.2 and 0.1)
>
> I have made a small PoC index where all files for a client are added to the
> same Lucene document along with the client id as a keyword. This works fine
> for the above purpose, but does not allow me to query for individual
> documents, which I am also interested in.
>
> I haven't built the index yet, but I estimate an index of fewer than
> 100,000 documents. I hope to achieve response times of less than 2 secs.
>
> Unsure what you mean by 'user'?
>
> Best,
>
> Michael
>
>
> Erick Erickson wrote:
> >
> > Well, I can imagine several schemes; how suitable they are depends
> > upon some as yet unspecified characteristics of your problem space.
> >
> > You don't want to iterate blindly over the responses in a
> > HitCollector.collect method unless your index is quite small (see the
> > API docs for an explanation).
> >
> > If you don't have very many users, you could consider creating a Filter
> > at startup time, one for each user, with a bit set for each document
> > that user has (see TermDocs/TermEnum).
> >
> > You could *try* FieldSelector (aka Lazy Loading) to make document
> > fetching more efficient in your collect method. If you try this be sure
> > that your user field is indexed. Again, depending upon your index
> > characteristics this may or may not be viable.
> >
> > Instead of FieldSelector you could try using TermDocs/TermEnum in
> > your collect method to see if a user was indexed for a particular
> > document.
> >
> > You could also supply some more details about your index, e.g. number
> > of documents, number of users, whether more than one user is allowed
> > per document, and what response times you require. What is the larger
> > problem you're trying to solve, that is, what is your use case? Which
> > is another way of asking if this is an XY problem.
> >
> > Perhaps wiser heads than mine can come up with something clever with
> > enough details.
> >
> > Best
> > Erick
> >
> > On Tue, Feb 17, 2009 at 6:47 AM, AmigoProgrammer wrote:
> >
> >>
> >> A relevant client is one that is related to one or more documents found
> >> by a search.
> >>
> >> I would store the client as a keyword with each document, and I would
> >> like the query to return clients with the sum of their relevant
> >> documents' scores. A client with many low-scoring documents could be as
> >> relevant as a client with few high-scoring documents. Basically I am
> >> looking for 'group by'-like functionality.
> >>
> >> Best,
> >>
> >> Michael
> >>
> >>
> >> Erick Erickson wrote:
> >> >
> >> > What constitutes a "relevant client"? If you want
> >> > to restrict the returned documents to a particular client
> >> > (or even a set of clients) a simple +client:
> >> > would do the trick.....
> >> >
> >> > Or you could create a Filter for "relevant clients".
> >> >
> >> > If neither of these helps, could you clarify your
> >> > definition of a relevant client?
> >> >
> >> > Best
> >> > Erick
> >> >
> >> >
> >> > On Mon, Feb 16, 2009 at 3:00 PM, AmigoProgrammer wrote:
> >> >
> >> >>
> >> >> Hi,
> >> >>
> >> >> I have a number of documents that each relate to a client. I would
> >> >> like to use an index and queries to answer two questions:
> >> >> - Find relevant documents
> >> >> - Find relevant clients
> >> >>
> >> >> The first one is straightforward.
> >> >> For the second one, I am wondering.
> >> >> Should I iterate over the hits and compute the most relevant
> >> >> clients, or is there a clever built-in way of answering the
> >> >> question?
> >> >>
> >> >> Can anyone help me crack the nut?
> >> >>
> >> >> Best,
> >> >>
> >> >> Michael
> >> >> --
> >> >> View this message in context:
> >> >> http://www.nabble.com/Querying-for-a-catagory-tp22044596p22044596.html
> >> >> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
> >> >>
> >> >>
> >> >> ---------------------------------------------------------------------
> >> >> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> >> >> For additional commands, e-mail: java-user-help@lucene.apache.org
> >> >>
> >> >
> >>
> >> --
> >> View this message in context:
> >> http://www.nabble.com/Querying-for-a-catagory-tp22044596p22055571.html
> >>
> >
>
> --
> View this message in context:
> http://www.nabble.com/Querying-for-a-catagory-tp22044596p22066404.html
>

--000e0cd15534cf58950463242d2a--
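
The per-client score summing discussed in this thread might be sketched as follows. This is only an illustration in plain Java with no Lucene dependency: the class name ClientScoreCollector and the docToClient map are hypothetical, and the collect(int doc, float score) method merely mirrors the shape of HitCollector.collect. In a real setup the doc-ID-to-client map would have to be built from the index (for example with TermDocs/TermEnum at index open time), which is not shown here.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Lucene: sums hit scores per client.
class ClientScoreCollector {
    // Assumed to be prepared up front: Lucene doc ID -> client ID.
    private final Map<Integer, String> docToClient;
    // Running total of scores for each client seen so far.
    private final Map<String, Float> totals = new HashMap<>();

    ClientScoreCollector(Map<Integer, String> docToClient) {
        this.docToClient = docToClient;
    }

    // Same shape as HitCollector.collect: called once per matching document.
    void collect(int doc, float score) {
        String client = docToClient.get(doc);
        if (client == null) {
            return; // no client recorded for this document; skip it
        }
        totals.merge(client, score, Float::sum);
    }

    // Clients ordered by summed score, highest first.
    List<Map.Entry<String, Float>> ranked() {
        List<Map.Entry<String, Float>> entries = new ArrayList<>(totals.entrySet());
        entries.sort((a, b) -> Float.compare(b.getValue(), a.getValue()));
        return entries;
    }

    public static void main(String[] args) {
        // The example from the thread: A and B belong to X, C belongs to Y.
        Map<Integer, String> docToClient = new HashMap<>();
        docToClient.put(0, "X"); // document A
        docToClient.put(1, "X"); // document B
        docToClient.put(2, "Y"); // document C

        ClientScoreCollector collector = new ClientScoreCollector(docToClient);
        collector.collect(0, 0.2f);
        collector.collect(1, 0.1f);
        collector.collect(2, 0.5f);

        for (Map.Entry<String, Float> e : collector.ranked()) {
            System.out.println(e.getKey() + ": " + e.getValue());
        }
    }
}
```

With the scores from the thread, client Y (one hit of 0.5) ranks ahead of client X (0.2 plus 0.1). Keeping the lookup to a single in-memory map access per hit is the point of the second suggestion above; whether it is fast enough for a given index would still need to be timed.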