Return-Path: X-Original-To: apmail-mahout-dev-archive@www.apache.org Delivered-To: apmail-mahout-dev-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id E84EB7D52 for ; Wed, 9 Nov 2011 13:52:52 +0000 (UTC) Received: (qmail 23438 invoked by uid 500); 9 Nov 2011 13:52:52 -0000 Delivered-To: apmail-mahout-dev-archive@mahout.apache.org Received: (qmail 23342 invoked by uid 500); 9 Nov 2011 13:52:52 -0000 Mailing-List: contact dev-help@mahout.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: dev@mahout.apache.org Delivered-To: mailing list dev@mahout.apache.org Received: (qmail 23332 invoked by uid 99); 9 Nov 2011 13:52:52 -0000 Received: from minotaur.apache.org (HELO minotaur.apache.org) (140.211.11.9) by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 09 Nov 2011 13:52:52 +0000 Received: from localhost (HELO [192.168.101.91]) (127.0.0.1) (smtp-auth username gsingers, mechanism plain) by minotaur.apache.org (qpsmtpd/0.29) with ESMTP; Wed, 09 Nov 2011 13:52:52 +0000 From: Grant Ingersoll Mime-Version: 1.0 (Apple Message framework v1251.1) Content-Type: multipart/alternative; boundary="Apple-Mail=_B9789A38-9C60-4782-897B-746C55C1E9C0" Subject: Re: TopItems.getTopUsers() Date: Wed, 9 Nov 2011 05:52:50 -0800 In-Reply-To: To: dev@mahout.apache.org References: <11D9D778-2A6F-4C74-84DF-71FF5B1BC511@apache.org> Message-Id: <6D63B23E-2C0D-4245-9B87-8595367BA016@apache.org> X-Mailer: Apple Mail (2.1251.1) --Apple-Mail=_B9789A38-9C60-4782-897B-746C55C1E9C0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset=us-ascii On Nov 8, 2011, at 11:24 PM, Sean Owen wrote: > The PriorityQueue there is a min heap. It's used to keep finding the > smallest among a collection of large values. So, no it can't be used > directly to create a list of items from large to small as it has a > "get smallest" method, not "get largest". Notice the code in Lucene. It is solving the exact same problem. = Return the top X things by score. Our ScoreDoc is made of a score and = an id. >=20 > The result of the method is a list of IDs, not a list of item-score > pairs. This is the reason for the last step where it's converted into > just the IDs. Yeah, I know. But why all the shuffling around? We have all we need on = the PQ already. >=20 > The size of queue here is typically really small, like 10 items. The > queue implementation probably makes no measurable difference at that > size. Multiplied by millions or requests, it adds up. I think Recommenders, = especially, are to the point where it's time to start tightening the = screws on performance. I've been reviewing this stuff quite a bit = lately and am struck by how similar this stuff is to Lucene's core = scoring, albeit from a few years ago (pre Lucene 2.4, by my guess). For = instance, our "IDRescorer", is modeled via a Collector class and bit set = filters and it's integrated directly into the scoring, whereas Mahout = sorts after scoring. >=20 > On Wed, Nov 9, 2011 at 12:46 AM, Grant Ingersoll = wrote: >> I've been reading code and am wondering about TopItems.getTopUsers >>=20 >> Here's my pseudocode of it (lines 96-134 in TopItems) >> Get a Priority Queue (PQ) >> for all users >> estimate the similarity of user i with our current user (the = one we are generating a rec for) >> put a SimilarUser object on the PQ. >>=20 >> //HERE >> Create a list of SimilarUser objects >> Populate that list with the PQ objects, which also contains = SimilarUser objects >> Sort that list >> Iterate over that list once more and grab the ids and put it into an = array >> //HERE >> return the array >>=20 >> Why all that in between moving stuff marked by //HERE? Isn't that = traversing the same list 3 times? Can't we just pop from the priority = queue in reverse order and fill the array from back to front? Am I = missing something? >>=20 >> Here's the same code in Lucene: >> >> protected void populateResults(ScoreDoc[] results, int howMany) { >> for (int i =3D howMany - 1; i >=3D 0; i--) { >> results[i] =3D pq.pop(); >> } >> } >> >>=20 >> Also, FWIW, Lucene's PQ implementation is faster than Java's, from = what I understand (which is why we wrote our own). >>=20 >> -Grant -------------------------------------------- Grant Ingersoll http://www.lucidimagination.com --Apple-Mail=_B9789A38-9C60-4782-897B-746C55C1E9C0--