Mailing-List: contact dev-help@mahout.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@mahout.apache.org
From: Grant Ingersoll <gsingers@apache.org>
Mime-Version: 1.0 (Apple Message framework v1251.1)
Content-Type: multipart/alternative;
 boundary="Apple-Mail=_B9789A38-9C60-4782-897B-746C55C1E9C0"
Subject: Re: TopItems.getTopUsers()
Date: Wed, 9 Nov 2011 05:52:50 -0800
In-Reply-To: 
 <CAEccTyxa_qG9fabiQYmcA8-9GeexWWZfNLZ9fLMRB1T1id_1cA@mail.gmail.com>
To: dev@mahout.apache.org
References: <11D9D778-2A6F-4C74-84DF-71FF5B1BC511@apache.org>
 <CAEccTyxa_qG9fabiQYmcA8-9GeexWWZfNLZ9fLMRB1T1id_1cA@mail.gmail.com>
Message-Id: <6D63B23E-2C0D-4245-9B87-8595367BA016@apache.org>

--Apple-Mail=_B9789A38-9C60-4782-897B-746C55C1E9C0
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain;
	charset=us-ascii


On Nov 8, 2011, at 11:24 PM, Sean Owen wrote:

> The PriorityQueue there is a min heap. It's used to keep finding the
> smallest among a collection of large values. So, no it can't be used
> directly to create a list of items from large to small as it has a
> "get smallest" method, not "get largest".

Notice the code in Lucene.  It is solving the exact same problem.  =
Return the top X things by score.  Our ScoreDoc is made of a score and =
an id.

>=20
> The result of the method is a list of IDs, not a list of item-score
> pairs. This is the reason for the last step where it's converted into
> just the IDs.

Yeah, I know.  But why all the shuffling around?  We have all we need on =
the PQ already.

>=20
> The size of queue here is typically really small, like 10 items. The
> queue implementation probably makes no measurable difference at that
> size.

Multiplied by millions or requests, it adds up.  I think Recommenders, =
especially, are to the point where it's time to start tightening the =
screws on performance.   I've been reviewing this stuff quite a bit =
lately and am struck by how similar this stuff is to Lucene's core =
scoring, albeit from a few years ago (pre Lucene 2.4, by my guess).  For =
instance, our "IDRescorer", is modeled via a Collector class and bit set =
filters and it's integrated directly into the scoring, whereas Mahout =
sorts after scoring.


>=20
> On Wed, Nov 9, 2011 at 12:46 AM, Grant Ingersoll <gsingers@apache.org> =
wrote:
>> I've been reading code and am wondering about TopItems.getTopUsers
>>=20
>> Here's my pseudocode of it (lines 96-134 in TopItems)
>> Get a Priority Queue (PQ)
>> for all users
>>        estimate the similarity of user i with our current user (the =
one we are generating a rec for)
>>        put a SimilarUser object on the PQ.
>>=20
>> //HERE
>> Create a list of SimilarUser objects
>> Populate that list with the PQ objects, which also contains =
SimilarUser objects
>> Sort that list
>> Iterate over that list once more and grab the ids and put it into an =
array
>> //HERE
>> return the array
>>=20
>> Why all that in between moving stuff marked by //HERE?  Isn't that =
traversing the same list 3 times?  Can't we just pop from the priority =
queue in reverse order and fill the array from back to front?  Am I =
missing something?
>>=20
>> Here's the same code in Lucene:
>> <snip>
>> protected void populateResults(ScoreDoc[] results, int howMany) {
>>    for (int i =3D howMany - 1; i >=3D 0; i--) {
>>      results[i] =3D pq.pop();
>>    }
>>  }
>> </snip>
>>=20
>> Also, FWIW, Lucene's PQ implementation is faster than Java's, from =
what I understand (which is why we wrote our own).
>>=20
>> -Grant

--------------------------------------------
Grant Ingersoll
http://www.lucidimagination.com


--Apple-Mail=_B9789A38-9C60-4782-897B-746C55C1E9C0--