mahout-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Grant Ingersoll <>
Subject Re: TopItems.getTopUsers()
Date Wed, 09 Nov 2011 13:52:50 GMT

On Nov 8, 2011, at 11:24 PM, Sean Owen wrote:

> The PriorityQueue there is a min heap. It's used to keep finding the
> smallest among a collection of large values. So, no it can't be used
> directly to create a list of items from large to small as it has a
> "get smallest" method, not "get largest".

Notice the code in Lucene.  It is solving the exact same problem.  Return the top X things
by score.  Our ScoreDoc is made of a score and an id.

> The result of the method is a list of IDs, not a list of item-score
> pairs. This is the reason for the last step where it's converted into
> just the IDs.

Yeah, I know.  But why all the shuffling around?  We have all we need on the PQ already.

> The size of queue here is typically really small, like 10 items. The
> queue implementation probably makes no measurable difference at that
> size.

Multiplied by millions or requests, it adds up.  I think Recommenders, especially, are to
the point where it's time to start tightening the screws on performance.   I've been reviewing
this stuff quite a bit lately and am struck by how similar this stuff is to Lucene's core
scoring, albeit from a few years ago (pre Lucene 2.4, by my guess).  For instance, our "IDRescorer",
is modeled via a Collector class and bit set filters and it's integrated directly into the
scoring, whereas Mahout sorts after scoring.

> On Wed, Nov 9, 2011 at 12:46 AM, Grant Ingersoll <> wrote:
>> I've been reading code and am wondering about TopItems.getTopUsers
>> Here's my pseudocode of it (lines 96-134 in TopItems)
>> Get a Priority Queue (PQ)
>> for all users
>>        estimate the similarity of user i with our current user (the one we are generating
a rec for)
>>        put a SimilarUser object on the PQ.
>> //HERE
>> Create a list of SimilarUser objects
>> Populate that list with the PQ objects, which also contains SimilarUser objects
>> Sort that list
>> Iterate over that list once more and grab the ids and put it into an array
>> //HERE
>> return the array
>> Why all that in between moving stuff marked by //HERE?  Isn't that traversing the
same list 3 times?  Can't we just pop from the priority queue in reverse order and fill the
array from back to front?  Am I missing something?
>> Here's the same code in Lucene:
>> <snip>
>> protected void populateResults(ScoreDoc[] results, int howMany) {
>>    for (int i = howMany - 1; i >= 0; i--) {
>>      results[i] = pq.pop();
>>    }
>>  }
>> </snip>
>> Also, FWIW, Lucene's PQ implementation is faster than Java's, from what I understand
(which is why we wrote our own).
>> -Grant

Grant Ingersoll

  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message