mahout-user mailing list archives

From Grant Ingersoll <gsing...@apache.org>
Subject Re: RecommenderJob and NaN
Date Wed, 12 Oct 2011 13:28:15 GMT
Digging some more:

In AggregateAndRecommend, around line 143, I have, for userId 0, a simColumn of:
{22966:0.9566912651062012,81901:0.9566912651062012,263375:0.9566912651062012,263374:0.9566912651062012,263376:NaN}

Which then becomes the numerator and the denom.

Looping, my next simCol is:
{22966:0.9566912651062012,81901:0.9566912651062012,263375:NaN,263374:0.9566912651062012,263376:0.9566912651062012}

and then
{22966:0.9566912651062012,81901:0.9566912651062012,263375:0.9566912651062012,263374:NaN,263376:0.9566912651062012}

...

Each time, those are getting added into the numerators/denoms value, such that by the time
we are done looping (line 161), we have:
numerators: {22966:NaN,81901:NaN,263376:NaN,263375:NaN,263374:NaN}
denoms: {22966:NaN,81901:NaN,263376:NaN,263375:NaN,263374:NaN}

numberOfSimilarItemsUsed: {81901:5.0,22966:5.0,263376:5.0,263375:5.0,263374:5.0}

Not sure how to interpret this, as I haven't dug into the math here yet or figured out where
those NaNs are coming from originally.
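
To illustrate the accumulation described above, here is a minimal standalone Java sketch. It is not the Mahout code: the class name, the method names, and the item ids 1-3 are made up for the example, and plain maps stand in for the real vectors. It only shows that once each column carries a NaN at a different index, an element-wise running sum ends up NaN everywhere, while a sum that skips NaN entries does not. Skipping NaNs is shown purely for contrast; I am not claiming it is the right fix here.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class NanPropagationSketch {

    /** Element-wise sum of sparse columns. Any NaN entry poisons its index for good. */
    public static Map<Integer, Double> naiveSum(List<Map<Integer, Double>> columns) {
        Map<Integer, Double> sum = new LinkedHashMap<>();
        for (Map<Integer, Double> column : columns) {
            for (Map.Entry<Integer, Double> e : column.entrySet()) {
                // NaN + x is NaN, so one NaN contribution is enough.
                sum.merge(e.getKey(), e.getValue(), Double::sum);
            }
        }
        return sum;
    }

    /** Same sum, but NaN entries are ignored (an assumption, shown only for contrast). */
    public static Map<Integer, Double> nanSkippingSum(List<Map<Integer, Double>> columns) {
        Map<Integer, Double> sum = new LinkedHashMap<>();
        for (Map<Integer, Double> column : columns) {
            for (Map.Entry<Integer, Double> e : column.entrySet()) {
                if (!Double.isNaN(e.getValue())) {
                    sum.merge(e.getKey(), e.getValue(), Double::sum);
                }
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        // Hypothetical columns: each one has a NaN at a different item id,
        // mirroring the pattern in the simColumn dumps above.
        List<Map<Integer, Double>> columns = List.of(
            Map.of(1, 0.5, 2, 0.5, 3, Double.NaN),
            Map.of(1, 0.5, 2, Double.NaN, 3, 0.5),
            Map.of(1, Double.NaN, 2, 0.5, 3, 0.5));

        System.out.println("naive sum:        " + naiveSum(columns));
        System.out.println("NaN-skipping sum: " + nanSkippingSum(columns));
    }
}
```

With five items and five columns, each with its NaN in a different slot, the naive version gives the all-NaN numerators and denoms shown above.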

On Oct 11, 2011, at 2:55 PM, Grant Ingersoll wrote:

> 
> On Oct 11, 2011, at 2:49 PM, Grant Ingersoll wrote:
> 
>> 
>> On Oct 11, 2011, at 12:36 PM, Sean Owen wrote:
>> 
>>> Where is the NaN coming up -- what has this value?
>> 
>> simColumn seems to be the originator in the Aggregate step.  For instance, my current breakpoint shows:
>> {309682:0.9566912651062012,42938:0.9566912651062012,309672:NaN}
>> 
>> I can also see some in the PartialMultiplyMapper via the similarityMatrixColumn.
>> 
>> Is that set by SimilarityMatrixRowWrapperMapper?
>> <code>
>> /* remove self similarity */
>>   similarityMatrixRow.set(key.get(), Double.NaN);
>> </code>
> 
> Ah, but that is just taking care of itself, so maybe not the issue.
> 
>> 
>> 
>> 
>>> It should be propagated in some cases but not others. I'm not aware of
>>> any changes here.
>> 
>> yeah, me neither.  This is all related to MAHOUT-798.
>> 
>>> 
>>> Generally small data sets will have this problem of not being able to
>>> compute much of anything useful, so NaN might be right here.
>>> But you say it was different recently, which seems to rule that out.
>> 
>> I also _believe_ I'm seeing it in a much larger data set on Hadoop, it's just that's a whole lot harder to debug.
>> 
>>> 
>>> On Tue, Oct 11, 2011 at 5:34 PM, Grant Ingersoll <gsingers@apache.org> wrote:
>>>> I'm running trunk RecommenderJob (via build-asf-email.sh) and am not getting any recommendations due to NaNs being calculated in the AggregateAndRecommend step.  I'm not quite sure what is going on as it seems like this was working as little as two weeks ago (post Sebastian's big change to RecJob), but I don't see a whole lot of changes in that part of the code.
>>>> 
>>>> The data is user ids mapping to email thread ids.  My input data is simply a triple of user id, thread id, 1 (meaning that user participated in that thread).  It seems like I will have a lot of good values in the inputs to the AggregateAndRecommend step, except one id will be NaN and this then seems to get added in and makes everything NaN (I realize this is a very naive understanding).  I sense that I should be looking upstream in the process for a fix, but I am not sure where that is.
>>>> 
>>>> Any ideas where I should be looking to eliminate these NaNs?  If you want to try this with a small data set, you can get it here: http://www.lucidimagination.com/devzone/technical-articles/scaling-mahout (but note the companion article is not published yet.)
>>>> 
>>>> Thanks,
>>>> Grant
>> 
>> 
> 
> --------------------------------------------
> Grant Ingersoll
> http://www.lucidimagination.com
> Lucene Eurocon 2011: http://www.lucene-eurocon.com
> 

--------------------------------------------
Grant Ingersoll
http://www.lucidimagination.com
Lucene Eurocon 2011: http://www.lucene-eurocon.com

