mahout-user mailing list archives

From Grant Ingersoll <gsing...@apache.org>
Subject Re: RecommenderJob and NaN
Date Tue, 11 Oct 2011 19:15:53 GMT

On Oct 11, 2011, at 2:54 PM, Sean Owen wrote:

> NaN is added for all user item pairs that already exist in the input, to
> make them ineligible for recommendation. That's normal - could this be the
> case?

Still trying to track it down.  I don't think it is the self-similarity case, but I'm not 100% sure.

> On Oct 11, 2011 7:49 PM, "Grant Ingersoll" <gsingers@apache.org> wrote:
> 
>> 
>> On Oct 11, 2011, at 12:36 PM, Sean Owen wrote:
>> 
>>> Where is the NaN coming up -- what has this value?
>> 
>> simColumn seems to be the originator in the Aggregate step.  For instance,
>> my current breakpoint shows:
>> {309682:0.9566912651062012,42938:0.9566912651062012,309672:NaN}
>> 
>> I can also see some in the PartialMultiplyMapper via the
>> similarityMatrixColumn.
>> 
>> Is that set by SimilarityMatrixRowWrapperMapper?
>> <code>
>> /* remove self similarity */
>>   similarityMatrixRow.set(key.get(), Double.NaN);
>> </code>
>> 
>> 
>> 
>>> It should be propagated in some cases but not others. I'm not aware of
>>> any changes here.
>> 
>> yeah, me neither.  This is all related to MAHOUT-798.
>> 
>>> 
>>> Generally small data sets will have this problem of not being able to
>>> compute much of anything useful, so NaN might be right here.
>>> But you say it was different recently, which seems to rule that out.
>> 
>> I also _believe_ I'm seeing it in a much larger data set on Hadoop, it's
>> just that's a whole lot harder to debug.
>> 
>>> 
>>> On Tue, Oct 11, 2011 at 5:34 PM, Grant Ingersoll <gsingers@apache.org> wrote:
>>>> I'm running trunk RecommenderJob (via build-asf-email.sh) and am not
>>>> getting any recommendations due to NaNs being calculated in the
>>>> AggregateAndRecommend step.  I'm not quite sure what is going on, as it
>>>> seems like this was working as little as two weeks ago (post Sebastian's
>>>> big change to RecJob), but I don't see a whole lot of changes in that
>>>> part of the code.
>>>> 
>>>> The data is user IDs mapping to email thread IDs.  My input data is
>>>> simply a triple of user ID, thread ID, 1 (meaning that the user
>>>> participated in that thread).  It seems like I will have a lot of good
>>>> values in the inputs to the AggregateAndRecommend step, except one ID
>>>> will be NaN, and this then seems to get added in and makes everything
>>>> NaN (I realize this is a very naive understanding).  I sense that I
>>>> should be looking upstream in the process for a fix, but I am not sure
>>>> where that is.
>>>> 
>>>> Any ideas where I should be looking to eliminate these NaNs?  If you
>>>> want to try this with a small data set, you can get it here:
>>>> http://www.lucidimagination.com/devzone/technical-articles/scaling-mahout
>>>> (but note the companion article is not published yet.)
>>>> 
>>>> Thanks,
>>>> Grant
>> 
>> 
>> 
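
For anyone following the thread, here is a minimal sketch of the failure mode described above: a single NaN entry in a similarity column turning an entire aggregated sum into NaN. This is plain Java, not Mahout's actual AggregateAndRecommend code. The class name, the method names, and the idea of simply skipping NaN entries are all assumptions made for illustration. The column values are the ones from the breakpoint quoted above, and the preference value of 1.0 matches the input triples.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; none of these names exist in Mahout.
public class NanPropagationSketch {

    // Naive weighted sum: any NaN similarity makes the whole sum NaN,
    // because NaN * x and NaN + x are both NaN in IEEE 754 arithmetic.
    public static double naiveSum(Map<Long, Double> simColumn, double pref) {
        double sum = 0.0;
        for (double sim : simColumn.values()) {
            sum += sim * pref;
        }
        return sum;
    }

    // Same sum, but NaN entries are treated as "no similarity" and skipped.
    public static double sumSkippingNaN(Map<Long, Double> simColumn, double pref) {
        double sum = 0.0;
        for (double sim : simColumn.values()) {
            if (!Double.isNaN(sim)) {
                sum += sim * pref;
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        // Values copied from the simColumn breakpoint quoted in this thread.
        Map<Long, Double> simColumn = new LinkedHashMap<>();
        simColumn.put(309682L, 0.9566912651062012);
        simColumn.put(42938L, 0.9566912651062012);
        simColumn.put(309672L, Double.NaN);

        System.out.println("naive sum:       " + naiveSum(simColumn, 1.0));
        System.out.println("skipping NaN:    " + sumSkippingNaN(simColumn, 1.0));
    }
}
```

The naive version prints NaN for the sum, which matches the symptom of getting no recommendations. Whether skipping NaN entries is the right fix depends on why the NaN is there: if it marks a self-similarity or an already-known user/item pair, it is meant to exclude that item, so the real fix may belong upstream.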

--------------------------------------------
Grant Ingersoll
http://www.lucidimagination.com
Lucene Eurocon 2011: http://www.lucene-eurocon.com

