mahout-user mailing list archives

From Lance Norskog <goks...@gmail.com>
Subject Re: RecommenderJob and NaN
Date Thu, 13 Oct 2011 06:33:30 GMT
Is this job working well for anyone now?
When was the last time this job worked for someone?

On Wed, Oct 12, 2011 at 11:30 AM, Grant Ingersoll <gsingers@apache.org> wrote:

> Both local and on EC2
>
> On Oct 12, 2011, at 2:10 PM, Ken Krugler wrote:
>
> > Hi Grant,
> >
> > Just curious, are you running this locally or distributed?
> >
> > I'd run into a similar issue, though in a completely different algorithm
> > (Jimmy Lin's PageRank implementation), due to the use of a static variable.
> >
> > When running locally, this wasn't getting cleared between loops, and thus
> > I got wonky results.
> >
> > The same thing would have happened with JVM reuse enabled.
> >
> > -- Ken
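
The static-variable pitfall Ken describes can be sketched in a few lines of Java. This is only an illustration under stated assumptions: `LeakyCounter`, `CleanCounter`, `runLeaky` and `runClean` are hypothetical names, not classes from Mahout or from Jimmy Lin's PageRank code. The sketch shows state in a static field surviving from one pass to the next when both passes share a JVM, as happens in local mode or with JVM reuse enabled.

```java
public class StaticStateSketch {

    // Hypothetical task class: the static field is shared by every
    // instance created in the same JVM, so it outlives a single pass.
    static class LeakyCounter {
        static double total = 0.0;

        void add(double v) {
            total += v;
        }

        double result() {
            return total;
        }
    }

    // Same logic with an instance field: each new object starts from zero.
    static class CleanCounter {
        private double total = 0.0;

        void add(double v) {
            total += v;
        }

        double result() {
            return total;
        }
    }

    // One "pass" over the input using the static-field version.
    static double runLeaky(double[] input) {
        LeakyCounter counter = new LeakyCounter();
        for (double v : input) {
            counter.add(v);
        }
        return counter.result();
    }

    // One "pass" over the input using the instance-field version.
    static double runClean(double[] input) {
        CleanCounter counter = new CleanCounter();
        for (double v : input) {
            counter.add(v);
        }
        return counter.result();
    }

    public static void main(String[] args) {
        double[] input = {1.0, 2.0};
        // Two passes in the same JVM, as in local mode.
        System.out.println(runLeaky(input)); // 3.0
        System.out.println(runLeaky(input)); // 6.0 (stale state from pass 1)
        System.out.println(runClean(input)); // 3.0
        System.out.println(runClean(input)); // 3.0
    }
}
```

When every task gets a fresh JVM, the static field starts from zero each time, which is why this kind of bug can stay hidden on a cluster and only show up locally or with JVM reuse.
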
> >
> > On Oct 12, 2011, at 3:28pm, Grant Ingersoll wrote:
> >
> >> Digging some more:
> >>
> >> In AggregateAndRecommend, around line 143, I have, for userId 0, a
> >> simColumn of:
> >>
> >> {22966:0.9566912651062012,81901:0.9566912651062012,263375:0.9566912651062012,263374:0.9566912651062012,263376:NaN}
> >>
> >> Which then becomes the numerator and the denom.
> >>
> >> Looping, my next simCol is:
> >>
> >> {22966:0.9566912651062012,81901:0.9566912651062012,263375:NaN,263374:0.9566912651062012,263376:0.9566912651062012}
> >>
> >> and then
> >>
> >> {22966:0.9566912651062012,81901:0.9566912651062012,263375:0.9566912651062012,263374:NaN,263376:0.9566912651062012}
> >>
> >> ...
> >>
> >> Each time, those are getting added into the numerators/denoms value,
> >> such that by the time we are done looping (line 161), we have:
> >> numerators: {22966:NaN,81901:NaN,263376:NaN,263375:NaN,263374:NaN}
> >> denoms: {22966:NaN,81901:NaN,263376:NaN,263375:NaN,263374:NaN}
> >>
> >> numberOfSimilarItemsUsed:
> >> {81901:5.0,22966:5.0,263376:5.0,263375:5.0,263374:5.0}
> >>
> >> Not sure how to interpret this, as I haven't dug into the math here
> >> yet or figured out where those NaNs are coming from originally.
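
The way one NaN per column ends up poisoning every accumulated entry can be reproduced with a short Java sketch. It is a simplification under stated assumptions and does not reproduce the actual AggregateAndRecommend code: `buildColumns` and `accumulate` are hypothetical helpers, the columns are plain arrays rather than Mahout vectors, and each column simply holds NaN in a different slot, as in the three simColumn dumps above. The `skipNaN` variant is only there for contrast. Whether skipping is the right fix for RecommenderJob is not something this sketch can say.

```java
import java.util.Arrays;

public class NaNAccumulationSketch {

    // Similarity value taken from the simColumn dumps in the thread.
    static final double SIM = 0.9566912651062012;

    // Hypothetical stand-in for the simColumns: n columns of length n,
    // where column i holds NaN at index i and SIM everywhere else.
    static double[][] buildColumns(int n) {
        double[][] columns = new double[n][n];
        for (int i = 0; i < n; i++) {
            Arrays.fill(columns[i], SIM);
            columns[i][i] = Double.NaN;
        }
        return columns;
    }

    // Element-wise running sum over all columns. With skipNaN == false,
    // any NaN term makes that entry's total NaN for good, because
    // NaN + x is NaN for every x.
    static double[] accumulate(double[][] columns, boolean skipNaN) {
        double[] totals = new double[columns[0].length];
        for (double[] column : columns) {
            for (int j = 0; j < column.length; j++) {
                double v = column[j];
                if (skipNaN && Double.isNaN(v)) {
                    continue;
                }
                totals[j] += v;
            }
        }
        return totals;
    }

    public static void main(String[] args) {
        double[][] columns = buildColumns(5);
        // prints [NaN, NaN, NaN, NaN, NaN]
        System.out.println(Arrays.toString(accumulate(columns, false)));
        // prints five finite totals, each the sum of four SIM values
        System.out.println(Arrays.toString(accumulate(columns, true)));
    }
}
```

Since each of the five item ids is the NaN slot in exactly one column, every entry of the plain running sum meets a NaN once, which matches the all-NaN numerators and denoms reported above.
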
> >>
> >> On Oct 11, 2011, at 2:55 PM, Grant Ingersoll wrote:
> >>
> >>>
> >>> On Oct 11, 2011, at 2:49 PM, Grant Ingersoll wrote:
> >>>
> >>>>
> >>>> On Oct 11, 2011, at 12:36 PM, Sean Owen wrote:
> >>>>
> >>>>> Where is the NaN coming up -- what has this value?
> >>>>
> >>>> simColumn seems to be the originator in the Aggregate step.  For
> >>>> instance, my current breakpoint shows:
> >>>> {309682:0.9566912651062012,42938:0.9566912651062012,309672:NaN}
> >>>>
> >>>> I can also see some in the PartialMultiplyMapper via the
> >>>> similarityMatrixColumn.
> >>>>
> >>>> Is that set by SimilarityMatrixRowWrapperMapper?
> >>>> <code>
> >>>> /* remove self similarity */
> >>>> similarityMatrixRow.set(key.get(), Double.NaN);
> >>>> </code>
> >>>
> >>> Ah, but that is just taking care of itself, so maybe not the issue.
> >>>
> >>>>
> >>>>
> >>>>
> >>>>> It should be propagated in some cases but not others. I'm not aware of
> >>>>> any changes here.
> >>>>
> >>>> yeah, me neither.  This is all related to MAHOUT-798.
> >>>>
> >>>>>
> >>>>> Generally small data sets will have this problem of not being able to
> >>>>> compute much of anything useful, so NaN might be right here.
> >>>>> But you say it was different recently, which seems to rule that out.
> >>>>
> >>>> I also _believe_ I'm seeing it in a much larger data set on Hadoop;
> >>>> it's just that it's a whole lot harder to debug there.
> >>>>
> >>>>>
> >>>>> On Tue, Oct 11, 2011 at 5:34 PM, Grant Ingersoll <gsingers@apache.org> wrote:
> >>>>>> I'm running trunk RecommenderJob (via build-asf-email.sh) and am not
> >>>>>> getting any recommendations due to NaNs being calculated in the
> >>>>>> AggregateAndRecommend step.  I'm not quite sure what is going on, as it seems
> >>>>>> like this was working as recently as two weeks ago (post Sebastian's big
> >>>>>> change to RecJob), but I don't see a whole lot of changes in that part of
> >>>>>> the code.
> >>>>>>
> >>>>>> The data is user ids mapping to email thread ids.  My input data is
> >>>>>> simply a triple of user id, thread id, 1 (meaning that user participated in
> >>>>>> that thread).  It seems like I will have a lot of good values in the inputs
> >>>>>> to the AggregateAndRecommend step, except one id will be NaN, and this then
> >>>>>> seems to get added in and makes everything NaN (I realize this is a very
> >>>>>> naive understanding).  I sense that I should be looking upstream in the
> >>>>>> process for a fix, but I am not sure where that is.
> >>>>>>
> >>>>>> Any ideas where I should be looking to eliminate these NaNs?  If you
> >>>>>> want to try this with a small data set, you can get it here:
> >>>>>> http://www.lucidimagination.com/devzone/technical-articles/scaling-mahout
> >>>>>> (but note the companion article is not published yet).
> >>>>>>
> >>>>>> Thanks,
> >>>>>> Grant
> >>>>
> >>>>
> >>>
> >>> --------------------------------------------
> >>> Grant Ingersoll
> >>> http://www.lucidimagination.com
> >>> Lucene Eurocon 2011: http://www.lucene-eurocon.com
> >>>
> >>
> >> --------------------------------------------
> >> Grant Ingersoll
> >> http://www.lucidimagination.com
> >> Lucene Eurocon 2011: http://www.lucene-eurocon.com
> >>
> >
> > --------------------------
> > Ken Krugler
> > +1 530-210-6378
> > http://bixolabs.com
> > custom big data solutions & training
> > Hadoop, Cascading, Mahout & Solr
> >
> >
> >
>
> --------------------------------------------
> Grant Ingersoll
> http://www.lucidimagination.com
> Lucene Eurocon 2011: http://www.lucene-eurocon.com
>
>


-- 
Lance Norskog
goksron@gmail.com
