mahout-user mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: ItemSimilarityJob
Date Thu, 12 Aug 2010 19:30:40 GMT
Jimmy Lin's stripes work was presented at the last Summit and there was
heated (well, warm and cordial at least) discussion with the Map-reduce
committers about whether good use of a combiner wouldn't do just as well.

My take-away as a spectator was that a combiner

a) is vastly easier to code,

b) is pretty certain to be within 2x in performance, and likely very
close to the same speed, and

c) does not need changing each time the underlying map-reduce
implementation changes.

My conclusion was that combiners were the way to go (for me).  Your mileage,
as always, will vary.
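For what it's worth, the combiner approach amounts to partial aggregation of the (item, item) pair counts before the shuffle. Here is a minimal sketch of that aggregation step, simulated in plain Java without Hadoop; the class and method names are illustrative, not Mahout's actual code:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PairsCombinerSketch {

    // Combiner logic: sum the partial counts a mapper emits for each
    // (itemA,itemB) pair key, so the shuffle carries one record per distinct
    // pair per map task instead of one record per raw co-occurrence.
    static Map<String, Integer> combine(List<Map.Entry<String, Integer>> mapOutput) {
        Map<String, Integer> partial = new HashMap<>();
        for (Map.Entry<String, Integer> e : mapOutput) {
            partial.merge(e.getKey(), e.getValue(), Integer::sum);
        }
        return partial;
    }

    public static void main(String[] args) {
        // Two users both rated items A and B; one of them also rated A and C,
        // so the mapper emits ("A,B", 1) twice and ("A,C", 1) once.
        List<Map.Entry<String, Integer>> emitted = List.of(
            Map.entry("A,B", 1), Map.entry("A,B", 1), Map.entry("A,C", 1));
        Map<String, Integer> combined = combine(emitted);
        // three emitted records collapse to two combined records
        System.out.println(combined.get("A,B") + " " + combined.get("A,C")); // prints "2 1"
    }
}
```

In a real Hadoop job the same summing class would be registered as the combiner, and the framework would apply it opportunistically on map output before the shuffle.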

On Thu, Aug 12, 2010 at 7:45 AM, Gökhan Çapan <gkhncpn@gmail.com> wrote:

> Hi,
> I haven't seen the code, but maybe Mahout needs some optimization when
> computing item-item co-occurrences. It might be re-implemented using the
> "stripes" approach with in-mapper combining, if it is not already. The
> approach is described in:
>
>   1. www.aclweb.org/anthology/D/D08/D08-1044.pdf
>
> If it already is, sorry for the post.
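In the "stripes" pattern from the Lin and Dyer paper linked above, the mapper accumulates, for each item, a map of co-occurring items to counts, and emits one such stripe per item instead of individual pairs. A rough illustration in plain Java (names are mine, not Mahout's; the in-mapper combining happens because stripes are built across all records before anything is emitted):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class StripesSketch {

    // In-mapper combining: accumulate one "stripe" (co-occurring item -> count)
    // per item across all input records seen by this map task, then emit each
    // stripe once at the end, rather than emitting a record per item pair.
    static Map<String, Map<String, Integer>> buildStripes(List<List<String>> userBaskets) {
        Map<String, Map<String, Integer>> stripes = new HashMap<>();
        for (List<String> basket : userBaskets) {
            for (String a : basket) {
                Map<String, Integer> stripe =
                    stripes.computeIfAbsent(a, k -> new HashMap<>());
                for (String b : basket) {
                    if (!a.equals(b)) {
                        stripe.merge(b, 1, Integer::sum);
                    }
                }
            }
        }
        return stripes;
    }

    public static void main(String[] args) {
        // One user rated A and B; another rated A, B and C.
        List<List<String>> baskets = List.of(
            List.of("A", "B"), List.of("A", "B", "C"));
        Map<String, Map<String, Integer>> stripes = buildStripes(baskets);
        System.out.println(stripes.get("A")); // A's stripe: B seen twice, C once
    }
}
```

The reducer side then merges stripes element-wise per item, which is the part the paper argues is cheaper than shuffling raw pairs.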
>
> On Thu, Aug 12, 2010 at 3:51 PM, Charly Lizarralde <
> charly.lizarralde@gmail.com> wrote:
>
> > Sebastian, thanks for the reply. The step name is
> > RowSimilarityJob-CooccurrencesMapper-SimilarityReducer, and each map task
> > takes around 10 hours to finish.
> >
> > The reduce task dir
> > (var/lib/hadoop-0.20/cache/hadoop/mapred/local/taskTracker/jobcache/job_201008111833_0007/attempt_201008111833_0007_r_000000_0/output)
> > has map output files (files like map_2.out), each one 5 GB in size.
> >
> > I have been looking at the code and saw what you describe in the e-mail.
> > It makes sense, but 160 GB of intermediate data from a 2.6 GB input file
> > still makes me wonder if something is wrong.
> >
> > Should I just wait for the patch?
> > Thanks again!
> > Charly
> >
> > On Thu, Aug 12, 2010 at 2:34 AM, Sebastian Schelter <
> > ssc.open@googlemail.com
> > > wrote:
> >
> > > Hi Charly,
> > >
> > > can you tell which Map/Reduce step was executed last before you ran out
> > > of disk space?
> > >
> > > I'm not familiar with the Netflix dataset and can only guess what
> > > happened, but I would say that you ran out of disk space because
> > > ItemSimilarityJob currently uses all preferences to compute the
> > > similarities. This makes it scale with the square of the number of
> > > occurrences of the most popular item, which is a bad thing if that
> > > number is huge. We need a way to limit the number of preferences
> > > considered per item; there is already a ticket for this
> > > (https://issues.apache.org/jira/browse/MAHOUT-460) and I plan to
> > > provide a patch in the next few days.
> > >
> > > --sebastian
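To put the quadratic scaling in numbers: if an item occurs n times, it contributes on the order of n * (n - 1) intermediate pair records, so one very popular item dominates the output. A back-of-the-envelope calculation (the rater counts are made up for illustration, not taken from the Netflix data):

```java
public class PairBlowup {
    public static void main(String[] args) {
        long popularItemOccurrences = 200_000L; // hypothetical popular item
        long nicheItemOccurrences = 200L;       // hypothetical niche item
        // intermediate pair records ~ n * (n - 1)
        long popularPairs = popularItemOccurrences * (popularItemOccurrences - 1);
        long nichePairs = nicheItemOccurrences * (nicheItemOccurrences - 1);
        // 1000x the occurrences -> roughly a million times the pair records
        System.out.println(popularPairs / nichePairs);
    }
}
```

This is why capping the number of preferences considered per item (as in MAHOUT-460) bounds the intermediate data directly.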
> > >
> > >
> > >
> > > On 12.08.2010 00:15, Charly Lizarralde wrote:
> > > > Hi, I am testing ItemSimilarityJob with Netflix data (2.6 GB) and I
> > > > have just run out of disk space (160 GB) in my mapred.local.dir when
> > > > running RowSimilarityJob.
> > > >
> > > > Is this normal behaviour? How can I improve this?
> > > >
> > > > Thanks!
> > > > Charly
> > > >
> > > >
> > >
> > >
> >
>
>
>
> --
> Gökhan Çapan
>
