mahout-user mailing list archives

From Dmitriy Lyubimov <dlie...@gmail.com>
Subject Re: LDA/CVB Performance
Date Thu, 13 Jun 2013 23:02:39 GMT
Yes, but would the task tracker check it? I am quite dubious about that.

The only resource manager I know of that does these tricks is Mesos, and
then the framework (such as Spark) has to support it too. With Spark,
indeed, you can do these things on a per-session basis (so-called
"coarse-grained" vs. "fine-grained" scheduling). And then this probably
needs to be integrated properly with the broadcasting mechanism.
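
For example, a hedged sketch against Spark-0.7-era configuration on
Mesos (property names assumed from that release, not verified here):

  // Java: switch Spark-on-Mesos from the default fine-grained mode to
  // coarse-grained mode, and cap the cores this session may claim.
  // These must be set before the SparkContext is created.
  System.setProperty("spark.mesos.coarse", "true");
  System.setProperty("spark.cores.max", "16");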


On Thu, Jun 13, 2013 at 1:41 PM, Suneel Marthi <suneel_marthi@yahoo.com> wrote:

> You should be able to programmatically override the setting for
> mapred.tasktracker.map.tasks.maximum if it's not marked as 'final' in your
> Hadoop setup.
> Check your mapred-site.xml to verify that.
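>
> For reference, a final'd entry in mapred-site.xml looks like this (the
> value here is just an example):
>
>   <property>
>     <name>mapred.tasktracker.map.tasks.maximum</name>
>     <value>8</value>
>     <final>true</final>
>   </property>
>
> With <final>true</final> present, per-job -D overrides are silently
> ignored.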
>
> In my env it's marked as final, so I don't have the luxury of overriding it.
>
>
>
>
> ________________________________
>  From: Sebastian Schelter <ssc@apache.org>
> To: user@mahout.apache.org
> Sent: Thursday, June 13, 2013 4:36 PM
> Subject: Re: LDA/CVB Performance
>
>
> I'd have to recheck with my colleague, but I'm pretty sure it worked
> when we tried it. What should prevent the jobtracker from scheduling
> only one map task per machine/tasktracker for a specific job?
>
>
> -sebastian
>
> On 13.06.2013 22:31, Alan Gardner wrote:
> > mapred.tasktracker.map.tasks.maximum is loaded at tasktracker startup;
> > I don't think you can configure it per job. More granular resource
> > control is a job for Mesos or YARN; MR doesn't support this sort of
> > thing.
> >
> > I think for our deployment we'll carve out a chunk of the cluster, sized
> > and configured exclusively to do ML 24/7. If we don't need that much
> > capacity, it might be better to spin up an Elastic MapReduce cluster for
> > a few hours every day.
> >
> >
> > On Thu, Jun 13, 2013 at 4:20 PM, Sebastian Schelter <ssc@apache.org> wrote:
> >
> >> On 13.06.2013 22:12, Dmitriy Lyubimov wrote:
> >>> On Thu, Jun 13, 2013 at 1:00 PM, Sebastian Schelter <ssc@apache.org> wrote:
> >>>
> >>>> This table is read-only, right? We could try to apply the trick from
> >>>> our ALS code: instead of running one mapper per core (and thus having
> >>>> one copy of the table per core), run a multithreaded mapper and share
> >>>> the table between its threads. Works very well for ALS.
> >>>
> >>>
> >>> Just out of my ignorance, how will you tell MR that your mapper is
> >>> using more than 1 core and that it doesn't have to run more than 1
> >>> mapper of that type per box?
> >>
> >> You need to use a MultithreadedMapper, for which you can set the size of
> >> the thread pool via MultithreadedMapper.setNumberOfThreads(...).
> >>
> >> You can configure the maximum number of mappers to run per task tracker
> >> with -Dmapred.tasktracker.map.tasks.maximum=x.
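> >>
> >> A minimal driver-side sketch (the inner mapper class name is
> >> hypothetical):
> >>
> >>   import org.apache.hadoop.mapreduce.Job;
> >>   import org.apache.hadoop.mapreduce.lib.map.MultithreadedMapper;
> >>
> >>   Job job = new Job(conf, "cvb-iteration");
> >>   // the job runs MultithreadedMapper, which fans records out to a
> >>   // pool of threads sharing the single mapper JVM's heap
> >>   job.setMapperClass(MultithreadedMapper.class);
> >>   MultithreadedMapper.setMapperClass(job, CVBInferenceMapper.class);
> >>   MultithreadedMapper.setNumberOfThreads(job, 8);
> >>
> >> Note the wrapped map logic has to be thread-safe. Combined with
> >> mapred.tasktracker.map.tasks.maximum=1, this gives you one copy of the
> >> table per machine instead of one per core.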
> >>
> >>>
> >>>
> >>>> We can also cache
> >>>> the table in a static variable and make Hadoop reuse JVMs, which
> >>>> increases performance if the number of blocks to process is larger
> >>>> than the number of map slots.
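> >>>>
> >>>> A sketch of the static-cache idea (class, field, and loader names
> >>>> are hypothetical, not from our codebase):
> >>>>
> >>>>   import java.io.IOException;
> >>>>   import org.apache.hadoop.io.IntWritable;
> >>>>   import org.apache.hadoop.mapreduce.Mapper;
> >>>>   import org.apache.mahout.math.Matrix;
> >>>>   import org.apache.mahout.math.VectorWritable;
> >>>>
> >>>>   public class CachedTableMapper extends Mapper<IntWritable,
> >>>>       VectorWritable, IntWritable, VectorWritable> {
> >>>>
> >>>>     // survives across tasks of this job when JVM reuse is enabled
> >>>>     private static Matrix topicTermCounts;
> >>>>
> >>>>     @Override
> >>>>     protected void setup(Context ctx) throws IOException {
> >>>>       synchronized (CachedTableMapper.class) {
> >>>>         if (topicTermCounts == null) {
> >>>>           // hypothetical helper that loads the table once per JVM
> >>>>           topicTermCounts = loadTable(ctx.getConfiguration());
> >>>>         }
> >>>>       }
> >>>>     }
> >>>>   }
> >>>>
> >>>> JVM reuse itself is switched on with
> >>>> -Dmapred.job.reuse.jvm.num.tasks=-1 (unlimited reuse within a job).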
> >>>>
> >>>
> >>> This is usually (or might be) something the admin doesn't let us
> >>> override. Also I am not sure if JVM reuse in Hadoop is isolated between
> >>> different jobs (so other tasks may inherit stuff they probably don't
> >>> want).
> >>
> >> The JVM is only reused during a single job.
> >>
> >>>>
> >>>> -sebastian
> >>>>
> >>>> On 13.06.2013 21:56, Ted Dunning wrote:
> >>>>> On Thu, Jun 13, 2013 at 6:50 PM, Jake Mannix <jake.mannix@gmail.com> wrote:
> >>>>>
> >>>>>> Andy, note that he said he's running with a 1.6M-term dictionary.
> >>>>>> That's going to be 2 * 200 * 1.6M * 8B = 5.1GB for just the
> >>>>>> term-topic matrices. Still not hitting 8GB, but getting closer.
> >>>>>>
> >>>>>
> >>>>> It will likely be even worse unless this table is shared between
> >>>>> mappers. With 8 mappers per node, this goes to 41GB. The OP didn't
> >>>>> mention machine configuration, but this could easily cause swapping.
> >>>>>
> >>>>
> >>>>
> >>>
> >>
> >>
> >
> >
>
