mahout-user mailing list archives

From Jake Mannix <jake.man...@gmail.com>
Subject Re: LDA on single node is much faster than 20 nodes
Date Tue, 06 Sep 2011 23:28:22 GMT
You can't just set the block size; you need to configure the InputFormat
to change the number of splits.  For example, you can do:

    FileInputFormat.setMaxInputSplitSize(job, maxSizeInBytes);

and you'll force it to make more splits in your data set, and hence
more mappers.
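
To see why the cap matters for a 46 MB input, here is a rough sketch of
the split arithmetic as I understand it from the new-API FileInputFormat
(split size = max(minSize, min(maxSize, blockSize)), with a 10% slop on
the last split).  The class name SplitEstimate, the 64 MB block size and
the minimum size of 1 byte are assumptions for illustration, not values
taken from this thread; check them against your own cluster.

```java
public class SplitEstimate {
    // Slop factor: the final split may be up to 10% larger than splitSize.
    static final double SPLIT_SLOP = 1.1;

    // Split size is the block size, clamped by the configured min and max.
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // Count the splits produced for one splittable file of the given length.
    static int countSplits(long fileLength, long splitSize) {
        int splits = 0;
        long remaining = fileLength;
        while (((double) remaining) / splitSize > SPLIT_SLOP) {
            splits++;
            remaining -= splitSize;
        }
        if (remaining != 0) {
            splits++; // whatever is left becomes the final split
        }
        return splits;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long input = 46 * mb;      // input size mentioned in the thread
        long blockSize = 64 * mb;  // ASSUMPTION: HDFS block size
        long minSize = 1;          // ASSUMPTION: minimum split size

        // No cap: the max split size is effectively unbounded.
        long uncapped = computeSplitSize(blockSize, minSize, Long.MAX_VALUE);
        System.out.println("no cap:   " + countSplits(input, uncapped) + " split(s)");

        // Cap at 4 MB, as setMaxInputSplitSize(job, 4 * mb) would request.
        long capped = computeSplitSize(blockSize, minSize, 4 * mb);
        System.out.println("4 MB cap: " + countSplits(input, capped) + " split(s)");
    }
}
```

Under those assumptions the sketch reports 1 split with no cap and 12
splits with a 4 MB cap.  It only covers a single splittable file, so an
unsplittable compressed input would still get one mapper.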

  -jake

On Tue, Sep 6, 2011 at 4:12 PM, Dhruv Kumar <dkumar@ecs.umass.edu> wrote:

> On Tue, Sep 6, 2011 at 6:57 PM, Chris Lu <clu@atypon.com> wrote:
>
> > Thanks. Very helpful to me!
> >
> > I tried to change the setting of "mapred.map.tasks".  However, the number
> > of map tasks is still just one, on one of the 20 machines.
> >
> > ./elastic-mapreduce --create --alive \
> >   --num-instances 20 --name "LDA" \
> >   --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \
> >   --bootstrap-name "Configuring number of map tasks per job" \
> >   --args "-m,mapred.map.tasks=40"
> >
> > Does anyone know how to configure the number of mappers?
> > Again, the input size is only 46M.
> >
> > Chris
> >
> >
> > On 09/06/2011 12:09 PM, Ted Dunning wrote:
> >
> >> Well, I think that using small instances is a disaster in general.  The
> >> performance that you get from them can vary easily by an order of
> >> magnitude.  My own preference for real work is either m2xl or cc14xl.
> >> The latter machines give you nearly bare metal performance and no noisy
> >> neighbors.  The m2xl is typically very much underpriced on the spot
> >> market.
> >>
> >> Sean is right about your job being misconfigured.  The Hadoop overhead
> >> is considerable and you have only given it two threads to overcome that
> >> overhead.
> >>
> >> On Tue, Sep 6, 2011 at 6:12 PM, Sean Owen<srowen@gmail.com>  wrote:
> >>
> >>> That's your biggest issue, certainly. Only 2 mappers are running, even
> >>> though you have 20 machines available. Hadoop determines the number of
> >>> mappers based on input size, and your input isn't so big that it thinks
> >>> you need 20 workers. It's launching 33 reducers, so your cluster is put
> >>> to use there. But it's no wonder you're not seeing anything like 20x
> >>> speedup in the mapper.
> >>>
> >>> You can of course force it to use more mappers, and that's probably a
> >>> good idea here. -Dmapred.map.tasks=20 perhaps. More mappers means more
> >>> overhead of spinning up mappers to process less data, and Hadoop's guess
> >>> indicates that it thinks it's not efficient to use 20 workers. If you
> >>> know that those other 18 are otherwise idle, my guess is you'd benefit
> >>> from just making it use 20.
> >>>
> >>
>
> Sean,
>
> I too have always been confused about how Hadoop decides to set the number
> of mappers, so you could help my understanding here...
>
> Is -Dmapred.map.tasks just a hint to the framework for the number of
> mappers (just like using the combiner is a hint) or does it actually set
> the number of workers to that number (provided our input is large enough)?
>
> The reason I ask is because on
> http://wiki.apache.org/hadoop/HowManyMapsAndReduces, it is mentioned that
> the framework uses the HDFS block size to decide on the number of mapper
> workers to be invoked. Should we be setting that parameter instead?
>
>
> >
> >>> If this were a general large cluster where many people are taking
> >>> advantage of the workers, then I'd trust Hadoop's guesses until you are
> >>> sure you want to do otherwise.
> >>>
> >>> On Tue, Sep 6, 2011 at 7:02 PM, Chris Lu<clu@atypon.com>  wrote:
> >>>
> >>>> Thanks for all the suggestions!
> >>>>
> >>>> All the inputs are the same. It takes 85 hours for 4 iterations on 20
> >>>> Amazon small machines. On my local single node, it got to iteration 19
> >>>> for also 85 hours.
> >>>>
> >>>> Here is a section of the Amazon log output.
> >>>> It covers the start of iteration 1, and between iteration 4 and
> >>>> iteration 5.
> >>>>
> >>>> The number of map tasks is set to 2. Should it be larger or related to
> >>>> number of CPU cores?
> >>>>
> >>>>
> >>>>
> >
>
