mahout-user mailing list archives

From Jason L Shaw <jls...@uw.edu>
Subject Re: Getting InMemBuilder to use more mappers
Date Fri, 30 Mar 2012 15:51:46 GMT
Well, it looks like there's no solution for me right now.

mapred.map.tasks is indeed just a suggestion -- no effect
mapred.max.split.size does not exist as an option, at least according to
http://hadoop.apache.org/common/docs/current/mapred-default.html, and when
I tried it -- no effect
Splitting up my input might work in principle, but the Decision Forest
code in Mahout cannot currently train on multiple input files:
https://cwiki.apache.org/confluence/display/MAHOUT/Partial+Implementation

So I think I will need to look for another solution to my problem.  I could
just train many small decision forests and then combine them -- does Mahout
provide a slick way to combine the predictions of multiple models?
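For what it's worth, combining several small forests by majority vote is easy to do outside Mahout; below is a minimal sketch (a hypothetical helper, not a Mahout API):

```python
# Hypothetical sketch (not part of Mahout): combine the class predictions
# of several independently trained decision forests by majority vote.
from collections import Counter

def combine_predictions(per_forest_predictions):
    """per_forest_predictions: one list of predicted labels per forest,
    all the same length (one label per record)."""
    combined = []
    for votes in zip(*per_forest_predictions):
        # Most common label across forests wins; ties break arbitrarily.
        combined.append(Counter(votes).most_common(1)[0][0])
    return combined

# Three small forests voting on four records:
forest_a = ["spam", "ham", "ham", "spam"]
forest_b = ["spam", "spam", "ham", "ham"]
forest_c = ["ham", "spam", "ham", "spam"]
print(combine_predictions([forest_a, forest_b, forest_c]))
# -> ['spam', 'spam', 'ham', 'spam']
```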

On Fri, Mar 30, 2012 at 12:54 AM, deneche abdelhakim <adeneche@gmail.com> wrote:

> -Dmapred.map.tasks=N only gives a suggestion to Hadoop, and in most
> cases (especially when the data is small) Hadoop doesn't take it into
> consideration. To generate more mappers, use -Dmapred.max.split.size=S,
> where S is the size of each data partition in bytes. So with your data
> at ~31000000 bytes, if you want to generate 100 partitions (mappers),
> S should be 310000 (31000000/100).
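The arithmetic in the reply above generalizes: divide total input size by the desired mapper count. A quick sketch of the same calculation (`split_size_for` is a hypothetical helper, not a Hadoop or Mahout utility):

```python
import math

def split_size_for(total_bytes, desired_mappers):
    # Size S to pass as -Dmapred.max.split.size so that
    # ceil(total_bytes / S) comes out near desired_mappers.
    return total_bytes // desired_mappers

# The numbers from this thread: ~31 MB of input, 100 mappers wanted.
s = split_size_for(31_000_000, 100)
print(s)                           # -> 310000
print(math.ceil(31_000_000 / s))   # -> 100
```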
>
>
>
> On Thu, Mar 29, 2012 at 11:08 PM, Ted Dunning <ted.dunning@gmail.com>
> wrote:
>
> > Split your training data into lots of little files.  Depending on the
> > wind, that may cause more mappers to be invoked.
> >
> > On Thu, Mar 29, 2012 at 3:05 PM, Jason L Shaw <jlshaw@uw.edu> wrote:
> >
> > > Suggestion, indeed.  I passed that option, but still only 2 mappers
> > > were created.
> > >
> > > On Thu, Mar 29, 2012 at 5:23 PM, Sean Owen <srowen@gmail.com> wrote:
> > >
> > > > Hadoop is what chooses the number of mappers, and it bases it on
> > > > input size. Generally it will not assign less than one worker per
> > > > chunk, and a chunk is usually 64MB (still, I believe). You can
> > > > override this directly (well, at least, register a suggestion to
> > > > Hadoop). I would tell you the exact flag but I'm not next to my
> > > > computer. In older Hadoop versions it was -Dmapred.map.tasks=N I
> > > > believe; in newer versions it's different, perhaps
> > > > -Dmapreduce.map.tasks=N. That's what you're looking for to start.
> > > > There are other ways to influence this, like the minimum split
> > > > size, but try this first.
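The chunk arithmetic above also explains the behavior Jason saw: with a ~64 MB default split and only ~31 MB of input, Hadoop has little reason to create more than one or two map tasks. A back-of-the-envelope check, assuming the 64 MB default:

```python
import math

# Default split size in Hadoop of this era (assumption): 64 MB.
DEFAULT_SPLIT_BYTES = 64 * 1024 * 1024

input_bytes = 31_000_000  # ~31 MB dataset from this thread
mappers = math.ceil(input_bytes / DEFAULT_SPLIT_BYTES)
print(mappers)  # -> 1
```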
> > > >
> > > > On Thu, Mar 29, 2012 at 9:59 PM, Jason L Shaw <jlshaw@uw.edu> wrote:
> > > >
> > > > > I have a dataset that is not terribly large (~31 MB on disk in
> > > > > plaintext, ~145,000 records with 26 fields).  I am trying to build
> > > > > random forests over the data, but the process is quite slow.  It
> > > > > takes about half an hour to build 100 trees using the partial
> > > > > implementation. (I didn't realize I didn't need it.)
> > > > >
> > > > > I tried switching to the in-memory implementation so that the trees
> > > > > would be built in parallel.  I have access to a cluster with about
> > > > > 15 nodes which can support up to 130 mappers.  It seems to me that
> > > > > I ought to be able to build 100 trees all at once and be done in
> > > > > less than a minute (for the building phase, anyway).  However, the
> > > > > job only generated 2 mappers, each building 50 trees, and it took
> > > > > 15 minutes.  I tried again with 500 trees, but again only 2 mappers
> > > > > were started.
> > > > >
> > > > > Is there any way I can convince Hadoop to start up more mappers and
> > > > > load the data more times?  I'm not that familiar with Hadoop, but
> > > > > from what I've read, the number of mappers doesn't seem very
> > > > > configurable.  Memory is not a concern. (Typically 72 GB or more
> > > > > available.)
> > > > >
> > > > > Thanks,
> > > > > Jason
> > > > >
> > > >
> > >
> >
>
