hadoop-common-user mailing list archives

From Aaron Kimball <aa...@cloudera.com>
Subject Re: Maps running - how to increase?
Date Thu, 06 Aug 2009 17:59:22 GMT
Is that setting in the hadoop-site.xml file on every node? Each tasktracker
reads in that file once and sets its max map tasks from that. There's no way
to control this setting on a per-job basis or from the client (submitting)
system. If you've changed hadoop-site.xml after starting the tasktracker,
you need to restart the tasktracker daemon on each node.
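For reference, the setting being discussed lives in hadoop-site.xml on each tasktracker node and might look something like this (the value 32 here is just the number from the job in question, not a recommendation):

```xml
<!-- conf/hadoop-site.xml on every tasktracker node. The tasktracker reads
     this once at startup, so restart the daemon after changing it. -->
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>32</value>
</property>
```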

Note that 32 maps/node is considered a *lot*. This will likely not provide
you with optimal throughput, since they'll be competing for cores, RAM, I/O,
etc., unless you've got some really super-charged machines in your
datacenter :grin:

Also, in terms of optimizing your job -- do you really have 6,000 big files
worth reading? Or are you running a job over 6,000 small files (where small
means less than 100 MB or so)? If the latter, consider using
MultiFileInputFormat to allow each task to operate on multiple files. See
http://www.cloudera.com/blog/2009/02/02/the-small-files-problem/ for some
more detail. Even after all 6,000 map tasks run, you'll have to deal with
reassembling 6,000 intermediate data shards into 6 or 12 reduce tasks. This
will also be slow, unless you bunch up multiple files into a single task.
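To see why bunching files matters, here's a back-of-the-envelope sketch (the node count is hypothetical; the other numbers come from this thread) of how the per-node map limit bounds concurrency and how many scheduling "waves" 6,000 tasks would take:

```java
// Back-of-the-envelope arithmetic, not Hadoop code. Assumes a 10-node
// cluster (hypothetical) with the per-node limit from this thread.
public class MapWaves {
    public static void main(String[] args) {
        int nodes = 10;           // hypothetical cluster size
        int maxMapsPerNode = 32;  // mapred.tasktracker.map.tasks.maximum
        int mapTasks = 6000;      // one map per file with a per-file input format

        int slots = nodes * maxMapsPerNode;           // concurrent map capacity
        int waves = (mapTasks + slots - 1) / slots;   // ceiling division

        // With these numbers: 320 concurrent maps, 19 waves of tasks.
        System.out.println(slots + " concurrent maps, " + waves + " waves");
    }
}
```

Bunching ~10 small files per task would cut the task count (and per-task startup overhead) by an order of magnitude, which is the point of MultiFileInputFormat.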

Cheers,
- Aaron


On Wed, Aug 5, 2009 at 5:06 PM, Zeev Milin <zeevmisc@gmail.com> wrote:

> I now see that the mapred.tasktracker.map.tasks.maximum=32 on the job level
> and still only 6 maps running and 5000+ pending..
>
> Not sure how to force the cluster to run more maps.
>
