hadoop-common-user mailing list archives

From "Khalil Honsali" <k.hons...@gmail.com>
Subject Re: setting # of maps for a job
Date Wed, 23 Jan 2008 06:22:59 GMT
Hi,

I am experiencing a similar problem: even after varying the block size, the
split size and the number of map tasks in both the API and hadoop-site.xml,
the job ran with 8 map tasks instead of the expected 20 on a 4-node cluster.
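
For reference, here is roughly the job setup I tried on the API side (a
sketch of my own code with placeholder names and sizes, not a
recommendation):

  import org.apache.hadoop.mapred.JobClient;
  import org.apache.hadoop.mapred.JobConf;

  // old mapred API as of 0.15
  JobConf conf = new JobConf(MyJob.class);      // MyJob is just a placeholder
  conf.setNumMapTasks(20);                      // apparently only a hint, see below
  conf.set("mapred.min.split.size", "1048576"); // minimum split size, in bytes
  conf.set("dfs.block.size", "16777216");       // only affects files written afterwards
  // input/output paths, mapper and reducer setup omitted
  JobClient.runJob(conf);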

I am working with text files; there is an existing thread about this where
the suggested solution is to zip the files so that a single zip is much
larger than a block:
http://www.mail-archive.com/hadoop-user@lucene.apache.org/msg02836.html

However, I still don't understand two issues:
- what is the relationship between the number of files, file size, block
size, split size and the number of map tasks? (my current understanding of
the arithmetic is sketched after this list)
- what if I want to serve the text files directly from HDFS over HTTP? I
don't want to zip and unzip them each time, right? How should Hadoop be
configured so that it works well with many small files directly (or is it
simply not designed for that)?
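
On the first question, this is my reading of FileInputFormat.getSplits() in
the 0.15 source, so please treat the details as approximate rather than
authoritative:

  // simplified from org.apache.hadoop.mapred.FileInputFormat.getSplits()
  long goalSize  = totalSize / (requestedMaps == 0 ? 1 : requestedMaps);
  long splitSize = Math.max(minSplitSize, Math.min(goalSize, blockSize));
  // every file contributes at least one split, and a file of length L is
  // cut into roughly L / splitSize pieces, so many small files mean many
  // maps regardless of the requested number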

Finally, I wonder whether it would be useful to have a tool that estimates
the optimal settings from the workload parameters, instead of relying on
manual trial and error.


thanks very much!


On 23/01/2008, Ted Dunning <tdunning@veoh.com> wrote:
>
>
>
> Setting the number of maps lower than would otherwise be used is useful if
> you have a job that should not clog up the cluster.  If you don't need it
> to run quickly, then you can set m = N / 5 or so and get slow progress
> with small impact on the throughput of the cluster.
>
> If and when HADOOP-2573 gets resolved, there will be a much better
> answer for this.
>
>
> On 1/22/08 8:01 PM, "Amar Kamat" <amarrk@yahoo-inc.com> wrote:
>
> > Hi,
> > You can't directly control the number of maps. It is based on the splits
> > of the data residing on the DFS. The number one provides via the
> > command line, code or the conf files is only a hint to Hadoop. I guess
> > the reason is that if the #maps (set externally) is less than the
> > #splits, we might end up migrating data, which is a performance hit.
> > There could be other reasons too.
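> > For what it's worth, the hook where that hint ends up (if I read the
> > 0.15 source correctly) is the InputFormat interface:
> >
> >   // org.apache.hadoop.mapred.InputFormat, old API
> >   InputSplit[] getSplits(JobConf job, int numSplits) throws IOException;
> >
> > numSplits carries the externally requested number of maps, but each
> > InputFormat is free to return more or fewer splits, and one map task is
> > scheduled per returned split.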
> > Amar
> > Stefan Groschupf wrote:
> >> Hi,
> >> I have trouble setting the number of maps for a job with version 15.1.
> >> As far as I understand, I can configure the number of maps for a job
> >> in the hadoop-site.xml on the box where I submit the job (which is
> >> not the jobtracker box).
> >> However, my configuration is always ignored. Changing the value in
> >> the hadoop-site.xml on the jobtracker box and restarting the nodes does
> >> not help either.
> >> I am also not setting the number via the API.
> >> Any ideas what I might be overlooking?
> >> Thanks for any hints,
> >> Stefan
> >>
> >>
> >> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> >> 101tec Inc.
> >> Menlo Park, California, USA
> >> http://www.101tec.com
> >>
> >>
> >
>
>

