hadoop-common-user mailing list archives

From "Khalil Honsali" <k.hons...@gmail.com>
Subject Re: setting # of maps for a job
Date Thu, 24 Jan 2008 06:42:52 GMT
Thanks, all.

What I am trying to do is have a set of text files (say, Gutenberg books)
stored on HDFS, processed by Hadoop, and servable over the web by httpd. So I
don't want the files compressed (or do I?). Now I am writing a custom
MultiFileInputFormat to process a customizable number of files at once (it
seems this also requires a suitable RecordReader?).
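
(A sketch of the RecordReader side, assuming the old org.apache.hadoop.mapred
API with one file per split; exact generic signatures varied slightly across
the 0.1x releases, and the class name and key/value choice here are
illustrative, not an existing Hadoop class:)

    import java.io.IOException;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileSplit;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.RecordReader;

    // Emits one record per file: key = file path, value = whole contents.
    public class WholeFileRecordReader implements RecordReader<Text, Text> {
        private final FileSplit split;
        private final JobConf conf;
        private boolean done = false;

        public WholeFileRecordReader(FileSplit split, JobConf conf) {
            this.split = split;
            this.conf = conf;
        }

        public boolean next(Text key, Text value) throws IOException {
            if (done) return false;
            Path file = split.getPath();
            FileSystem fs = file.getFileSystem(conf);
            byte[] contents = new byte[(int) split.getLength()];
            FSDataInputStream in = fs.open(file);
            try {
                in.readFully(0, contents);  // slurp the entire file
            } finally {
                in.close();
            }
            key.set(file.toString());
            value.set(contents, 0, contents.length);  // assumes UTF-8 text
            done = true;
            return true;
        }

        public Text createKey()    { return new Text(); }
        public Text createValue()  { return new Text(); }
        public long getPos()       { return done ? split.getLength() : 0; }
        public float getProgress() { return done ? 1.0f : 0.0f; }
        public void close()        { }
    }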

I have already set tasktracker.map.tasks to a maximum of 20, and still the
number of "running" map tasks (i.e., running concurrently?) on the jobtracker
web info page is 8, that is, 2 per node on a 4-node cluster, though it should
be 4*20 = 80, right? That should hold regardless of how many maps are pending
or total (i.e., regardless of file splits / block size). Also, I noticed that
although the web page shows 2 "running" tasks, jps actually reports 4
tasktracker child processes per node.
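
(For reference, the per-node concurrency cap is a tasktracker setting, not a
job setting. A minimal hadoop-site.xml sketch, assuming the 0.15-era property
name mapred.tasktracker.tasks.maximum, which must be set on each tasktracker
node before restarting it:)

    <property>
      <name>mapred.tasktracker.tasks.maximum</name>
      <!-- Max tasks run simultaneously by one tasktracker; the shipped
           default of 2 matches the "2 per node" observed above. -->
      <value>20</value>
    </property>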


K. Honsali

On 24/01/2008, Amar Kamat <amarrk@yahoo-inc.com> wrote:
>
> Khalil Honsali wrote:
> > Hi,
> >
> > I am experiencing a similar problem, even after varying [blocksize],
> > [splitsize], and [num map tasks] both via the API and in hadoop-site.xml;
> > the number of map tasks was 8 instead of the expected 20 on a 4-node
> > cluster.
> >
> > I am working with text files; there is an issue about this where the
> > suggested solution is to zip the files so that a single zip is much
> > larger than a block:
> > http://www.mail-archive.com/hadoop-user@lucene.apache.org/msg02836.html
> >
> > However, I still don't understand two issues:
> > - what is the relationship between num files, file size, block size,
> > split size, and num map tasks.
> >
> The only things that matter are the block size and the split size. A block
> is the basic storage unit on the DFS, while splits form the basic unit of
> work for maps. An input file is split into smaller chunks, each of block
> size, and stored on the DFS. In an InputFormat you define what a split is
> and hence determine the total number of maps. You can provide your own
> InputFormat and thereby control the maps. For example, when I wanted to
> write code for inverted indexing, I wrote an InputFormat that treats a file
> as a non-splittable entity that should be processed as a whole. In that
> case #maps = the number of files in my input directory.
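>
> (A minimal sketch of such a non-splitting InputFormat, using the old
> org.apache.hadoop.mapred API; the class name is illustrative:)
>
>     import org.apache.hadoop.fs.FileSystem;
>     import org.apache.hadoop.fs.Path;
>     import org.apache.hadoop.mapred.TextInputFormat;
>
>     // Refuse to split any file: the framework then creates exactly
>     // one map task per input file, so #maps == #files.
>     public class WholeFileTextInputFormat extends TextInputFormat {
>         protected boolean isSplitable(FileSystem fs, Path file) {
>             return false;
>         }
>     }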
> > - what if I wanted to serve the text files directly from HDFS over
> > HTTP? I don't want to zip and unzip them each time, right? How do I
> > configure hadoop so that it works best with small files directly (maybe
> > it's not designed for that?)
> >
> >
> What exactly are you trying to achieve?
> > Finally, I wonder if it would be useful to have a tool for estimating
> > optimum performance based on the workload parameters, instead of manual
> > trial and error.
> >
> >
> no idea
> Amar
> > thanks very much!
> >
> >
> > On 23/01/2008, Ted Dunning <tdunning@veoh.com> wrote:
> >
> >>
> >> Setting the number of maps lower than would otherwise be used is
> >> useful if you have a job that should not clog up the cluster.  If you
> >> don't need it to run quickly, then you can set m = N / 5 or so and get
> >> slow progress with small impact on the throughput of the cluster.
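> >>
> >> (A sketch of that throttling idea via the old JobConf API; the class
> >> name, the block count, and the 5x divisor are illustrative, and the
> >> value is only a hint to the framework, as Amar notes below:)
> >>
> >>     import org.apache.hadoop.mapred.JobConf;
> >>
> >>     public class ThrottledJobSetup {
> >>         public static JobConf configure(int numInputBlocks) {
> >>             JobConf conf = new JobConf(ThrottledJobSetup.class);
> >>             // m = N / 5: request a fifth as many maps as input
> >>             // blocks; Hadoop may still run one map per split.
> >>             conf.setNumMapTasks(numInputBlocks / 5);
> >>             return conf;
> >>         }
> >>     }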
> >>
> >> If and when HADOOP-2573 gets resolved, there will be a much better
> >> answer for this.
> >>
> >>
> >> On 1/22/08 8:01 PM, "Amar Kamat" <amarrk@yahoo-inc.com> wrote:
> >>
> >>
> >>> Hi,
> >>> You can't directly control the number of maps. It is based on the
> >>> splits of the data residing on the DFS. The numbers one provides via
> >>> the command line, code, or the conf files are hints to Hadoop. I guess
> >>> the reason is that if the #maps (set externally) is less than the
> >>> #splits, we might end up migrating the data, which is a performance
> >>> hit. There could be other reasons too.
> >>> Amar
> >>> Stefan Groschupf wrote:
> >>>
> >>>> Hi,
> >>>> I have trouble setting the number of maps for a job with version 15.1.
> >>>> As far as I understand, I can configure the number of maps that a job
> >>>> will run in a hadoop-site.xml on the box where I submit the job (that
> >>>> is not the jobtracker box).
> >>>> However, my configuration is always ignored. Changing the value in the
> >>>> hadoop-site on the jobtracker box and restarting the nodes does not
> >>>> help either.
> >>>> I also do not set the number via the API.
> >>>> Any ideas where I might be overlooking something?
> >>>> Thanks for any hints,
> >>>> Stefan
> >>>>
> >>>>
> >>>> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> >>>> 101tec Inc.
> >>>> Menlo Park, California, USA
> >>>> http://www.101tec.com
> >>>>
> >>>>
> >>>>
