hadoop-common-user mailing list archives

From Jason Venner <jason.had...@gmail.com>
Subject Re: Processing 10MB files in Hadoop
Date Thu, 26 Nov 2009 18:14:04 GMT
Are the record-processing steps bound by a local machine resource (CPU,
disk I/O, or other)?

What I often do when I have lots of small files to handle is use
NLineInputFormat, since in that case data locality for the input files is a
much lesser issue than short task run times. Each line of my input file
names one of the small files, and then I set the number of lines (files)
per split to some reasonable number, as in the sketch below.
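
A minimal driver sketch using the old org.apache.hadoop.mapred API (the one
current as of this thread); the class name SmallFilesDriver, the listing
file file-list.txt, the output dir "out", and the 10-lines-per-split value
are all just illustrative:

  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.mapred.FileInputFormat;
  import org.apache.hadoop.mapred.FileOutputFormat;
  import org.apache.hadoop.mapred.JobClient;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.lib.NLineInputFormat;

  public class SmallFilesDriver {
    public static void main(String[] args) throws Exception {
      JobConf conf = new JobConf(SmallFilesDriver.class);
      conf.setJobName("small-files");
      // The job input is a text file listing one small-file path per
      // line; the mapper opens and processes each listed file itself.
      conf.setInputFormat(NLineInputFormat.class);
      // Hand each map task 10 lines, i.e. 10 small files per split.
      conf.setInt("mapred.line.input.format.linespermap", 10);
      FileInputFormat.setInputPaths(conf, new Path("file-list.txt"));
      FileOutputFormat.setOutputPath(conf, new Path("out"));
      JobClient.runJob(conf);
    }
  }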

If the individual record processing is not bound by local resources, you
may wish to try the MultithreadedMapRunner, which gives you a lot of
flexibility in how many map executions you run in parallel, without needing
to restart your cluster to change the tasks per tracker.
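
Enabling it is two lines on the same JobConf; the thread count of 8 is just
an assumed value to tune, and note that your Mapper implementation must be
thread-safe:

  // Run map() calls in parallel threads inside each task JVM.
  conf.setMapRunnerClass(
      org.apache.hadoop.mapred.lib.MultithreadedMapRunner.class);
  // Concurrent map() invocations per task.
  conf.setInt("mapred.map.multithreadedrunner.threads", 8);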


On Thu, Nov 26, 2009 at 8:05 AM, Jeff Zhang <zjffdu@gmail.com> wrote:

> Quote from the wiki doc
>
> *The number of map tasks can also be increased manually using the
> JobConf<http://wiki.apache.org/hadoop/JobConf>'s conf.setNumMapTasks(int
> num). This can be used to increase the number of map tasks, but will not
> set the number below that which Hadoop determines via splitting the input
> data.*
>
> So the number of map tasks is determined by the InputFormat.
> But you can manually set the number of reduce tasks to improve
> performance, because the default number of reduce tasks is 1; see the
> sketch below.
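>
> As a minimal sketch (the job class MyJob and the value 20 are
> placeholders; pick a reduce count that fits the cluster):
>
>   JobConf conf = new JobConf(MyJob.class);
>   // The default is a single reduce task; raise it so the reduce
>   // phase is spread across the cluster.
>   conf.setNumReduceTasks(20);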
>
>
> Jeff Zhang
>
> On Thu, Nov 26, 2009 at 7:58 AM, CubicDesign <cubicdesign@gmail.com>
> wrote:
>
> > But the documentation DOES recommend setting it:
> > http://wiki.apache.org/hadoop/HowManyMapsAndReduces
> >
> >
> >
> > PS: I am using streaming
> >
> >
> >
> >
> > Jeff Zhang wrote:
> >
> >> Actually, you do not need to set the number of map tasks; the
> >> InputFormat will compute it for you according to your input data set.
> >>
> >> Jeff Zhang
> >>
> >>
> >> On Thu, Nov 26, 2009 at 7:39 AM, CubicDesign <cubicdesign@gmail.com>
> >> wrote:
> >>
> >>
> >>
> >>>  The number of mappers is determined by your InputFormat.
> >>>
> >>>
> >>>> In the common case, if a file is smaller than one block (which is
> >>>> 64 MB by default), there is one mapper for that file. If a file is
> >>>> larger than one block, Hadoop will split it, and the number of
> >>>> mappers for that file will be ceiling((size of file) / (size of
> >>>> block)).
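> >>>>
> >>>> For example (just arithmetic, using the default block size): a
> >>>> 200 MB file gets ceiling(200/64) = 4 mappers, while a 10 MB file
> >>>> gets only 1.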
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>> Hi
> >>>
> >>> Do you mean I should set the number of map tasks to 1????
> >>> I want to process this file not on a single node but over the entire
> >>> cluster. I need a lot of processing power in order to finish the job
> >>> in hours instead of days.
> >>>
> >>>
> >>>
> >>
> >>
> >>
> >
>



-- 
Pro Hadoop, a book to guide you from beginner to hadoop mastery,
http://www.amazon.com/dp/1430219424?tag=jewlerymall
www.prohadoopbook.com a community for Hadoop Professionals
