hadoop-mapreduce-user mailing list archives

From Moustafa Gaber <moustafa.ga...@gmail.com>
Subject Re: Mapping one key per Map Task
Date Tue, 24 May 2011 04:06:23 GMT
I don't think you need to split your input file so that each map task is
assigned one key. Your goal is load balancing. Each of your map tasks
will initiate a new MR sub-job, and that sub-job is assigned its own
master/workers, which means the sub-job's map tasks may be scheduled on
machines other than the ones your parent map tasks are running on.
Therefore, you can still achieve load balancing without splitting your
input file into one split per key.
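
Very roughly, and untested, this is the shape of what I mean: a parent
mapper that launches one sub-job per key and waits for it. The paths,
job name, and worker classes below are placeholders (I am assuming Text
keys/values and a per-key input directory layout), so treat it as a
sketch, not a working driver:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Sketch only: assumes Text keys/values and one input directory per key.
public class PerKeyDriverMapper extends Mapper<Text, Text, Text, Text> {

  @Override
  protected void map(Text key, Text value, Context context)
      throws IOException, InterruptedException {
    // Base the sub-job's configuration on the parent task's configuration.
    Configuration conf = new Configuration(context.getConfiguration());
    Job subJob = new Job(conf, "sub-job-for-" + key);
    subJob.setJarByClass(PerKeyDriverMapper.class);
    // Placeholders: substitute your real worker mapper/reducer classes.
    subJob.setMapperClass(Mapper.class);
    subJob.setReducerClass(Reducer.class);
    subJob.setOutputKeyClass(Text.class);
    subJob.setOutputValueClass(Text.class);
    // Hypothetical layout: one input and one output directory per key.
    FileInputFormat.addInputPath(subJob, new Path("/data/in/" + key));
    FileOutputFormat.setOutputPath(subJob, new Path("/data/out/" + key));
    try {
      if (!subJob.waitForCompletion(false)) {
        throw new IOException("Sub-job for key " + key + " failed");
      }
    } catch (ClassNotFoundException e) {
      throw new IOException(e);
    }
    // Emit the sub-job's output location so a final job can merge results.
    context.write(key, new Text("/data/out/" + key));
  }
}

Note that the parent map task just blocks while its sub-job runs, so it
occupies a slot for the whole duration; that is one more reason the
parent split count is not what determines your load balance.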

On Mon, May 23, 2011 at 1:02 PM, Vincent Xue <xue.vin@gmail.com> wrote:

> Thanks for the suggestions!
>
> On Mon, May 23, 2011 at 5:50 PM, Harsh J <harsh@cloudera.com> wrote:
> > Vincent,
> >
> > You _might_ lose locality by splitting beyond the block splits, and
> > the tasks, although better 'parallelized', may only end up performing
> > worse. A good way to instead increase the task count is to go the
> > block-size route (a lower block size gives more splits at the cost of
> > a little extra NN space). After all, block sizes are per-file
> > properties.
> >
> > But if you really want to go the per-record way, an NLine-like
> > implementation for SequenceFiles, using what Joey and Jason have
> > pointed out, would be the best way. (NLineInputFormat doesn't cover
> > SequenceFiles directly - it's implemented with a LineReader.)
> > [A rough sketch of such an InputFormat follows after the quoted thread.]
> >
> > On Mon, May 23, 2011 at 2:39 PM, Vincent Xue <xue.vin@gmail.com> wrote:
> >> Hello Hadoop Users,
> >>
> >> I would like to know if anyone has ever tried splitting an input
> >> sequence file by key instead of by size. I know that this is unusual
> >> for the MapReduce paradigm, but I am in a situation where I need to
> >> perform some large tasks on each key-value pair in a load-balanced
> >> fashion.
> >>
> >> To describe the situation in more detail:
> >> I have one sequence file of 2000 key-value pairs. I want to distribute
> >> each key-value pair to a map task, where it will perform a series of
> >> Map/Reduce tasks. This means that the map task is calling a series of
> >> Jobs. Once the Jobs in each task are complete, I want to reduce all of
> >> the output into one sequence file.
> >>
> >> I am stuck because I am limited by the number of splits the sequence
> >> file is divided into. Hadoop only splits my sequence file into 80 map
> >> tasks, when I can run around 250 map tasks on my cluster. This
> >> means that I am not fully utilizing my cluster, and my job will not
> >> scale.
> >>
> >> Can anyone shed some light on this problem? I have tried looking at
> >> the InputFormats, but I am not sure whether that is where I should
> >> continue looking.
> >>
> >> Best Regards
> >> Vincent
> >>
> >
> >
> >
> > --
> > Harsh J
> >
>
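
For completeness, if you do want the per-record route Harsh describes
above, a rough (untested) sketch of such an InputFormat could look like
the one below. The property name example.target.split.bytes and its
default are made up for illustration. Because SequenceFileRecordReader
seeks forward to the next sync marker after a split's start offset,
arbitrary byte boundaries are safe, although very small sizes can yield
empty splits. Simply lowering the maximum split size
(mapred.max.split.size in current releases) may be enough on its own,
without any custom code:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;

// Sketch only: subdivides the usual block-based splits to get more map tasks.
public class FineGrainedSequenceFileInputFormat<K, V>
    extends SequenceFileInputFormat<K, V> {

  @Override
  public List<InputSplit> getSplits(JobContext job) throws IOException {
    long targetBytes = job.getConfiguration()
        .getLong("example.target.split.bytes", 64L * 1024); // made-up knob
    List<InputSplit> fine = new ArrayList<InputSplit>();
    for (InputSplit split : super.getSplits(job)) {
      FileSplit fs = (FileSplit) split;
      long offset = fs.getStart();
      long remaining = fs.getLength();
      while (remaining > 0) {
        long len = Math.min(targetBytes, remaining);
        // Reuse the parent split's location hints to keep some locality.
        fine.add(new FileSplit(fs.getPath(), offset, len, fs.getLocations()));
        offset += len;
        remaining -= len;
      }
    }
    return fine;
  }
}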



-- 
Best Regards,
Mostafa Ead
