hadoop-mapreduce-user mailing list archives

From Harsh J <ha...@cloudera.com>
Subject Re: Mapping one key per Map Task
Date Mon, 23 May 2011 16:50:46 GMT

You _might_ lose data locality by splitting beyond the block boundaries, and
the tasks, although better 'parallelized', may only end up performing
worse. A good way to increase the task count instead is to go the
block-size route (a lower block size yields more splits, at the cost of a
little extra NN space). After all, block size is a per-file property.
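To see why lowering the block size raises the task count, here is a minimal sketch of the arithmetic, assuming the usual FileInputFormat behavior of one split per block (the class and method names below are illustrative, not Hadoop API):

```java
// Sketch: how split (and hence map task) count scales with block size,
// assuming one split per HDFS block. Names are illustrative only.
public class SplitCountSketch {
    // Number of input splits for a file, assuming splits align with blocks.
    static long splitCount(long fileBytes, long blockBytes) {
        return (fileBytes + blockBytes - 1) / blockBytes; // ceiling division
    }

    public static void main(String[] args) {
        long fileBytes = 10L * 1024 * 1024 * 1024; // e.g. a 10 GB sequence file
        // With a 64 MB block size:
        System.out.println(splitCount(fileBytes, 64L * 1024 * 1024)); // 160 splits
        // Halving the block size doubles the map tasks:
        System.out.println(splitCount(fileBytes, 32L * 1024 * 1024)); // 320 splits
    }
}
```

The trade-off mentioned above is visible here: every extra split is an extra block entry the NameNode has to track.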

But if you really want to go the per-record way, an NLine-like
implementation for SequenceFiles, using what Joey and Jason have
pointed out, would be the best approach. (NLineInputFormat doesn't cover
SequenceFiles directly - it's implemented with a LineReader.)
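The core of such an NLine-style InputFormat is grouping record boundaries into one split per N records. A real implementation would walk the file with SequenceFile.Reader in getSplits(), note the byte offset before every Nth record, and emit a FileSplit per range; the sketch below, with illustrative names, shows only that grouping logic in plain Java, taking the record offsets as given:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the split-grouping logic an NLine-style InputFormat for
// SequenceFiles would need. recordOffsets holds the starting byte offset
// of each record (in a real InputFormat these come from scanning the file
// with SequenceFile.Reader). All names here are illustrative.
public class NLineSplitSketch {
    // Returns one {start, length} byte range per split, covering
    // recordsPerSplit records each (the last split may hold fewer).
    static long[][] splitsForOffsets(List<Long> recordOffsets,
                                     long fileLength, int recordsPerSplit) {
        List<long[]> splits = new ArrayList<>();
        for (int i = 0; i < recordOffsets.size(); i += recordsPerSplit) {
            long start = recordOffsets.get(i);
            int endIdx = i + recordsPerSplit;
            long end = endIdx < recordOffsets.size()
                    ? recordOffsets.get(endIdx)   // start of the next group
                    : fileLength;                 // last split runs to EOF
            splits.add(new long[] {start, end - start});
        }
        return splits.toArray(new long[0][]);
    }
}
```

With recordsPerSplit = 1 every record gets its own map task, so a 2000-record file would yield 2000 splits regardless of block layout.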

On Mon, May 23, 2011 at 2:39 PM, Vincent Xue <xue.vin@gmail.com> wrote:
> Hello Hadoop Users,
> I would like to know if anyone has ever tried splitting an input
> sequence file by key instead of by size. I know this is unusual
> for the MapReduce paradigm, but I am in a situation where I need to
> perform some large tasks on each key pair in a load-balancing
> fashion.
> To describe in more detail:
> I have one sequence file of 2000 key-value pairs. I want to distribute
> each key-value pair to a map task, where it will perform a series of
> Map/Reduce tasks. This means that the map task is calling a series of
> jobs. Once the jobs in each task are complete, I want to reduce all of
> the output into one sequence file.
> I am stuck because I am limited by the number of splits my sequence
> file is divided into. Hadoop only splits my sequence file into 80 map
> tasks, when I can run around 250 map tasks on my cluster. This
> means I am not fully utilizing my cluster, and my job will not
> scale.
> Can anyone shed some light on this problem? I have tried looking at
> the InputFormats, but I am not sure if this is where I should continue
> looking.
> Best Regards
> Vincent

Harsh J
