hadoop-mapreduce-user mailing list archives

From Vincent Xue <xue....@gmail.com>
Subject Mapping one key per Map Task
Date Mon, 23 May 2011 09:09:12 GMT
Hello Hadoop Users,

I would like to know if anyone has tried splitting an input sequence
file by key instead of by size. I know this is unusual for the
MapReduce paradigm, but I am in a situation where I need to perform
some large tasks on each key-value pair in a load-balancing fashion.

To describe it in more detail:
I have one sequence file of 2000 key-value pairs. I want to distribute
each key-value pair to its own map task, where it will launch a series
of MapReduce jobs. This means that each map task is submitting a series
of jobs. Once the jobs in each task are complete, I want to reduce all
of the output into one sequence file.

I am stuck because I am limited by the number of splits the sequence
file is divided into. Hadoop only splits my sequence file into 80 map
tasks, while my cluster can run around 250 map tasks concurrently.
This means that I am not fully utilizing my cluster, and my job will
not scale.

Can anyone shed some light on this problem? I have tried looking at
the InputFormats, but I am not sure if that is where I should continue
looking.
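In case it helps frame the question, here is a rough, untested sketch of
what I had in mind: subclass SequenceFileInputFormat and override
getSplits() to emit one FileSplit per record, so that every key-value
pair goes to its own map task. The class name is just illustrative, and
I am not sure this is the right approach (in particular, sequence file
readers re-sync on sync marks, so the split boundaries may not behave as
I hope unless the file is written with frequent sync marks):

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.util.ReflectionUtils;

// Illustrative sketch only: one FileSplit per record, so each key-value
// pair should be handed to a separate map task.
public class OneRecordPerSplitInputFormat<K, V>
        extends SequenceFileInputFormat<K, V> {

    @Override
    public List<InputSplit> getSplits(JobContext job) throws IOException {
        Configuration conf = job.getConfiguration();
        List<InputSplit> splits = new ArrayList<InputSplit>();
        for (FileStatus status : listStatus(job)) {
            Path path = status.getPath();
            SequenceFile.Reader reader = new SequenceFile.Reader(
                    path.getFileSystem(conf), path, conf);
            try {
                // Instantiate a key of the file's actual key class.
                Writable key = (Writable) ReflectionUtils.newInstance(
                        reader.getKeyClass(), conf);
                long start = reader.getPosition();
                while (reader.next(key)) {
                    long end = reader.getPosition();
                    // Caveat: record readers skip to the next sync mark
                    // within their split, so unless these boundaries fall
                    // on sync points, records could be skipped or read
                    // twice. This may need per-record sync marks when
                    // writing the file.
                    splits.add(new FileSplit(path, start, end - start, null));
                    start = end;
                }
            } finally {
                reader.close();
            }
        }
        return splits;
    }
}
```

Is this a reasonable direction, or is there a standard way to get one
record per map task from a sequence file?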

Best Regards
Vincent
