hadoop-mapreduce-user mailing list archives

From Joey Echeverria <j...@cloudera.com>
Subject Re: Mapping one key per Map Task
Date Mon, 23 May 2011 12:12:11 GMT
Look at getSplits() of SequenceFileInputFormat.
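[Editor's note: in the old `org.apache.hadoop.mapred` API, `FileInputFormat.getSplits(JobConf, int)` takes a split-count hint and sizes splits as roughly total input size divided by that hint, bounded below by the configured minimum split size. One way to get more map tasks is therefore to override `getSplits()` with a larger hint. A rough, untested sketch; the class name `OneKeyPerSplitInputFormat` and the `Text` key/value types are assumptions, not from the thread:]

```java
import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileInputFormat;

// Hypothetical subclass: ask FileInputFormat for many small splits so
// that, ideally, each split covers only one or a few records.
public class OneKeyPerSplitInputFormat
        extends SequenceFileInputFormat<Text, Text> {

    @Override
    public InputSplit[] getSplits(JobConf job, int numSplits)
            throws IOException {
        // FileInputFormat sizes splits as roughly totalSize / numSplits,
        // so a larger hint yields more, smaller splits. 2000 matches the
        // number of key-value pairs in the question below.
        return super.getSplits(job, 2000);
    }
}
```

Note that splits in a sequence file can only start at sync points, so the actual number of splits returned may still be lower than the hint; it is worth inspecting what `getSplits()` returns before relying on this.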

On May 23, 2011 5:09 AM, "Vincent Xue" <xue.vin@gmail.com> wrote:
> Hello Hadoop Users,
> I would like to know if anyone has ever tried splitting an input
> sequence file by key instead of by size. I know this is unusual for
> the MapReduce paradigm, but I am in a situation where I need to
> perform some large tasks on each key-value pair in a load-balancing
> fashion.
> To describe in more detail:
> I have one sequence file of 2000 key-value pairs. I want to distribute
> each key-value pair to a map task, where it will run a series of
> MapReduce jobs (that is, each map task launches jobs of its own). Once
> the jobs in each task are complete, I want to reduce all of the output
> into one sequence file.
> I am stuck because I am limited by the number of splits the sequence
> file is divided into. Hadoop only splits my sequence file into 80 map
> tasks, while my cluster can run around 250 map tasks. This means I am
> not fully utilizing my cluster, and my job will not scale.
> Can anyone shed some light on this problem? I have looked at the
> InputFormats, but I am not sure whether that is where I should keep
> looking.
> Best Regards
> Vincent
