hadoop-mapreduce-issues mailing list archives

From "Aaron Kimball (JIRA)" <j...@apache.org>
Subject [jira] Commented: (MAPREDUCE-1216) MRUnit Should Sort Reduce Input
Date Sat, 28 Nov 2009 00:43:20 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-1216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783190#action_12783190 ]

Aaron Kimball commented on MAPREDUCE-1216:
------------------------------------------

I'm not sure I follow your logic. Hadoop does not define a sort order for the values
associated with a key, so MRUnit should not impose one either. The keys delivered to a
reducer, however, are sorted.

In MRUnit a ReduceDriver can only handle a single input key, but the MapReduceDriver will
allow you to pass an arbitrary set of key-value pairs (bounded by memory) from the mapper
to the reducer. The keys forwarded from the mapper to the reducer via this driver will be
sorted.
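
To make the distinction concrete, here is a minimal plain-Java sketch of that grouping behaviour. It is not MRUnit's or Hadoop's actual implementation; the class name ShuffleSketch and the use of String keys are assumptions made for illustration. Keys come out in sorted order, while the values for each key simply stay in whatever order they arrived, which is only one of many orders a real job might produce.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical model of a shuffle: sorts keys, leaves values unsorted.
public class ShuffleSketch {

  /** Groups {key, value} pairs by key. Keys are sorted; values keep arrival order. */
  public static Map<String, List<String>> shuffle(List<String[]> pairs) {
    // TreeMap iterates its keys in sorted order (lexicographic for String).
    Map<String, List<String>> grouped = new TreeMap<>();
    for (String[] pair : pairs) {
      grouped.computeIfAbsent(pair[0], k -> new ArrayList<>()).add(pair[1]);
    }
    return grouped;
  }

  public static void main(String[] args) {
    List<String[]> mapOutput = new ArrayList<>();
    mapOutput.add(new String[] {"46", "x"});
    mapOutput.add(new String[] {"44", "w"});
    mapOutput.add(new String[] {"44", "d"});
    mapOutput.add(new String[] {"46", "c"});
    // Prints {44=[w, d], 46=[x, c]}: keys sorted, values not.
    System.out.println(shuffle(mapOutput));
  }
}
```

Note that String keys sort lexicographically here, whereas a numeric Hadoop key type would sort numerically; the two agree for the equal-width keys used in this sketch.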

If you want to test Hadoop's sorting semantics yourself, create two files. Put the letters
'a' through 'z' in one file, one-per-line, in ascending order. In the other file, put the
same lines in descending order ('z' down to 'a').
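
If it saves some typing, the two files could be generated with a small helper along these lines. This is only a sketch: the class name MakeInputs and the file names ascending.txt and descending.txt are made up for illustration, and it assumes the job runs against the local filesystem (otherwise the files would need to be copied into HDFS).

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper that writes the two test input files.
public class MakeInputs {

  /** Returns the letters 'a' through 'z', one per list element. */
  public static List<String> letters() {
    List<String> lines = new ArrayList<>();
    for (char c = 'a'; c <= 'z'; c++) {
      lines.add(String.valueOf(c));
    }
    return lines;
  }

  /** Writes one ascending and one descending file into the given directory. */
  public static void writeInputs(Path dir) throws IOException {
    Files.createDirectories(dir);
    List<String> ascending = letters();
    List<String> descending = new ArrayList<>(ascending);
    Collections.reverse(descending);
    // Files.write terminates each line with the platform line separator.
    Files.write(dir.resolve("ascending.txt"), ascending, StandardCharsets.UTF_8);
    Files.write(dir.resolve("descending.txt"), descending, StandardCharsets.UTF_8);
  }

  public static void main(String[] args) throws IOException {
    writeInputs(Paths.get("in"));
  }
}
```

On a platform whose line separator is more than one byte, the byte offsets of the lines will differ from those on a system that uses a single newline character.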

Then run this program:

{code}
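import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class Foo {
  public static void main(String[] args) throws Exception {
    Job job = new Job();
    job.setJarByClass(Foo.class);
    // Mapper and reducer classes are deliberately left unset.
    FileInputFormat.addInputPath(job, new Path("in"));
    FileOutputFormat.setOutputPath(job, new Path("out"));
    job.setNumReduceTasks(1);  // a single reducer receives every key
    job.waitForCompletion(true);
  }
}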
{code}

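This will use the identity mapper and reducer. The keys in the output are the byte offsets
of each line within its input file, as produced by the default TextInputFormat.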

A fragment of the output I get is:
{code}
44	d
44	w
46	x
46	c
{code}

... demonstrating that the values are not necessarily sorted within a key.

I think that MRUnit has helped you catch a bug in your deduplication reducer :)
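
As a rough illustration of what I mean, here is a plain-Java sketch (no Hadoop types; the class and method names are invented for this example) that contrasts the compare-with-previous approach from the reducer quoted below with one that remembers every value it has seen. Only the second gives the same answer regardless of the order in which the values arrive.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical comparison of two deduplication strategies.
public class DedupSketch {

  /** Compares each value only with its predecessor; relies on sorted input. */
  public static List<String> dedupAdjacent(List<String> values) {
    List<String> out = new ArrayList<>();
    String last = null;
    for (String v : values) {
      if (!v.equals(last)) {
        out.add(v);
        last = v;
      }
    }
    return out;
  }

  /** Tracks every value seen so far, so input order does not matter. */
  public static List<String> dedupWithSet(List<String> values) {
    Set<String> seen = new HashSet<>();
    List<String> out = new ArrayList<>();
    for (String v : values) {
      if (seen.add(v)) {  // add() returns false for a value already present
        out.add(v);
      }
    }
    return out;
  }

  public static void main(String[] args) {
    List<String> grouped = Arrays.asList("bar", "bar", "foo");
    List<String> interleaved = Arrays.asList("bar", "foo", "bar");
    System.out.println(dedupAdjacent(grouped));      // [bar, foo]
    System.out.println(dedupAdjacent(interleaved));  // [bar, foo, bar]
    System.out.println(dedupWithSet(grouped));       // [bar, foo]
    System.out.println(dedupWithSet(interleaved));   // [bar, foo]
  }
}
```

The set-based version holds every distinct value for a key in memory. In a real reducer the value object is reused between iterations, so each value would also need to be copied before being stored.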


> MRUnit Should Sort Reduce Input
> -------------------------------
>
>                 Key: MAPREDUCE-1216
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1216
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>    Affects Versions: 0.20.1
>         Environment: Cloudera Distribution for Hadoop 0.20.1 + 133
>            Reporter: Ed Kohlwey
>   Original Estimate: 3h
>  Remaining Estimate: 3h
>
> MRUnit should sort the input for a reduce task, the same way hadoop does.
> This is useful if you have a reduce task that, for instance, removes duplicate key value pairs.
> example:
> {code:java}
> class BadReducer extends Reducer{
> public void reduce(...){
>  Text last = new Text();
>  for(Text text: values){
>    if(!text.equals(last)){
>      context.write(key, text);
>      last.set(text);
>     }
>   }
>  }
> }
> {code}
> {code:java}
> ReduceDriver driver = new ReduceDriver();
> driver.setInputKey("foo");
> driver.addInputValue("bar");
> driver.addInputValue("bar");
> driver.addInputValue("foo");
> {code}
> produces different results than 
> {code:java}
> ReduceDriver driver = new ReduceDriver();
> driver.setInputKey("foo");
> driver.addInputValue("bar");
> driver.addInputValue("foo");
> driver.addInputValue("bar");
> {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

