hadoop-common-dev mailing list archives

From "Owen O'Malley (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-346) Generic 'Sort' Infrastructure for Map-Reduce framework.
Date Thu, 06 Jul 2006 17:59:30 GMT
    [ http://issues.apache.org/jira/browse/HADOOP-346?page=comments#action_12419572 ] 

Owen O'Malley commented on HADOOP-346:
--------------------------------------

This should be done by implementing a new WritableComparator, which can be selected by calling
JobConf.setKeyOutputComparator(). This approach does not require changing the framework's sort
code. The configuration should be done via the job conf along the lines of:

conf.set("comparator.generic.utf8.keys",  "2,4,3"); // columns 2, 4, and 3 are the sort key
conf.set("comparator.generic.utf8.deliminator", " "); // how to split columns
conf.setBoolean("comparator.generic.utf8.reverse", true); // sort backwards
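To make the intended behaviour concrete, here is a minimal standalone sketch of the comparison
logic those three settings describe. It is not Hadoop code: the class name ColumnKeyComparator
is hypothetical, it implements plain java.util.Comparator<String> rather than WritableComparator,
and it assumes column numbers are 1-based (as in unix sort) and that a missing column compares
as the empty string.

```java
import java.util.Comparator;
import java.util.regex.Pattern;

/** Hypothetical sketch: compares delimited strings on selected columns. */
class ColumnKeyComparator implements Comparator<String> {
    private final int[] columns;     // 1-based column numbers, highest priority first (assumption)
    private final String delimiter;  // literal separator, not a regex
    private final boolean reverse;   // true to invert the resulting order

    ColumnKeyComparator(String keys, String delimiter, boolean reverse) {
        String[] parts = keys.split(",");
        this.columns = new int[parts.length];
        for (int i = 0; i < parts.length; i++) {
            this.columns[i] = Integer.parseInt(parts[i].trim());
        }
        this.delimiter = delimiter;
        this.reverse = reverse;
    }

    @Override
    public int compare(String a, String b) {
        // Quote the delimiter so characters such as "|" are taken literally.
        String[] fa = a.split(Pattern.quote(delimiter), -1);
        String[] fb = b.split(Pattern.quote(delimiter), -1);
        for (int col : columns) {
            // Assumption: a row with too few columns sorts as if the column were empty.
            String x = col <= fa.length ? fa[col - 1] : "";
            String y = col <= fb.length ? fb[col - 1] : "";
            int c = reverse ? y.compareTo(x) : x.compareTo(y);
            if (c != 0) {
                return c;
            }
        }
        return 0; // all selected columns match; no whole-key tie-break
    }
}
```

With the values from the settings above, this would be constructed as
new ColumnKeyComparator("2,4,3", " ", true) and could be passed to java.util.List.sort() to try
it out. A real implementation would wrap the same logic in a WritableComparator and read the
three values from the job conf.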

I assume this comparator is just limited to keys that are UTF8. The corresponding comparator
for Hadoop record io would also make sense, but there the fields would be given by name. For
example,

  class Foo {
     int field1;
     int field2;
     ustring field3;
  }

You'd like to set "comparator.generic.record.keys" to "field3,field2". But the record io generic
comparator is obviously a different bug. *smile*
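As a rough illustration of what selecting fields by name could look like, the sketch below builds
a comparator from a "field3,field2" style string. Everything here is hypothetical: Foo is written
by hand as a plain Java class (with ustring mapped to String), FooFieldComparators is an invented
helper, and none of it uses the record io generated code or any Hadoop API.

```java
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;

/** Hand-written stand-in for the record above (ustring mapped to String). */
class Foo {
    final int field1;
    final int field2;
    final String field3;

    Foo(int field1, int field2, String field3) {
        this.field1 = field1;
        this.field2 = field2;
        this.field3 = field3;
    }
}

/** Hypothetical helper: builds a Foo comparator from a list of field names. */
class FooFieldComparators {
    private static final Map<String, Comparator<Foo>> BY_FIELD = new HashMap<>();
    static {
        BY_FIELD.put("field1", Comparator.comparingInt((Foo f) -> f.field1));
        BY_FIELD.put("field2", Comparator.comparingInt((Foo f) -> f.field2));
        BY_FIELD.put("field3", Comparator.comparing((Foo f) -> f.field3));
    }

    /** keys is a comma-separated list of field names, highest priority first. */
    static Comparator<Foo> fromKeys(String keys) {
        Comparator<Foo> result = null;
        for (String name : keys.split(",")) {
            Comparator<Foo> next = BY_FIELD.get(name.trim());
            if (next == null) {
                throw new IllegalArgumentException("unknown field: " + name);
            }
            // Chain comparators so later fields only break ties in earlier ones.
            result = (result == null) ? next : result.thenComparing(next);
        }
        return result;
    }
}
```

FooFieldComparators.fromKeys("field3,field2") then orders Foo instances by field3 first and by
field2 on ties, ignoring field1.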

You won't be able to implement a stable sort without a lot of work. Do you have applications
that need stable sorts?

> Generic 'Sort' Infrastructure for Map-Reduce framework.
> -------------------------------------------------------
>
>          Key: HADOOP-346
>          URL: http://issues.apache.org/jira/browse/HADOOP-346
>      Project: Hadoop
>         Type: New Feature
>   Components: mapred
>     Reporter: Arun C Murthy
>     Assignee: Arun C Murthy
>
> It would be useful to add a generic *sort* infrastructure to the Map-Reduce framework to ease usage.
> Specifically, the idea is to add a fairly generic and powerful *comparator* which can be configured by the user to meet his specific needs.
> Spec:
> --------
>  
>   The proposal is to model a generic (uber) comparator along the lines of the standard unix *sort* command. The comparator provides the following (configurable) functionality:
>   a) Separator for breaking up the data (stream) into 'columns'.
>   b) Multiple key ranges for specifying priorities of 'columns' (a la the --key/-k option of unix sort, i.e. -k 2,3 -k 1,4 etc.).
>   c) A variant of a) to let the user specify byte range boundaries without using a separator for 'columns'.
>   d) Option to sort 'reverse'.
>   e) Option to do a 'stable' sort, i.e. don't do a last-ditch comparison of all bytes if all key ranges match.
>   f) Option to do 'numeric' comparisons instead of lexicographical comparisons?
>   Of course all these are optional, with the default behaviour as-is today.
>      - * - * -
>  Anything more/less?
> thanks,
> Arun

