hadoop-common-dev mailing list archives

From "eric baldeschwieler (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-346) Generic 'Sort' Infrastructure for Map-Reduce framework.
Date Thu, 06 Jul 2006 19:30:30 GMT
    [ http://issues.apache.org/jira/browse/HADOOP-346?page=comments#action_12419587 ] 

eric baldeschwieler commented on HADOOP-346:
--------------------------------------------

I'd like to see the comparator config explicitly modelled after unix sort, unless a better
model exists in the Java world. Could we add the spec for how it is configured here? Maybe
a single string with unix sort options in it?

This should not be limited to UTF-8. We should be able to split on bytes or lengths and handle
binary keys too.
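
To make the byte-range point concrete, here is a minimal sketch of comparing binary keys on a fixed byte range, with no separator and no character decoding. The class name ByteRangeComparator and its constructor parameters are made up for illustration; nothing like this is specified in the issue yet. Bytes are compared as unsigned values, and a key that ends inside the range sorts before a longer one.

```java
import java.util.Comparator;

// Hypothetical sketch: compares two binary keys on bytes [start, end) only.
final class ByteRangeComparator implements Comparator<byte[]> {
    private final int start;       // inclusive byte offset into the key
    private final int end;         // exclusive byte offset into the key
    private final boolean reverse; // flip the resulting order

    ByteRangeComparator(int start, int end, boolean reverse) {
        if (start < 0 || end < start) {
            throw new IllegalArgumentException("bad byte range");
        }
        this.start = start;
        this.end = end;
        this.reverse = reverse;
    }

    @Override
    public int compare(byte[] a, byte[] b) {
        // Clip the range to each key's actual length.
        int aEnd = Math.min(end, a.length);
        int bEnd = Math.min(end, b.length);
        int result = 0;
        for (int i = start; i < aEnd && i < bEnd; i++) {
            int x = a[i] & 0xff; // treat bytes as unsigned
            int y = b[i] & 0xff;
            if (x != y) {
                result = x - y;
                break;
            }
        }
        if (result == 0) {
            // All shared bytes match: the key with fewer bytes in range sorts first.
            result = Math.max(aEnd - start, 0) - Math.max(bEnd - start, 0);
        }
        return reverse ? -result : result;
    }
}
```

For example, new ByteRangeComparator(1, 3, false) looks only at the second and third bytes of each key and ignores everything else.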

> Generic 'Sort' Infrastructure for Map-Reduce framework.
> -------------------------------------------------------
>
>          Key: HADOOP-346
>          URL: http://issues.apache.org/jira/browse/HADOOP-346
>      Project: Hadoop
>         Type: New Feature

>   Components: mapred
>     Reporter: Arun C Murthy
>     Assignee: Arun C Murthy

>
> It would be useful to add a generic *sort* infrastructure to the Map-Reduce framework to ease usage.
> Specifically, the idea is to add a fairly generic and powerful *comparator* which can be configured by the user to meet his specific needs.
> Spec:
> --------
>  
>   The proposal is to model a generic (uber) comparator along the lines of the standard unix *sort* command. The comparator provides the following (configurable) functionality:
>   a) Separator for breaking up the data (stream) into 'columns'.
>   b) Multiple key ranges for specifying priorities of 'columns' (a la the --key/-k option of unix sort, i.e. -k 2,3 -k 1,4 etc.).
>   c) A variant of a) to let the user specify byte range boundaries without using a separator for 'columns'.
>   d) Option to sort 'reverse'.
>   e) Option to do a 'stable' sort, i.e. don't do a last-ditch comparison of all bytes if all key ranges match.
>   f) Option to do 'numeric' comparisons instead of lexicographical comparisons?
>   Of course all these are optional, with the default behaviour as-is today.
>      - * - * -
>  Anything more/less?
> thanks,
> Arun
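
As a rough sketch of how items a), b), d), e) and f) of the quoted spec could fit together, the following comparator splits text keys on a separator and walks a list of key ranges in priority order. All names here (ColumnComparator, KeySpec, the constructor arguments) are assumptions made for illustration and are not an existing Hadoop API. It also simplifies unix sort: a numeric key is expected to cover a single column, and a field that does not parse as a number is treated as 0.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch of a unix-sort-like comparator over separator-delimited keys.
final class ColumnComparator implements Comparator<String> {

    // One "-k first,last" style key: 1-based inclusive column range plus flags.
    static final class KeySpec {
        final int first;
        final int last;
        final boolean numeric;
        final boolean reverse;

        KeySpec(int first, int last, boolean numeric, boolean reverse) {
            this.first = first;
            this.last = last;
            this.numeric = numeric;
            this.reverse = reverse;
        }
    }

    private final String separator;
    private final List<KeySpec> keys; // earlier entries have higher priority
    private final boolean stable;     // true: skip the last-ditch whole-key compare

    ColumnComparator(String separator, List<KeySpec> keys, boolean stable) {
        this.separator = separator;
        this.keys = keys;
        this.stable = stable;
    }

    @Override
    public int compare(String a, String b) {
        String[] ca = a.split(Pattern.quote(separator), -1);
        String[] cb = b.split(Pattern.quote(separator), -1);
        for (KeySpec k : keys) {
            String fa = field(ca, k);
            String fb = field(cb, k);
            int c = k.numeric
                    ? Double.compare(toNumber(fa), toNumber(fb))
                    : fa.compareTo(fb);
            if (c != 0) {
                return k.reverse ? -c : c;
            }
        }
        // Last-ditch comparison of the whole key unless a stable sort was requested.
        return stable ? 0 : a.compareTo(b);
    }

    // Joins columns first..last (clipped to what the key actually has).
    private String field(String[] cols, KeySpec k) {
        int from = Math.max(k.first, 1) - 1;
        int to = Math.min(k.last, cols.length);
        if (from >= to) {
            return "";
        }
        return String.join(separator, Arrays.asList(cols).subList(from, to));
    }

    // Simplification: unparsable fields compare as 0.
    private static double toNumber(String s) {
        try {
            return Double.parseDouble(s.trim());
        } catch (NumberFormatException e) {
            return 0.0;
        }
    }
}
```

With a tab separator, a numeric key on column 2 followed by a reversed key on column 1 corresponds roughly to `sort -k 2,2n -k 1,1r` on the command line. Item c) (byte ranges without a separator) is not covered by this sketch.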

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira

