hadoop-common-dev mailing list archives

From "arkady borkovsky (JIRA)" <j...@apache.org>
Subject [jira] Created: (HADOOP-765) Hadoop Streaming should (optionally) sort on secondary key
Date Thu, 30 Nov 2006 00:27:21 GMT
Hadoop Streaming should (optionally) sort on secondary key

                 Key: HADOOP-765
                 URL: http://issues.apache.org/jira/browse/HADOOP-765
             Project: Hadoop
          Issue Type: Improvement
            Reporter: arkady borkovsky

This is related to HADOOP-485

As described in HADOOP-485 and HADOOP-686, many algorithms need the values to arrive in a specific order.
(The most prominent is JOIN: in a MapReduce implementation of JOIN, the value has to indicate
which "table" the record comes from. It is very useful to have the records from the smaller
"table" come first.)
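
To illustrate why the order matters, here is a minimal Python sketch of a reduce-side join. It
assumes each value carries a hypothetical tag prefix ("A:" for the smaller table, "B:" for the
larger one) and that, for each key, all "A" records arrive before any "B" record. The tags, the
record layout, and the function name are made up for this illustration and are not part of Hadoop.

```python
from itertools import groupby


def join_sorted_records(records):
    """Join tagged (key, value) pairs that arrive grouped by key,
    with the smaller table's records ("A:") before the larger's ("B:").

    Tags and record layout are assumptions made for this illustration.
    """
    for key, group in groupby(records, key=lambda kv: kv[0]):
        small_side = []  # only the smaller table is buffered in memory
        for _, value in group:
            tag, payload = value.split(":", 1)
            if tag == "A":
                small_side.append(payload)
            else:
                # Every "A" record for this key has already been seen,
                # so each "B" record can be joined and emitted at once.
                for a in small_side:
                    yield key, a, payload


records = [
    ("k1", "A:alice"),
    ("k1", "B:order-1"),
    ("k1", "B:order-2"),
    ("k2", "B:order-3"),  # no matching "A" record, so nothing is emitted
]
joined = list(join_sorted_records(records))
```

Because the smaller table comes first, only its records need to be held in memory; records from
the larger table are streamed through. Without the ordering guarantee, the reducer would have to
buffer both sides.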

(a) Once HADOOP-485 is implemented, it should be propagated to Streaming so that sorting by
the secondary key is done without writing any code, just by specifying a parameter.

(b) Alternatively, as Hadoop Streaming records are lines of text with the key(s) separated from
the value by a tab, a simple hack of running a sort on the MERGED input of reduce will work
fine. This may be a quite efficient and easy way to implement this important feature without
relying on HADOOP-485.
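
The idea in (b) can be sketched in Python as follows. This is only an illustration under stated
assumptions: the function name is hypothetical, the whole value is treated as the secondary key,
and the input is small enough to sort in memory. A real implementation inside Streaming would
more likely need an external sort.

```python
def sort_reduce_input(lines):
    """Order tab-separated "key<TAB>value" lines by key, then by value.

    Treating the whole value as the secondary key is an assumption
    made for this sketch.
    """
    def sort_key(line):
        # Split on the first tab only, so tabs inside the value are kept.
        key, _, value = line.rstrip("\n").partition("\t")
        return key, value

    return sorted(lines, key=sort_key)


# Merged reduce input: already grouped by key, values in arbitrary order.
merged = [
    "k1\tB:order-2\n",
    "k1\tA:alice\n",
    "k1\tB:order-1\n",
    "k2\tB:order-3\n",
]
ordered = sort_reduce_input(merged)
```

After the sort, the values for each key reach the reducer in order, so a value prefix such as the
hypothetical "A:" and "B:" tags used here decides which "table" comes first.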

This message is automatically generated by JIRA.
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

