hadoop-common-dev mailing list archives

From "Joydeep Sen Sarma (JIRA)" <j...@apache.org>
Subject [jira] Created: (HADOOP-2921) align map splits on sorted files with key boundaries
Date Sun, 02 Mar 2008 06:46:50 GMT
align map splits on sorted files with key boundaries

                 Key: HADOOP-2921
                 URL: https://issues.apache.org/jira/browse/HADOOP-2921
             Project: Hadoop Core
          Issue Type: New Feature
    Affects Versions: 0.16.0
            Reporter: Joydeep Sen Sarma

(this is something that we have implemented in the application layer - may be useful to have
in hadoop itself).

long-term log storage systems often keep data sorted (by some sort key). future computations
on such files can often benefit from this sort order. if a job requires grouping by the
sort key, then all records for a given key are adjacent in the file, so it should be possible
to do the reduction in the map stage itself.
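
as a rough illustration of why sorted input allows reduction in the map stage: when records
arrive in sort-key order, a single pass can finish each group as soon as the key changes. the
sketch below is plain java, not hadoop api; the class name SortedRunAggregator and the choice
of counting as the "reduction" are assumptions made only for this example.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper (not part of Hadoop): counts records per sort key in one
// streaming pass. Correct only if the input is already sorted by that key.
final class SortedRunAggregator {
    static List<Map.Entry<String, Integer>> countRuns(List<String> sortedKeys) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        String current = null;
        int count = 0;
        for (String key : sortedKeys) {
            // A key change means the previous group is complete and can be emitted.
            if (current != null && !key.equals(current)) {
                out.add(new AbstractMap.SimpleEntry<>(current, count));
                count = 0;
            }
            current = key;
            count++;
        }
        // Emit the final group, if there was any input at all.
        if (current != null) {
            out.add(new AbstractMap.SimpleEntry<>(current, count));
        }
        return out;
    }
}
```

the result is only right if no group is cut in two between map tasks, which is the problem
described next.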

this is not natively supported by hadoop (except in the degenerate case of one file per map
task), since a split boundary can fall in the middle of a run of records that share a sort key.
however, aligning the data read by each map task to sort-key boundaries is straightforward, and
this would be a useful capability to have in hadoop.
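
one possible alignment rule, sketched under simplifying assumptions: a run of equal sort keys
belongs to the split in which it starts. each reader skips leading records that continue a run
begun before its nominal start, and reads past its nominal end until the key changes. the
example works on record indexes in an in-memory list; a real reader would work on byte offsets
and would first have to find a record boundary, which is not shown. KeyAlignedSplit and align
are hypothetical names, not hadoop api.

```java
import java.util.List;

// Hypothetical sketch (not Hadoop API): adjusts a nominal split [start, end),
// expressed in record indexes, so that it covers only whole runs of equal keys.
final class KeyAlignedSplit {
    // Moves a boundary forward to the first position where the key differs
    // from the record just before it (or to the end of the data).
    private static int advanceToRunStart(List<String> keys, int pos) {
        int n = keys.size();
        int p = Math.min(pos, n);
        while (p > 0 && p < n && keys.get(p).equals(keys.get(p - 1))) {
            p++;
        }
        return p;
    }

    // Returns {alignedStart, alignedEnd}; the aligned range may be empty when
    // a single run covers the whole nominal split.
    static int[] align(List<String> keys, int start, int end) {
        int alignedStart = advanceToRunStart(keys, start);
        int alignedEnd = advanceToRunStart(keys, Math.max(end, alignedStart));
        return new int[] {alignedStart, alignedEnd};
    }
}
```

because the same rule is applied to the end of one split and the start of the next, adjacent
splits neither overlap nor leave gaps, as long as their nominal boundaries coincide.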

the definition of the sort key should be left up to the application through a generic
interface (it's not necessarily the key field in a SequenceFile). given that interface, the
SequenceFile and text file readers can use the extracted sort key to align map task data with
key boundaries.
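
a minimal sketch of what such an application-supplied interface might look like. the names
SortKeyExtractor, extractSortKey and FirstFieldExtractor are hypothetical and only illustrate
the idea; nothing here is an existing hadoop interface. the example assumes a text file whose
lines are sorted by their first tab-separated field.

```java
// Hypothetical interface (not Hadoop API): maps a record to the sort key the
// file is ordered by, which need not be the record's own key field.
interface SortKeyExtractor<K, V, S> {
    S extractSortKey(K key, V value);
}

// Example for text input, where the record key is assumed to be a byte offset
// and the value is the line. The sort key is the first tab-separated field.
final class FirstFieldExtractor implements SortKeyExtractor<Long, String, String> {
    @Override
    public String extractSortKey(Long offset, String line) {
        int tab = line.indexOf('\t');
        // A line without a tab is treated as consisting of the sort key alone.
        return tab < 0 ? line : line.substring(0, tab);
    }
}
```

a reader could compare the extracted keys of neighbouring records to decide where a map task's
data should begin and end.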

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
