hadoop-common-dev mailing list archives

From "Owen O'Malley (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-2921) align map splits on sorted files with key boundaries
Date Sun, 02 Mar 2008 19:16:50 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-2921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12574245#action_12574245 ]

Owen O'Malley commented on HADOOP-2921:
---------------------------------------

I don't think changing the semantics of the current sequence file record reader to do this
is a good idea. In the degenerate case, you could end up with a lot of your maps having no
inputs.

Joydeep, would a grouping comparator like the one we use to group the reduce inputs work here?
I assume it is the case that you'd want to group on a subset of the fields in the keys, since
that controls the sort.
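
To make the grouping idea concrete, here is a minimal sketch in Python rather than Hadoop code. It assumes a hypothetical composite key of (user, timestamp) and a hypothetical grouping function `group_on_user` that compares keys on the first field only; `reduce_in_map` is likewise an illustrative name. Because the input is already sorted on the full key, records that share a group are adjacent, so a single pass can reduce them.

```python
from itertools import groupby
from typing import Callable, Iterable, List, Tuple

Key = Tuple[str, int]      # hypothetical composite key: (user, timestamp)
Record = Tuple[Key, int]   # (key, value)


def group_on_user(key: Key) -> str:
    """Hypothetical grouping function: look at the first key field only."""
    return key[0]


def reduce_in_map(
    records: Iterable[Record],
    group_of: Callable[[Key], str],
    reducer: Callable[[List[int]], int],
) -> List[Tuple[str, int]]:
    """Reduce adjacent records whose keys fall into the same group.

    Assumes `records` is already sorted so that members of one group are
    contiguous; groupby only merges neighbours.
    """
    out = []
    for group, members in groupby(records, key=lambda kv: group_of(kv[0])):
        out.append((group, reducer([value for _, value in members])))
    return out


records = [
    (("alice", 1), 10),
    (("alice", 2), 5),
    (("bob", 1), 7),
]
totals = reduce_in_map(records, group_on_user, sum)
```

This only gives correct totals if no group is cut in two by a split boundary, which is the alignment problem the issue describes.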


> align map splits on sorted files with key boundaries
> ----------------------------------------------------
>
>                 Key: HADOOP-2921
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2921
>             Project: Hadoop Core
>          Issue Type: New Feature
>    Affects Versions: 0.16.0
>            Reporter: Joydeep Sen Sarma
>
> (this is something that we have implemented in the application layer - may be useful
> to have in hadoop itself).
> long term log storage systems often keep data sorted (by some sort-key). future computations
> on such files can often benefit from this sort order. if the job requires grouping by the
> sort-key - then it should be possible to do reduction in the map stage itself.
> this is not natively supported by hadoop (except in the degenerate case of 1 map file
> per task) since splits can span the sort-key. however aligning the data read by the map task
> to sort key boundaries is straightforward - and this would be a useful capability to have
> in hadoop.
> the definition of the sort key should be left up to the application (it's not necessarily
> the key field in a SequenceFile) through a generic interface - but otherwise - the sequencefile
> and text file readers can use the extracted sort key to align map task data with key boundaries.
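
One way the alignment described in the issue could work is sketched below in Python. This is an in-memory simplification under stated assumptions, not the reporter's implementation and not Hadoop's reader code: records are held in a list, split boundaries are record indices rather than byte offsets, and `aligned_range` and `sort_key` are hypothetical names. Each task skips leading records that continue a key group started in the previous split, then reads past its nominal end until the sort key changes.

```python
from typing import Callable, Hashable, Sequence, Tuple, TypeVar

R = TypeVar("R")


def aligned_range(
    records: Sequence[R],
    start: int,
    end: int,
    sort_key: Callable[[R], Hashable],
) -> Tuple[int, int]:
    """Return the [lo, hi) record indices read for the nominal split [start, end).

    Assumes `records` is sorted by `sort_key` and that 0 <= start <= end <= len(records).
    """
    n = len(records)
    lo = start
    # Skip records that belong to a key group begun before this split;
    # the previous task reads them when it runs past its own end.
    if lo > 0:
        prev_key = sort_key(records[lo - 1])
        while lo < n and sort_key(records[lo]) == prev_key:
            lo += 1
    # If the skipped group covers the whole split, the range is empty.
    hi = max(end, lo)
    # Read past the nominal end until the sort key changes.
    if 0 < hi < n:
        last_key = sort_key(records[hi - 1])
        while hi < n and sort_key(records[hi]) == last_key:
            hi += 1
    return lo, hi


def first_field(record: Tuple[str, int]) -> str:
    """Hypothetical sort-key extractor: the first field of the record."""
    return record[0]


log = [("a", 1), ("a", 2), ("b", 1), ("b", 2), ("b", 3), ("c", 1)]
nominal_splits = [(0, 2), (2, 4), (4, 6)]
aligned = [aligned_range(log, s, e, first_field) for s, e in nominal_splits]

# Degenerate input: one key covers every split, so only the first task gets data.
skewed = [("a", i) for i in range(6)]
skewed_aligned = [aligned_range(skewed, s, e, first_field) for s, e in nominal_splits]
```

With the sample `log`, the second task picks up the trailing "b" record from the third split, and the third task starts at "c". With `skewed`, the later tasks receive empty ranges, which is the situation where many maps end up with no input.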

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

