hadoop-common-dev mailing list archives

From "Joydeep Sen Sarma (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-2921) align map splits on sorted files with key boundaries
Date Sun, 02 Mar 2008 15:42:50 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-2921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12574223#action_12574223 ]

Joydeep Sen Sarma commented on HADOOP-2921:
-------------------------------------------

no - didn't override getSplit. i have an inputformat that opens two sequencefile readers,
one per split. the first split is the one handed down from the map task. the second is a
split that covers the rest of the file (positioned right after the map split).

we skip the first set of records in the map split, i.e. those before the first sort key
boundary (unless the split starts at offset 0). and we process the first set of records in
the next split. (this mirrors how sequencefiles work with sync markers, except that sort key
boundaries serve as the sync positions.)
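The skip-and-read-ahead rule above can be sketched roughly as follows. This is a minimal in-memory simulation in Python, not the actual inputformat: the function name `aligned_records`, the `(offset, sort_key, value)` tuple layout, and the sample data are all hypothetical, and a real implementation would stream from two sequencefile readers instead of indexing a list. The guard for a leading key group that runs past the end of the split is an assumption added here so that no key is read twice; the comment does not describe that case.

```python
from bisect import bisect_left

def aligned_records(records, start, end):
    """Return the records one map task would process for the split [start, end).

    records: list of (offset, sort_key, value) tuples, ordered by offset,
             with sort keys already in sorted order (hypothetical layout).
    """
    offsets = [r[0] for r in records]
    first = bisect_left(offsets, start)    # first record starting inside the split
    boundary = bisect_left(offsets, end)   # first record of the "rest of file" split

    i = first
    if start > 0 and i < len(records):
        # Skip the leading key group: the task that owns the previous
        # split reads it as part of its read-ahead.
        lead_key = records[i][1]
        while i < len(records) and records[i][1] == lead_key:
            i += 1

    if i > boundary:
        # Assumption: the leading group ran past the end of this split,
        # so an earlier task already covers everything up to that point.
        return []

    # Read ahead: take the first key group of the next split as well.
    j = boundary
    if j < len(records):
        tail_key = records[j][1]
        while j < len(records) and records[j][1] == tail_key:
            j += 1

    return records[i:j]

# Hypothetical sorted file: 8 records, 10 bytes each, file length 80.
RECORDS = [
    (0, "a", 1), (10, "a", 2),
    (20, "b", 3), (30, "b", 4), (40, "b", 5),
    (50, "c", 6),
    (60, "d", 7), (70, "d", 8),
]
```

With splits `[0, 25)`, `[25, 45)` and `[45, 80)` over `RECORDS`, each task ends up with whole key groups only, so a given sort key never appears in two map tasks.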

> align map splits on sorted files with key boundaries
> ----------------------------------------------------
>
>                 Key: HADOOP-2921
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2921
>             Project: Hadoop Core
>          Issue Type: New Feature
>    Affects Versions: 0.16.0
>            Reporter: Joydeep Sen Sarma
>
> (this is something that we have implemented in the application layer - may be useful
> to have in hadoop itself).
> long term log storage systems often keep data sorted (by some sort-key). future computations
> on such files can often benefit from this sort order. if the job requires grouping by the
> sort-key - then it should be possible to do reduction in the map stage itself.
> this is not natively supported by hadoop (except in the degenerate case of 1 map file
> per task) since splits can span the sort-key. however aligning the data read by the map task
> to sort key boundaries is straightforward - and this would be a useful capability to have
> in hadoop.
> the definition of the sort key should be left up to the application (it's not necessarily
> the key field in a Sequencefile) through a generic interface - but otherwise - the sequencefile
> and text file readers can use the extracted sort key to align map task data with key boundaries.
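
To illustrate the map-stage reduction described in the issue, here is a rough Python sketch. It assumes the records handed to a map task are already aligned on sort key boundaries, so each key group is complete within the task. The names `reduce_in_map`, `sort_key` and `reduce_fn`, and the tab-separated sample lines, are hypothetical; the application-supplied `sort_key` callable stands in for the generic sort-key interface the issue proposes and is not an existing Hadoop API.

```python
from itertools import groupby

def reduce_in_map(records, sort_key, reduce_fn):
    """Group consecutive records by an application-defined sort key and
    reduce each group inside the map task.

    Relies on the input being sorted by sort_key and aligned on key
    boundaries, so no group continues in another task.
    """
    for key, group in groupby(records, key=sort_key):
        yield key, reduce_fn(key, list(group))

# Hypothetical text records: "<user>\t<count>", sorted by user.
LINES = ["alice\t3", "alice\t4", "bob\t1", "carol\t2", "carol\t5"]

def user_of(line):
    # Application-defined sort key: the first tab-separated field.
    return line.split("\t")[0]

def total(key, group):
    # Example reduction: sum the second field over the whole key group.
    return sum(int(line.split("\t")[1]) for line in group)

RESULT = list(reduce_in_map(LINES, user_of, total))
```

Because the sort key is passed in as a function, the same grouping code works whether the key is the record key of a sequencefile or a field parsed out of a text line.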

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

