Mailing-List: contact core-dev-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: core-dev@hadoop.apache.org
Message-ID: <276099798.1204472690447.JavaMail.jira@brutus>
Date: Sun, 2 Mar 2008 07:44:50 -0800 (PST)
From: "Joydeep Sen Sarma (JIRA)" <jira@apache.org>
To: core-dev@hadoop.apache.org
Subject: [jira] Commented: (HADOOP-2921) align map splits on sorted files
 with key boundaries
In-Reply-To: <1547603062.1204440410951.JavaMail.jira@brutus>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


    [ https://issues.apache.org/jira/browse/HADOOP-2921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12574224#action_12574224 ] 

Joydeep Sen Sarma commented on HADOOP-2921:
-------------------------------------------

oh btw - the reason for doing it like this was that i wouldn't have been able to do it by subclassing SequenceFileInputFormat itself. most of the important variables are private, and i didn't want to change the core code, so i tried to keep it in the app layer.

but obviously it would be more efficient to implement this in the SequenceFile code itself.

> align map splits on sorted files with key boundaries
> ----------------------------------------------------
>
>                 Key: HADOOP-2921
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2921
>             Project: Hadoop Core
>          Issue Type: New Feature
>    Affects Versions: 0.16.0
>            Reporter: Joydeep Sen Sarma
>
> (this is something that we have implemented in the application layer - may be useful to have in hadoop itself).
> long term log storage systems often keep data sorted (by some sort-key). future computations on such files can often benefit from this sort order. if the job requires grouping by the sort-key - then it should be possible to do reduction in the map stage itself.
> this is not natively supported by hadoop (except in the degenerate case of one file per map task) since a split boundary can fall in the middle of a run of records that share the same sort-key. however, aligning the data read by the map task to sort-key boundaries is straightforward - and this would be a useful capability to have in hadoop.
> the definition of the sort key should be left up to the application through a generic interface (it's not necessarily the key field in a SequenceFile). given that, the SequenceFile and text file readers can use the extracted sort key to align map task data with key boundaries.
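
A minimal sketch of the alignment idea described above, not the implementation referred to in this issue. It is plain Java with no Hadoop classes: the sorted file is modeled as a list of already-extracted sort keys, and split boundaries are record indices rather than byte offsets. The class name KeyAlignedSplits and the method align are hypothetical names chosen for illustration. Each nominal boundary is pushed forward past any run of records whose sort key equals the key just before the boundary, so every key lands in exactly one split.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only: record indices stand in for byte offsets,
// and the sort keys are assumed to be already extracted by the application.
class KeyAlignedSplits {

    /**
     * Moves a nominal split boundary forward to the next sort-key boundary.
     * Records at or after the boundary that share the key of the record just
     * before it are left to the preceding split.
     */
    static int align(List<String> sortKeys, int boundary) {
        if (boundary <= 0) {
            return 0;                       // start of file is always aligned
        }
        if (boundary >= sortKeys.size()) {
            return sortKeys.size();         // end of file is always aligned
        }
        String previousKey = sortKeys.get(boundary - 1);
        int i = boundary;
        while (i < sortKeys.size() && sortKeys.get(i).equals(previousKey)) {
            i++;                            // still inside the previous key's run
        }
        return i;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("a", "a", "b", "b", "b", "c", "d", "d");
        int[] nominal = {0, 3, 6, 8};       // nominal boundaries, e.g. from a fixed split size
        for (int s = 0; s + 1 < nominal.length; s++) {
            int from = align(keys, nominal[s]);
            int to = align(keys, nominal[s + 1]);
            System.out.println("split " + s + " reads " + keys.subList(from, to));
        }
    }
}
```

Because the same function is applied to both ends of a split, adjacent splits tile the file with no gaps or overlaps: the nominal boundaries 0, 3, 6, 8 in this example become 0, 5, 6, 8. A split can end up empty if one key's run covers its whole nominal range.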
