Mailing-List: contact core-dev-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: core-dev@hadoop.apache.org
Message-ID: <2145320844.1204652982517.JavaMail.jira@brutus>
Date: Tue, 4 Mar 2008 09:49:42 -0800 (PST)
From: "Joydeep Sen Sarma (JIRA)" <jira@apache.org>
To: core-dev@hadoop.apache.org
Subject: [jira] Commented: (HADOOP-2921) align map splits on sorted files
 with key boundaries
In-Reply-To: <1547603062.1204440410951.JavaMail.jira@brutus>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


    [ https://issues.apache.org/jira/browse/HADOOP-2921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12575073#action_12575073 ] 

Joydeep Sen Sarma commented on HADOOP-2921:
-------------------------------------------

> Why do you prefer using values to keys?

We don't use keys at all. We are using Hadoop as a row-oriented database, where the value encodes a row. The sort field is embedded inside the row (i.e., the value) itself, so storing it in the key as well would be redundant. We save space by not putting it there. JAQL (and, I believe, Cascading) do the same. I am not sure about Pig.

The Partitioner interface is also handed both the key and the value, so partitioning on the value is already possible - there seems to be a precedent here.
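
To illustrate the precedent, here is a minimal sketch of a partitioner that ignores the key and partitions on a field inside the value. The `Partitioner` interface below is a local stand-in that mirrors the shape of Hadoop's `getPartition(key, value, numPartitions)` so the example compiles without Hadoop; the tab-separated row layout and the class names are assumptions made for illustration, not our actual row format.

```java
// Local stand-in mirroring the shape of Hadoop's
// Partitioner.getPartition(key, value, numPartitions),
// so the sketch compiles without Hadoop on the classpath.
interface Partitioner<K, V> {
    int getPartition(K key, V value, int numPartitions);
}

// Assumption: each row is tab-separated text and its first column is the sort field.
final class ValueFieldPartitioner implements Partitioner<Object, String> {
    @Override
    public int getPartition(Object key, String value, int numPartitions) {
        // The key is ignored entirely; only the row (value) is inspected.
        int tab = value.indexOf('\t');
        String sortField = (tab < 0) ? value : value.substring(0, tab);
        // Mask the sign bit so the result is never negative.
        return (sortField.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}

class ValuePartitionDemo {
    public static void main(String[] args) {
        Partitioner<Object, String> p = new ValueFieldPartitioner();
        // Rows sharing a sort field always land in the same partition.
        System.out.println(p.getPartition(null, "user42\tclick\t2008-03-04", 8));
        System.out.println(p.getPartition(null, "user42\tview\t2008-03-05", 8));
    }
}
```

The key can be null or empty here because nothing reads it; all routing information comes from the row.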

> align map splits on sorted files with key boundaries
> ----------------------------------------------------
>
>                 Key: HADOOP-2921
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2921
>             Project: Hadoop Core
>          Issue Type: New Feature
>    Affects Versions: 0.16.0
>            Reporter: Joydeep Sen Sarma
>
> (This is something that we have implemented in the application layer - it may be useful to have in Hadoop itself.)
> Long-term log storage systems often keep data sorted (by some sort key). Future computations on such files can often benefit from this sort order. If the job requires grouping by the sort key, then it should be possible to do the reduction in the map stage itself.
> This is not natively supported by Hadoop (except in the degenerate case of one file per map task), since a split boundary can fall in the middle of a run of records that share the same sort key. However, aligning the data read by the map task to sort-key boundaries is straightforward - and this would be a useful capability to have in Hadoop.
> The definition of the sort key should be left up to the application (it is not necessarily the key field in a SequenceFile) through a generic interface - but otherwise the SequenceFile and text file readers can use the extracted sort key to align map task data with key boundaries.
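
The alignment described in the issue can be sketched as follows. This is a simplified model under stated assumptions, not the implementation we run: the sorted file is an in-memory list, split boundaries are record indices rather than byte offsets, and `SortKeyExtractor`, `SplitAligner`, and `readAligned` are hypothetical names. The rule used is that a map task owns every key group whose first record falls inside its nominal split, so it skips a leading group that began in the previous split and reads past its end until the sort key changes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical application-supplied hook: pulls the sort key out of a record.
interface SortKeyExtractor<R> {
    String extract(R record);
}

final class SplitAligner {
    // Returns the records a map task should process for the nominal split
    // [start, end), where 0 <= start <= end, over a list sorted by sort key.
    static <R> List<R> readAligned(List<R> sorted, int start, int end,
                                   SortKeyExtractor<R> keys) {
        int size = sorted.size();
        int to = Math.min(end, size);
        int from = Math.min(start, to);

        // Skip records that continue a key group begun before this split;
        // the previous task reads them.
        if (from > 0) {
            String prev = keys.extract(sorted.get(from - 1));
            while (from < size && keys.extract(sorted.get(from)).equals(prev)) {
                from++;
            }
        }
        // No key group starts inside this split: nothing to process.
        if (from >= to) {
            return new ArrayList<R>();
        }
        // Read past the nominal end until the last key group is complete.
        String last = keys.extract(sorted.get(to - 1));
        while (to < size && keys.extract(sorted.get(to)).equals(last)) {
            to++;
        }
        return new ArrayList<R>(sorted.subList(from, to));
    }
}

class SplitAlignmentDemo {
    public static void main(String[] args) {
        // Assumption: rows are "sortKey<TAB>payload", already sorted by sortKey.
        List<String> rows = Arrays.asList(
            "a\t1", "a\t2", "b\t3", "b\t4", "b\t5", "c\t6", "d\t7", "d\t8");
        SortKeyExtractor<String> firstColumn = r -> r.substring(0, r.indexOf('\t'));
        int[][] nominalSplits = { {0, 3}, {3, 6}, {6, 8} };
        for (int[] s : nominalSplits) {
            System.out.println(SplitAligner.readAligned(rows, s[0], s[1], firstColumn));
        }
    }
}
```

With this rule every record is read by exactly one task and no key group is divided between tasks, which is what allows grouping by the sort key to be completed in the map stage. A split that lies entirely inside one long key group yields no records.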

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.