Return-Path:
Delivered-To: apmail-hadoop-core-dev-archive@www.apache.org
Received: (qmail 54814 invoked from network); 4 Mar 2008 17:50:58 -0000
Received: from hermes.apache.org (HELO mail.apache.org) (140.211.11.2)
    by minotaur.apache.org with SMTP; 4 Mar 2008 17:50:58 -0000
Received: (qmail 35068 invoked by uid 500); 4 Mar 2008 17:50:48 -0000
Delivered-To: apmail-hadoop-core-dev-archive@hadoop.apache.org
Received: (qmail 35024 invoked by uid 500); 4 Mar 2008 17:50:48 -0000
Mailing-List: contact core-dev-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
List-Help:
List-Unsubscribe:
List-Post:
List-Id:
Reply-To: core-dev@hadoop.apache.org
Delivered-To: mailing list core-dev@hadoop.apache.org
Received: (qmail 35008 invoked by uid 99); 4 Mar 2008 17:50:48 -0000
Received: from athena.apache.org (HELO athena.apache.org) (140.211.11.136)
    by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 04 Mar 2008 09:50:47 -0800
X-ASF-Spam-Status: No, hits=-2000.0 required=10.0
    tests=ALL_TRUSTED
X-Spam-Check-By: apache.org
Received: from [140.211.11.140] (HELO brutus.apache.org) (140.211.11.140)
    by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 04 Mar 2008 17:50:20 +0000
Received: from brutus (localhost [127.0.0.1])
    by brutus.apache.org (Postfix) with ESMTP id 7EA5B234C089
    for ; Tue, 4 Mar 2008 09:49:42 -0800 (PST)
Message-ID: <2145320844.1204652982517.JavaMail.jira@brutus>
Date: Tue, 4 Mar 2008 09:49:42 -0800 (PST)
From: "Joydeep Sen Sarma (JIRA)"
To: core-dev@hadoop.apache.org
Subject: [jira] Commented: (HADOOP-2921) align map splits on sorted files with key boundaries
In-Reply-To: <1547603062.1204440410951.JavaMail.jira@brutus>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
X-Virus-Checked: Checked by ClamAV on apache.org

    [ https://issues.apache.org/jira/browse/HADOOP-2921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12575073#action_12575073 ]

Joydeep Sen Sarma commented on HADOOP-2921:
-------------------------------------------

> Why do you prefer using values to keys?

We don't use keys at all. We are using Hadoop as a row-oriented database, where the value encodes a row. The sort field is embedded inside the row (i.e. the value) itself, and it would be redundant to store it in the key as well. So we save space and don't put it there.

JAQL (and I believe Cascading) do the same. I am not sure about Pig.

The Partitioner interface also allows partitioning based on both key and value, so there seems to be a precedent here.

> align map splits on sorted files with key boundaries
> ----------------------------------------------------
>
>                 Key: HADOOP-2921
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2921
>             Project: Hadoop Core
>          Issue Type: New Feature
>    Affects Versions: 0.16.0
>            Reporter: Joydeep Sen Sarma
>
> (this is something that we have implemented in the application layer - may be useful to have in hadoop itself).
> long term log storage systems often keep data sorted (by some sort-key). future computations on such files can often benefit from this sort order. if the job requires grouping by the sort-key - then it should be possible to do reduction in the map stage itself.
> this is not natively supported by hadoop (except in the degenerate case of 1 map file per task) since splits can span the sort-key. however aligning the data read by the map task to sort key boundaries is straightforward - and this would be a useful capability to have in hadoop.
> the definition of the sort key should be left up to the application (it's not necessarily the key field in a SequenceFile) through a generic interface - but otherwise - the SequenceFile and text file readers can use the extracted sort key to align map task data with key boundaries.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
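
To make the alignment described in the issue concrete, here is a minimal sketch in Python. It is not Hadoop code and not the implementation mentioned above. It assumes the sorted file is modelled as an in-memory list of rows, that a nominal split is a pair of record indices (standing in for byte offsets), and that the application supplies the sort key through a plain function. The names aligned_range, map_side_reduce, sort_key and reduce_fn are hypothetical. The rule it illustrates: a reader skips leading rows whose sort key continues a run begun in the previous split, and reads past its nominal end until the sort key changes.

```python
from itertools import groupby


def aligned_range(records, start, end, sort_key):
    """Align the nominal split [start, end) to sort-key boundaries.

    Returns (begin, stop) as record indices. `sort_key` is the
    application-supplied extractor (hypothetical interface).
    """
    n = len(records)
    begin = start
    if start > 0:
        prev_key = sort_key(records[start - 1])
        # A key run that began in the previous split belongs to that split.
        while begin < n and sort_key(records[begin]) == prev_key:
            begin += 1
    if begin >= end:
        # The whole nominal split was consumed by the previous reader.
        return (begin, begin)
    stop = min(end, n)
    # Read past the nominal end until the sort key changes.
    while stop < n and sort_key(records[stop]) == sort_key(records[stop - 1]):
        stop += 1
    return (begin, stop)


def map_side_reduce(records, start, end, sort_key, reduce_fn):
    """Group the aligned rows by sort key and reduce each group in the map stage."""
    begin, stop = aligned_range(records, start, end, sort_key)
    return [
        (key, reduce_fn(list(group)))
        for key, group in groupby(records[begin:stop], key=sort_key)
    ]


# Example: the sort field lives inside the row (the value); no separate key is stored.
rows = ["alice\t10", "alice\t5", "bob\t7", "bob\t1", "bob\t2", "carol\t4"]
user = lambda row: row.split("\t")[0]
total = lambda group: sum(int(row.split("\t")[1]) for row in group)

for start, end in [(0, 3), (3, 6)]:
    print((start, end), map_side_reduce(rows, start, end, user, total))
```

With the nominal splits (0, 3) and (3, 6), the run of "bob" rows straddles the boundary. The first reader extends to cover all three of them and the second starts at "carol", so every sort key is reduced by exactly one map task. A reader working on real byte offsets would also need a way to find the key of the record just before its split start, which this sketch gets for free from list indexing.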