From: "Joydeep Sen Sarma (JIRA)"
To: core-dev@hadoop.apache.org
Date: Sun, 2 Mar 2008 07:44:50 -0800 (PST)
Subject: [jira] Commented: (HADOOP-2921) align map splits on sorted files with key boundaries
Message-ID: <276099798.1204472690447.JavaMail.jira@brutus>
In-Reply-To: <1547603062.1204440410951.JavaMail.jira@brutus>

    [ https://issues.apache.org/jira/browse/HADOOP-2921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12574224#action_12574224 ]

Joydeep Sen Sarma commented on HADOOP-2921:
-------------------------------------------

oh btw - the reason for doing it like this was that i wouldn't have been able to do this by subclassing SequenceFileInputFormat itself. most of the important variables are private - and i didn't want to change the core code, so i tried to keep it in the app layer. but obviously it would be more efficient to implement this in the SequenceFile code itself.

> align map splits on sorted files with key boundaries
> ----------------------------------------------------
>
>                 Key: HADOOP-2921
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2921
>             Project: Hadoop Core
>          Issue Type: New Feature
>    Affects Versions: 0.16.0
>            Reporter: Joydeep Sen Sarma
>
> (this is something that we have implemented in the application layer - may be useful to have in hadoop itself).
> long term log storage systems often keep data sorted (by some sort-key). future computations on such files can often benefit from this sort order. if the job requires grouping by the sort-key - then it should be possible to do the reduction in the map stage itself.
> this is not natively supported by hadoop (except in the degenerate case of 1 map file per task) since splits can span the sort-key. however aligning the data read by the map task to sort key boundaries is straightforward - and this would be a useful capability to have in hadoop.
> the definition of the sort key should be left up to the application (it's not necessarily the key field in a SequenceFile) through a generic interface - but otherwise the SequenceFile and text file readers can use the extracted sort key to align map task data with key boundaries.
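
a minimal sketch of the alignment idea described in the issue, in plain Python rather than against the Hadoop API. everything here is an assumption made for illustration: the names `aligned_range` and `map_side_reduce` are hypothetical, record indices stand in for byte offsets, and the sorted file is held in memory as a list. the rule it shows is that a split owns every run of equal sort keys whose first record falls inside the nominal split, so both ends of the split are pushed forward to the next key boundary.

```python
from itertools import groupby


def aligned_range(records, start, end, sort_key):
    """Return (lo, hi) so that records[lo:hi] is the nominal split
    [start, end) moved forward to sort-key boundaries.

    records must already be sorted by sort_key. Indices stand in for
    byte offsets (an assumption of this sketch, not how Hadoop works).
    """
    n = len(records)

    def first_boundary(pos):
        # Smallest index >= pos at which a new key run begins, or n.
        if pos <= 0:
            return 0
        while pos < n and sort_key(records[pos]) == sort_key(records[pos - 1]):
            pos += 1
        return min(pos, n)

    # Both ends use the same rule, so adjacent splits tile the file
    # with no record read twice and none skipped.
    return first_boundary(start), first_boundary(end)


def map_side_reduce(records, splits, sort_key, reduce_fn):
    """Apply reduce_fn to each complete key run, one split at a time.

    The sort_key callable plays the role of the application-defined
    key extractor mentioned in the issue.
    """
    out = []
    for start, end in splits:
        lo, hi = aligned_range(records, start, end, sort_key)
        for key, run in groupby(records[lo:hi], key=sort_key):
            out.append((key, reduce_fn(list(run))))
    return out


if __name__ == "__main__":
    recs = [("a", 1), ("a", 2), ("b", 3), ("b", 4), ("b", 5), ("c", 6)]
    total = lambda run: sum(value for _, value in run)
    # The nominal splits cut the "b" run in half; alignment repairs that.
    print(map_side_reduce(recs, [(0, 3), (3, 6)], lambda r: r[0], total))
    # prints [('a', 3), ('b', 12), ('c', 6)]
```

because the first split reads past its nominal end until the key changes and the second split skips the tail of that run, each key is seen by exactly one map task, which is what makes the map-side reduction safe. a real reader would have to find the boundary by scanning forward from a byte offset, which this sketch does not attempt.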