Return-Path: X-Original-To: apmail-hadoop-common-dev-archive@www.apache.org Delivered-To: apmail-hadoop-common-dev-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 3CF2E19B2 for ; Tue, 26 Apr 2011 15:34:41 +0000 (UTC) Received: (qmail 14908 invoked by uid 500); 26 Apr 2011 15:34:39 -0000 Delivered-To: apmail-hadoop-common-dev-archive@hadoop.apache.org Received: (qmail 14843 invoked by uid 500); 26 Apr 2011 15:34:39 -0000 Mailing-List: contact common-dev-help@hadoop.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: common-dev@hadoop.apache.org Delivered-To: mailing list common-dev@hadoop.apache.org Received: (qmail 14833 invoked by uid 99); 26 Apr 2011 15:34:39 -0000 Received: from athena.apache.org (HELO athena.apache.org) (140.211.11.136) by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 26 Apr 2011 15:34:39 +0000 X-ASF-Spam-Status: No, hits=2.2 required=5.0 tests=HTML_MESSAGE,SPF_PASS X-Spam-Check-By: apache.org Received-SPF: pass (athena.apache.org: domain of masokan@syncsort.com designates 198.62.239.139 as permitted sender) Received: from [198.62.239.139] (HELO mailcas1.us.syncsort.com) (198.62.239.139) by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 26 Apr 2011 15:34:33 +0000 Received: from mailccr.us.syncsort.com ([fe80::114a:1402:3b4d:7191]) by mailcas1 ([192.168.172.100]) with mapi; Tue, 26 Apr 2011 11:34:09 -0400 From: "Asokan, M" To: "common-dev@hadoop.apache.org" CC: "Asokan, M" Date: Tue, 26 Apr 2011 11:34:09 -0400 Subject: A pluggable external sort for Hadoop MR Thread-Topic: A pluggable external sort for Hadoop MR Thread-Index: AcwEJ2MJ/k92XkGcSqusxuPJ41kkdA== Message-ID: <4DB6E5F1.8030900@syncsort.com> Accept-Language: en-US Content-Language: en-US X-MS-Has-Attach: X-MS-TNEF-Correlator: user-agent: Mozilla/5.0 (X11; U; Linux i686 (x86_64); en-US; rv:1.9.2.12) Gecko/20101027 Lightning/1.0b2 Thunderbird/3.1.6 acceptlanguage: en-US Content-Type: multipart/alternative; boundary="_000_4DB6E5F18030900syncsortcom_" MIME-Version: 1.0 --_000_4DB6E5F18030900syncsortcom_ Content-Type: text/plain; charset="Windows-1252" Content-Transfer-Encoding: quoted-printable Hi All, I am submitting this notice of intent to contribute to the Hadoop community= on behalf of Syncsort, Inc. (www.syncsort.com) an= interface for an external sorter. Although Hadoop MR (Map/Reduce) provide= s users with pluggable InputFormat, Mapper, Partitioner, Combiner, Reducer,= and OutputFormat it does not provide a plug-in for an external sorter. The= re is limited support to plug in a sorter class in the Map phase. The merg= e logic in the Reduce phase cannot be changed. Also, the sorting process i= s tightly coupled to the framework. The goal of our project is to decouple the sorting process and contribute a= defined clean interface to allow developers to easily plug in external sor= ters through this interface. THIS INTERFACE WILL BE INDEPENDENT FROM SYNCS= ORT=92S PROPRIETARY SOFTWARE PRODUCTS WHICH ARE NOT INTENDED TO BE CONTRIBU= TED. The following are some of the motivating factors for this project (not in a= ny order of significance): =B7 An external sort plug-in will promote innovative implementation= s by developers who have expertise in sort algorithms. =B7 Hadoop developers can experiment with different sort implementa= tions (in both the Map and Reduce phases) without modifying the framework c= ode. =B7 An external implementation of sort can be very well optimized t= o take advantage of OS and hardware architecture compared to the pure Java = implementation in Hadoop. =B7 The Hadoop implementation of sort is not self tuning. Users may= be overwhelmed by so many parameters to be specified to tune the performan= ce of sort. =B7 One of the top memory consumers in the MR child JVMs is the sor= t. Users are advised to set a reasonably high value for -mx argument to JV= M. Failure to do so will result in job termination. If the external sorter = is implemented as a subprocess, it can adjust its memory usage automaticall= y and make sure that it does not fail. Besides, the memory needed by the MR= child JVM can be reduced to a meager 128 MB. =B7 The performance of Hadoop sort may be at the mercy of JVM. See = LUCENE-2504 in Hadoop Jira for a related performance regression issue. An e= xternal sorter implemented in C or C++ and run as a subprocess will not suf= fer from these types of problems. =B7 ETL tool vendors can complement Hadoop's strengths namely HDFS,= job scheduling, restartability, etc. with their sort technologies. This wi= ll enable Hadoop to make inroads into IT shops that use traditional ETL too= ls. The goals of this project are: =B7 The primary goal of this project is to allow users to seamlessl= y plug in the external sorter to their existing MR applications. This is in= contrast to the approach taken by HCE (see MAPREDUCE-1270 in Hadoop Jira) = which requires users to code their MR applications in C++. =B7 A secondary goal is to enable users of existing ETL tools to ex= ploit Hadoop's distributed processing framework. We are confident there will be interest in this contribution to the code to= the Hadoop community. I intend to provide a reference implementation of th= e interfaces defined in the design. This reference implementation uses GNU = sort command to do the sorting of text data. -- Asokan M. Asokan Technology Architect =96 Data Integration Syncsort Incorporated 50 Tice Boulevard, Woodcliff Lake, NJ 07677 P: 201-930-8226 | F: 201-930-8281 E: masokan@syncsort.com www.syncsort.com Rethink the economics of data ________________ ________________________________ ATTENTION: ----- The information contained in this message (including any files transmitted = with this message) may contain proprietary, trade secret or other confident= ial and/or legally privileged information. Any pricing information containe= d in this message or in any files transmitted with this message is always c= onfidential and cannot be shared with any third parties without prior writt= en approval from Syncsort. This message is intended to be read only by the = individual or entity to whom it is addressed or by their designee. If the r= eader of this message is not the intended recipient, you are on notice that= any use, disclosure, copying or distribution of this message, in any form,= is strictly prohibited. If you have received this message in error, please= immediately notify the sender and/or Syncsort and destroy all copies of th= is message in your possession, custody or control. --_000_4DB6E5F18030900syncsortcom_--