From common-dev-return-75157-apmail-hadoop-common-dev-archive=hadoop.apache.org@hadoop.apache.org Tue Apr 26 15:42:16 2011 Return-Path: X-Original-To: apmail-hadoop-common-dev-archive@www.apache.org Delivered-To: apmail-hadoop-common-dev-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 721FC1BDE for ; Tue, 26 Apr 2011 15:42:16 +0000 (UTC) Received: (qmail 29564 invoked by uid 500); 26 Apr 2011 15:42:15 -0000 Delivered-To: apmail-hadoop-common-dev-archive@hadoop.apache.org Received: (qmail 29522 invoked by uid 500); 26 Apr 2011 15:42:15 -0000 Mailing-List: contact common-dev-help@hadoop.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: common-dev@hadoop.apache.org Delivered-To: mailing list common-dev@hadoop.apache.org Received: (qmail 29514 invoked by uid 99); 26 Apr 2011 15:42:15 -0000 Received: from nike.apache.org (HELO nike.apache.org) (192.87.106.230) by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 26 Apr 2011 15:42:15 +0000 X-ASF-Spam-Status: No, hits=-0.7 required=5.0 tests=FREEMAIL_FROM,MIME_QP_LONG_LINE,RCVD_IN_DNSWL_LOW,RFC_ABUSE_POST,SPF_PASS X-Spam-Check-By: apache.org Received-SPF: pass (nike.apache.org: domain of cbsmith@gmail.com designates 209.85.214.176 as permitted sender) Received: from [209.85.214.176] (HELO mail-iw0-f176.google.com) (209.85.214.176) by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 26 Apr 2011 15:42:07 +0000 Received: by iwr19 with SMTP id 19so934403iwr.35 for ; Tue, 26 Apr 2011 08:41:45 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=domainkey-signature:references:in-reply-to:mime-version :content-transfer-encoding:content-type:message-id:cc:x-mailer:from :subject:date:to; bh=q0CugGel5hESCsmbIS2XUSoXqD33Iel+z60daoe+p20=; b=PWgm7uWFCJNP+RAjxZYDcqpKg6rc4gBcZAamUn+cVEhz1PI9a+fVvm66ZQy65os1t+ za+fUvycpEjxp8S7INnhfcSymgvQq8gvCkf1lFqtBL0M/2aLEIrKTFfDO5XOQI97w5Ce QlrPCXKDR1vwgnk2EJJY1aBvqP5ZyO3Ky6rKc= DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; h=references:in-reply-to:mime-version:content-transfer-encoding :content-type:message-id:cc:x-mailer:from:subject:date:to; b=Salc9B8L6ZP60YeIzKoQjFp7EUr4eIg//kSLDFLEUsuCq4rxpiQgU9lScE8UBoM4Ta jaa0Vm5SUV9NEblEMlPHyUbcLCHDgBCNdv+7+DZAZMJWxEKDps64Op0enaZT/1qpQQls Ns7Fjn2l/YnrlmeTjofiHTreqsMfndURqRynE= Received: by 10.42.95.1 with SMTP id d1mr1103282icn.254.1303832505644; Tue, 26 Apr 2011 08:41:45 -0700 (PDT) Received: from [10.16.86.114] ([198.228.208.177]) by mx.google.com with ESMTPS id 19sm2683424ibx.52.2011.04.26.08.41.39 (version=TLSv1/SSLv3 cipher=OTHER); Tue, 26 Apr 2011 08:41:41 -0700 (PDT) References: <4DB6E5F1.8030900@syncsort.com> In-Reply-To: <4DB6E5F1.8030900@syncsort.com> Mime-Version: 1.0 (iPhone Mail 8H7) Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset=utf-8 Message-Id: Cc: "common-dev@hadoop.apache.org" , "Asokan, M" X-Mailer: iPhone Mail (8H7) From: Christopher Smith Subject: Re: A pluggable external sort for Hadoop MR Date: Tue, 26 Apr 2011 08:41:31 -0700 To: "common-dev@hadoop.apache.org" X-Virus-Checked: Checked by ClamAV on apache.org Aren't you worried that the overhead of shoving all that data through an ext= ernal sort facility would outweigh any benefits from the algo? --Chris On Apr 26, 2011, at 8:34 AM, "Asokan, M" wrote: > Hi All, >=20 > I am submitting this notice of intent to contribute to the Hadoop communit= y on behalf of Syncsort, Inc. (www.syncsort.com) an= interface for an external sorter. Although Hadoop MR (Map/Reduce) provides= users with pluggable InputFormat, Mapper, Partitioner, Combiner, Reducer, a= nd OutputFormat it does not provide a plug-in for an external sorter. There i= s limited support to plug in a sorter class in the Map phase. The merge log= ic in the Reduce phase cannot be changed. Also, the sorting process is tigh= tly coupled to the framework. >=20 >=20 >=20 > The goal of our project is to decouple the sorting process and contribute a= defined clean interface to allow developers to easily plug in external sort= ers through this interface. THIS INTERFACE WILL BE INDEPENDENT FROM SYNCSOR= T=E2=80=99S PROPRIETARY SOFTWARE PRODUCTS WHICH ARE NOT INTENDED TO BE CONTR= IBUTED. >=20 > The following are some of the motivating factors for this project (not in a= ny order of significance): > =C2=B7 An external sort plug-in will promote innovative implementa= tions by developers who have expertise in sort algorithms. > =C2=B7 Hadoop developers can experiment with different sort implem= entations (in both the Map and Reduce phases) without modifying the framewor= k code. > =C2=B7 An external implementation of sort can be very well optimiz= ed to take advantage of OS and hardware architecture compared to the pure Ja= va implementation in Hadoop. > =C2=B7 The Hadoop implementation of sort is not self tuning. Users= may be overwhelmed by so many parameters to be specified to tune the perfor= mance of sort. > =C2=B7 One of the top memory consumers in the MR child JVMs is the= sort. Users are advised to set a reasonably high value for -mx argument to= JVM. Failure to do so will result in job termination. If the external sorte= r is implemented as a subprocess, it can adjust its memory usage automatical= ly and make sure that it does not fail. Besides, the memory needed by the MR= child JVM can be reduced to a meager 128 MB. > =C2=B7 The performance of Hadoop sort may be at the mercy of JVM. S= ee LUCENE-2504 in Hadoop Jira for a related performance regression issue. An= external sorter implemented in C or C++ and run as a subprocess will not su= ffer from these types of problems. > =C2=B7 ETL tool vendors can complement Hadoop's strengths namely H= DFS, job scheduling, restartability, etc. with their sort technologies. This= will enable Hadoop to make inroads into IT shops that use traditional ETL t= ools. > The goals of this project are: > =C2=B7 The primary goal of this project is to allow users to seaml= essly plug in the external sorter to their existing MR applications. This is= in contrast to the approach taken by HCE (see MAPREDUCE-1270 in Hadoop Jira= ) which requires users to code their MR applications in C++. > =C2=B7 A secondary goal is to enable users of existing ETL tools t= o exploit Hadoop's distributed processing framework. >=20 > We are confident there will be interest in this contribution to the code t= o the Hadoop community. I intend to provide a reference implementation of th= e interfaces defined in the design. This reference implementation uses GNU s= ort command to do the sorting of text data. >=20 > -- Asokan >=20 > M. Asokan > Technology Architect =E2=80=93 Data Integration >=20 > Syncsort Incorporated > 50 Tice Boulevard, Woodcliff Lake, NJ 07677 > P: 201-930-8226 | F: 201-930-8281 > E: masokan@syncsort.com > www.syncsort.com >=20 > Rethink the economics of data > ________________ >=20 >=20 >=20 > ________________________________ >=20 >=20 > ATTENTION: ----- >=20 > The information contained in this message (including any files transmitted= with this message) may contain proprietary, trade secret or other confident= ial and/or legally privileged information. Any pricing information contained= in this message or in any files transmitted with this message is always con= fidential and cannot be shared with any third parties without prior written a= pproval from Syncsort. This message is intended to be read only by the indiv= idual or entity to whom it is addressed or by their designee. If the reader o= f this message is not the intended recipient, you are on notice that any use= , disclosure, copying or distribution of this message, in any form, is stric= tly prohibited. If you have received this message in error, please immediate= ly notify the sender and/or Syncsort and destroy all copies of this message i= n your possession, custody or control.