hadoop-common-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Jeff Hammerbacher <ham...@cloudera.com>
Subject Re: A pluggable external sort for Hadoop MR
Date Tue, 26 Apr 2011 18:00:14 GMT
Hey Asokan,

Could you please file a JIRA with your proposed enhancement so that the
discussion can be archived there? See
http://wiki.apache.org/hadoop/HowToContribute for more details on how to
contribute to Hadoop.


On Tue, Apr 26, 2011 at 9:46 AM, Asokan, M <masokan@syncsort.com> wrote:

> Hi Chris,
>   The overall elapsed time to run a sort depends on many factors other than
> the sort algorithm.  If you follow the data flow in MR from the point where
> sorting starts in Map phase to the point where <Key, Value> pairs are
> available for reduction in Reduce phase there are CPU and IO intensive
> activities happening.  You are right, passing data to an external process
> adds CPU cycles.  However, a well engineered implementation of the overall
> process can cut down the elapsed time.  From some of my experiments with a
> prototype implementation, I was able to cut down the elapsed time by about
> 40% to run some huge sorts(500 GB) on a modest cluster of 6 nodes.
> Besides, an external sorter can provide additional functionalities to
> Hadoop.  For example, on the Map side, an external sorter process can
> support filtering, reformatting, and aggregation in a single process with
> performance optimized for a multicore system.  With the current MR
> framework, filtering and reformatting happen before sorting and all these
> operations are very sequential in nature.  On the Reduce side, an external
> sorter can offer even exotic solution like Join since the external sorter
> implementation on the Reduce side is free to work on more than one
> stream(one from Hadoop MR shuffled data and the other from HDFS for
> example.)
> Thank you very much for your feedback.  If you any more questions, please
> let me know.
> -- Asokan
> On 04/26/2011 11:41 AM, Christopher Smith wrote:
> Aren't you worried that the overhead of shoving all that data through an
> external sort facility would outweigh any benefits from the algo?
> --Chris
> On Apr 26, 2011, at 8:34 AM, "Asokan, M" <masokan@syncsort.com><mailto:
> masokan@syncsort.com> wrote:
> Hi All,
> I am submitting this notice of intent to contribute to the Hadoop community
> on behalf of Syncsort, Inc. (www.syncsort.com<http://www.syncsort.com><
> http://www.syncsort.com><http://www.syncsort.com>) an interface for an
> external sorter.  Although Hadoop MR (Map/Reduce) provides users with
> pluggable InputFormat, Mapper, Partitioner, Combiner, Reducer, and
> OutputFormat it does not provide a plug-in for an external sorter. There is
> limited support to plug in a sorter class in the Map phase.  The merge logic
> in the Reduce phase cannot be changed.  Also, the sorting process is tightly
> coupled to the framework.
> The goal of our project is to decouple the sorting process and contribute a
> defined clean interface to allow developers to easily plug in external
> sorters through this interface.  THIS INTERFACE WILL BE INDEPENDENT FROM
> The following are some of the motivating factors for this project (not in
> any order of significance):
> ·         An external sort plug-in will promote innovative implementations
> by developers who have expertise in sort algorithms.
> ·         Hadoop developers can experiment with different sort
> implementations (in both the Map and Reduce phases) without modifying the
> framework code.
> ·         An external implementation of sort can be very well optimized to
> take advantage of OS and hardware architecture compared to the pure Java
> implementation in Hadoop.
> ·         The Hadoop implementation of sort is not self tuning. Users may
> be overwhelmed by so many parameters to be specified to tune the performance
> of sort.
> ·         One of the top memory consumers in the MR child JVMs is the sort.
>  Users are advised to set a reasonably high value for -mx argument to JVM.
> Failure to do so will result in job termination. If the external sorter is
> implemented as a subprocess, it can adjust its memory usage automatically
> and make sure that it does not fail. Besides, the memory needed by the MR
> child JVM can be reduced to a meager 128 MB.
> ·         The performance of Hadoop sort may be at the mercy of JVM. See
> LUCENE-2504 in Hadoop Jira for a related performance regression issue. An
> external sorter implemented in C or C++ and run as a subprocess will not
> suffer from these types of problems.
> ·         ETL tool vendors can complement Hadoop's strengths namely HDFS,
> job scheduling, restartability, etc. with their sort technologies. This will
> enable Hadoop to make inroads into IT shops that use traditional ETL tools.
> The goals of this project are:
> ·         The primary goal of this project is to allow users to seamlessly
> plug in the external sorter to their existing MR applications. This is in
> contrast to the approach taken by HCE (see MAPREDUCE-1270 in Hadoop Jira)
> which requires users to code their MR applications in C++.
> ·         A secondary goal is to enable users of existing ETL tools to
> exploit Hadoop's distributed processing framework.
> We are confident there will be interest in this contribution to the code to
> the Hadoop community. I intend to provide a reference implementation of the
> interfaces defined in the design. This reference implementation uses GNU
> sort command to do the sorting of text data.
> -- Asokan
> M. Asokan
> Technology Architect – Data Integration
> Syncsort Incorporated
> 50 Tice Boulevard, Woodcliff Lake, NJ 07677
> P: 201-930-8226 | F: 201-930-8281
> E: masokan@syncsort.com<mailto:masokan@syncsort.com><mailto:%
> 20masokan@syncsort.com><mailto:%20masokan@syncsort.com>
> www.syncsort.com<http://www.syncsort.com><http://www.syncsort.com/><
> http://www.syncsort.com/>
> Rethink the economics of data
> ________________
> ________________________________
> ATTENTION: -----
> The information contained in this message (including any files transmitted
> with this message) may contain proprietary, trade secret or other
> confidential and/or legally privileged information. Any pricing information
> contained in this message or in any files transmitted with this message is
> always confidential and cannot be shared with any third parties without
> prior written approval from Syncsort. This message is intended to be read
> only by the individual or entity to whom it is addressed or by their
> designee. If the reader of this message is not the intended recipient, you
> are on notice that any use, disclosure, copying or distribution of this
> message, in any form, is strictly prohibited. If you have received this
> message in error, please immediately notify the sender and/or Syncsort and
> destroy all copies of this message in your possession, custody or control.

  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message