From: "Mariappan Asokan (JIRA)"
To: mapreduce-issues@hadoop.apache.org
Date: Thu, 17 Jan 2013 23:04:15 +0000 (UTC)
Subject: [jira] [Commented] (MAPREDUCE-4808) Refactor MapOutput and MergeManager to facilitate reuse by Shuffle implementations

[ https://issues.apache.org/jira/browse/MAPREDUCE-4808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13556705#comment-13556705 ]

Mariappan Asokan commented on MAPREDUCE-4808:
---------------------------------------------

Hi Arun,

I will try to explain a simple use case of an external implementation of merge on the reduce side. Let us say this merge implementation has some fixed area of memory (Java byte array) allocated to store the shuffled data.
This may be done to avoid frequent garbage collection by the JVM or for better processor cache efficiency.

Looking at the methods in the {{Merge}} class, they accept input to the merge either as disk files (an array of {{Path}} objects) or as memory segments (a list of {{Segment}} objects). The former is not suitable because the merge is done in memory first and any intermediate merged output file is under the control of the plugin implementation. The latter is not suitable because the memory for the shuffled data is not under the control of the plugin implementation. Ideally, if an {{InputStream}} object is available, the external implementation can read shuffled data from the stream into the fixed area of memory at a specific offset in the byte array. With the {{MergeManagerPlugin}}, the external implementation will get the HTTP connection's {{InputStream}} object via the {{shuffle()}} method of the {{MapOutput}} object.

In addition, if the merge goes through multiple passes because the memory area is limited in size, there should be some way for the {{Shuffle}} to wait until memory is released by a merge pass. There is no method in {{Merge}} for that either.

I find that it is possible to define the interaction points between the current {{Shuffle}} and {{MergeManager}} using the {{MergeManagerPlugin}} interface. The plugin interface has only three methods and allows the external plugin a lot of freedom in its implementation. As a side effect, {{MapOutput}} is also refactored.

I hope I explained this well. If you have any questions, please let me know.
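
To make the use case above concrete, here is a minimal sketch of the two things the external merge needs: copying shuffled bytes from an {{InputStream}} into a fixed byte array at a chosen offset, and making the fetching side wait until a merge pass releases memory. This is not Hadoop code and not the {{MergeManagerPlugin}} interface; the class {{FixedArenaSketch}} and its methods {{reserve}}, {{readInto}}, and {{releaseAll}} are hypothetical names chosen for illustration, and the simple bump-pointer allocation is an assumption.

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

/**
 * Hypothetical sketch (not part of Hadoop): a fixed byte[] area that
 * shuffled map outputs are copied into. Fetchers block in reserve()
 * when the area is full and resume after a merge pass frees it.
 */
final class FixedArenaSketch {
    private final byte[] arena;   // allocated once, never resized
    private int nextFree = 0;     // bump pointer: first unused offset

    FixedArenaSketch(int capacity) {
        arena = new byte[capacity];
    }

    /** Blocks until 'length' bytes are free, then returns their offset. */
    synchronized int reserve(int length) throws InterruptedException {
        if (length < 0 || length > arena.length) {
            throw new IllegalArgumentException("cannot reserve " + length);
        }
        while (arena.length - nextFree < length) {
            wait();               // woken by releaseAll() after a merge pass
        }
        int offset = nextFree;
        nextFree += length;
        return offset;
    }

    /** Copies exactly 'length' bytes from the stream to arena[offset...]. */
    void readInto(InputStream in, int offset, int length) throws IOException {
        int done = 0;
        while (done < length) {
            int n = in.read(arena, offset + done, length - done);
            if (n < 0) {
                throw new EOFException("stream ended after " + done + " bytes");
            }
            done += n;
        }
    }

    /** Called once a merge pass has consumed everything in the area. */
    synchronized void releaseAll() {
        nextFree = 0;
        notifyAll();
    }

    byte byteAt(int index) {
        return arena[index];
    }

    synchronized int used() {
        return nextFree;
    }
}
```

In this sketch a fetcher thread would call {{reserve()}} with the map output's length, then {{readInto()}} with the connection's stream, while the merge thread calls {{releaseAll()}} at the end of each pass. Neither step can be expressed through methods that take only {{Path}} arrays or {{Segment}} lists.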
-- Asokan

> Refactor MapOutput and MergeManager to facilitate reuse by Shuffle implementations
> ----------------------------------------------------------------------------------
>
> Key: MAPREDUCE-4808
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-4808
> Project: Hadoop Map/Reduce
> Issue Type: New Feature
> Reporter: Arun C Murthy
> Assignee: Mariappan Asokan
> Attachments: COMBO-mapreduce-4809-4812-4808.patch, mapreduce-4808.patch, mapreduce-4808.patch, mapreduce-4808.patch, mapreduce-4808.patch, mapreduce-4808.patch, mapreduce-4808.patch, mapreduce-4808.patch, MergeManagerPlugin.pdf
>
>
> Now that Shuffle is pluggable (MAPREDUCE-4049), it would be convenient for alternate implementations to be able to reuse portions of the default implementation.
> This would come with the strong caveat that these classes are LimitedPrivate and Unstable.