Date: Fri, 7 Dec 2012 05:55:20 +0000 (UTC)
From: "Namit Jain (JIRA)"
To: hive-dev@hadoop.apache.org
Subject: [jira] [Commented] (HIVE-3733) Improve Hive's logic for conditional merge

    [ https://issues.apache.org/jira/browse/HIVE-3733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13526155#comment-13526155 ]

Namit Jain commented on HIVE-3733:
----------------------------------

[~pkamath], can you load the complete patch? Ideally, the sub-query should not matter. By the time you are looking at the FileSinkDesc for insert, the union should have been processed. You should not look at the complete stack, but the first operator which would break the tree, which is union in this case.
Thinking more about it, this approach is fairly difficult to get right. The more I think about it, the more I like the earlier idea of moving the merge to a physical optimizer. The tasks have already been broken up, and the stack would be well defined. We don't have to hack this up.

[~kevinwilfong], what do you think?

> Improve Hive's logic for conditional merge
> ------------------------------------------
>
>                 Key: HIVE-3733
>                 URL: https://issues.apache.org/jira/browse/HIVE-3733
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Pradeep Kamath
>            Assignee: Pradeep Kamath
>         Attachments: HIVE-3733.1.patch.txt, HIVE-3733.3.patch.txt, HIVE-3733.4.patch.txt
>
>
> If the config hive.merge.mapfiles is set to true and hive.merge.mapredfiles is set to false, then when Hive encounters a FileSinkOperator while generating map reduce tasks, it looks at the entire job to see whether it has a reducer; if it does, it will not merge. Instead, it should check whether the FileSinkOperator is a child of the reducer. This means that outputs generated in the mapper will be merged and outputs generated in the reducer will not be, which is the intended effect of setting those configs.
> Simple repro:
> set hive.merge.mapfiles=true;
> set hive.merge.mapredfiles=false;
> EXPLAIN
> FROM
> INSERT OVERWRITE TABLE SELECT key, COUNT(*) group by key
> INSERT OVERWRITE TABLE SELECT *;
> The output should contain a Conditional Operator, Mapred Stages, and Move tasks
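To make the difference described in the quoted issue concrete, here is a minimal Python sketch. It is a toy model, not Hive code: the `Op` class, the function names, and the tree shape are all hypothetical, and it assumes that a `ReduceSink` node marks the boundary between the map side and the reduce side. It contrasts the job-wide check (does the job have a reducer anywhere?) with the per-sink check (is this file sink below the reducer boundary?).

```python
class Op:
    """Toy operator node. `kind` is only a label, not a real Hive class."""

    def __init__(self, kind, parent=None):
        self.kind = kind
        self.parent = parent
        self.children = []

    def add(self, kind):
        """Attach and return a new child operator."""
        child = Op(kind, parent=self)
        self.children.append(child)
        return child


def job_has_reducer(root):
    """Job-wide check: is there a ReduceSink anywhere in the tree?"""
    stack = [root]
    while stack:
        node = stack.pop()
        if node.kind == "ReduceSink":
            return True
        stack.extend(node.children)
    return False


def runs_in_reducer(op):
    """Per-sink check: is there a ReduceSink among the ancestors of `op`?"""
    node = op.parent
    while node is not None:
        if node.kind == "ReduceSink":
            return True
        node = node.parent
    return False


def should_merge_per_job(root, merge_mapfiles, merge_mapredfiles):
    """Behavior described as the problem: one decision for the whole job."""
    return merge_mapredfiles if job_has_reducer(root) else merge_mapfiles


def should_merge_per_sink(sink, merge_mapfiles, merge_mapredfiles):
    """Behavior the issue asks for: decide from where the sink runs."""
    return merge_mapredfiles if runs_in_reducer(sink) else merge_mapfiles


# Hypothetical multi-insert plan: one branch writes from the mapper,
# the other aggregates and writes from the reducer.
scan = Op("TableScan")
map_sink = scan.add("Select").add("FileSink")
reduce_sink = scan.add("GroupBy").add("ReduceSink").add("GroupBy").add("FileSink")
```

With `merge_mapfiles=True` and `merge_mapredfiles=False`, the job-wide check refuses to merge either output because the job contains a reducer, while the per-sink check merges only the map-side output. This says nothing about where in Hive the check should live, which is the open question in the comment above.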