From: "Siying Dong (JIRA)"
To: hive-dev@hadoop.apache.org
Reply-To: dev@hive.apache.org
Date: Fri, 10 Jun 2011 07:34:58 +0000 (UTC)
Message-ID: <972397845.9980.1307691298884.JavaMail.tomcat@hel.zones.apache.org>
In-Reply-To: <418421579.2546.1307408098971.JavaMail.tomcat@hel.zones.apache.org>
Subject: [jira] [Updated] (HIVE-2201) reduce name node calls in hive by creating temporary directories

     [ https://issues.apache.org/jira/browse/HIVE-2201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Siying Dong updated HIVE-2201:
------------------------------

    Attachment: HIVE-2201.1.patch

Implemented the logic. Discovered one problem: when moving /tmp1/_tmp_1 to /tmp2/1, we might need to check whether /tmp2 exists before the move. This patch avoids that call by pre-creating the temp directory before submitting the job. However, we cannot do that for dynamic partitioning, because we don't know the directory names in advance. So for dynamic partitioning, some extra cost is added for DFS namenode reads. So far I think this tradeoff is worthwhile. The cost could potentially be reduced by caching the directories already created. We can try that approach as a follow-up.

> reduce name node calls in hive by creating temporary directories
> ----------------------------------------------------------------
>
>                 Key: HIVE-2201
>                 URL: https://issues.apache.org/jira/browse/HIVE-2201
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Namit Jain
>            Assignee: Siying Dong
>         Attachments: HIVE-2201.1.patch
>
>
> Currently, in Hive, when a file gets written by a FileSinkOperator,
> the sequence of operations is as follows:
> 1. In tmp directory tmp1, create a tmp file _tmp_1
> 2. At the end of the operator, move /tmp1/_tmp_1 to /tmp1/1
> 3. Move directory /tmp1 to /tmp2
> 4. For all files in /tmp2, remove all files starting with _tmp and
>    duplicate files.
> Due to speculative execution, a lot of temporary files are created
> in /tmp1 (or /tmp2). This leads to a lot of name node calls,
> especially for large queries.
> The protocol above can be modified slightly:
> 1. In tmp directory tmp1, create a tmp file _tmp_1
> 2. At the end of the operator, move /tmp1/_tmp_1 to /tmp2/1
> 3. Move directory /tmp2 to /tmp3
> 4. For all files in /tmp3, remove all duplicate files.
> This should reduce the number of tmp files.
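
For illustration only, the modified protocol can be sketched on a local filesystem standing in for HDFS. This is not the code in HIVE-2201.1.patch: the function names, the "taskid_attempt" file naming, and the rule of keeping the first attempt per task are all assumptions made for the example. It also shows why /tmp2 is pre-created before the "job" runs: the move in step 2 fails if the target directory is missing.

```python
import os
import tempfile


def write_task_output(scratch_dir, staging_dir, task_id, attempt_id, data):
    """Steps 1-2 for one task attempt (hypothetical helper)."""
    # Step 1: write into a temporary file in the scratch directory (tmp1).
    tmp_path = os.path.join(scratch_dir, f"_tmp_{task_id}_{attempt_id}")
    with open(tmp_path, "w") as f:
        f.write(data)
    # Step 2: move the finished file directly into the staging directory
    # (tmp2). The directory must already exist, so no existence check is
    # needed here; that is the call the pre-creation is meant to save.
    final_path = os.path.join(staging_dir, f"{task_id}_{attempt_id}")
    os.rename(tmp_path, final_path)
    return final_path


def commit_job(staging_dir, final_dir):
    """Steps 3-4 at the end of the job (hypothetical helper)."""
    # Step 3: a single directory move from tmp2 to tmp3.
    os.rename(staging_dir, final_dir)
    # Step 4: only duplicates from speculative attempts remain to be
    # removed; no _tmp files were ever moved into this directory.
    # Assumption: the part of the name before "_" is the task id, and the
    # first attempt in sorted order is the one that is kept.
    seen = set()
    for name in sorted(os.listdir(final_dir)):
        task_id = name.split("_")[0]
        if task_id in seen:
            os.remove(os.path.join(final_dir, name))
        else:
            seen.add(task_id)
    return sorted(os.listdir(final_dir))


def run_demo():
    root = tempfile.mkdtemp()
    scratch = os.path.join(root, "tmp1")
    staging = os.path.join(root, "tmp2")
    final = os.path.join(root, "tmp3")
    os.mkdir(scratch)
    os.mkdir(staging)  # pre-created before "submitting the job"
    # Task 000000 runs twice, as it might under speculative execution.
    write_task_output(scratch, staging, "000000", 0, "a")
    write_task_output(scratch, staging, "000000", 1, "a")
    write_task_output(scratch, staging, "000001", 0, "b")
    return commit_job(staging, final), os.listdir(scratch)
```

With dynamic partitioning the staging subdirectories are not known up front, so a sketch like this would need an existence check or a create call per new directory, which is the extra namenode cost described in the comment above.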

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira