Mailing-List: contact dev-help@hive.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@hive.apache.org
Date: Fri, 10 Jun 2011 07:34:58 +0000 (UTC)
From: "Siying Dong (JIRA)" <jira@apache.org>
To: hive-dev@hadoop.apache.org
Message-ID: 
 <972397845.9980.1307691298884.JavaMail.tomcat@hel.zones.apache.org>
In-Reply-To: 
 <418421579.2546.1307408098971.JavaMail.tomcat@hel.zones.apache.org>
Subject: [jira] [Updated] (HIVE-2201) reduce name node calls in hive by
 creating temporary directories
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


     [ https://issues.apache.org/jira/browse/HIVE-2201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Siying Dong updated HIVE-2201:
------------------------------

    Attachment: HIVE-2201.1.patch

Implemented the logic.
Discovered one problem: when moving /tmp1/_tmp_1 to /tmp2/1, we might need to check whether /tmp2 exists before the move. This patch avoids that call by pre-creating the temp directory before submitting the job. However, we cannot do that for dynamic partitioning because we don't know the directory names in advance, so dynamic partitioning incurs some extra DFS namenode reads. So far I think this tradeoff is worthwhile. This cost could potentially be reduced by caching the directories already created; we can try that approach as a follow-up.
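
To illustrate the caching idea mentioned as a possible follow-up, here is a minimal sketch. It is not taken from the patch: it uses the local filesystem in place of DFS, and the names DirCache, ensure_dir, move_into, and the partition directory "ds=2011-06-10" are all hypothetical. The counter fs_calls stands in for calls that would reach the namenode.

```python
import os
import tempfile


class DirCache:
    """Hypothetical helper: remembers directories it has already created,
    so repeated moves into the same directory skip the create/exists call."""

    def __init__(self):
        self.known = set()
        self.fs_calls = 0  # stand-in for calls that would hit the namenode

    def ensure_dir(self, path):
        if path in self.known:
            return  # cache hit: no filesystem call needed
        self.fs_calls += 1
        os.makedirs(path, exist_ok=True)
        self.known.add(path)

    def move_into(self, src, dest_dir, name):
        # Make sure the target directory exists, then rename the file into it.
        self.ensure_dir(dest_dir)
        dest = os.path.join(dest_dir, name)
        os.rename(src, dest)
        return dest


def demo():
    root = tempfile.mkdtemp()
    tmp1 = os.path.join(root, "tmp1")
    os.makedirs(tmp1)
    cache = DirCache()
    # With dynamic partitioning the directory name is only known at run time.
    part_dir = os.path.join(root, "tmp2", "ds=2011-06-10")
    for i in range(3):
        src = os.path.join(tmp1, f"_tmp_{i}")
        with open(src, "w") as f:
            f.write("row\n")
        cache.move_into(src, part_dir, str(i))
    return cache.fs_calls, sorted(os.listdir(part_dir))
```

In this sketch three files land in the same partition directory, but only the first move pays for the directory call; the cache is per process, so each task would still pay once per new directory.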

> reduce name node calls in hive by creating temporary directories
> ----------------------------------------------------------------
>
>                 Key: HIVE-2201
>                 URL: https://issues.apache.org/jira/browse/HIVE-2201
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Namit Jain
>            Assignee: Siying Dong
>         Attachments: HIVE-2201.1.patch
>
>
> Currently, in Hive, when a file gets written by a FileSinkOperator,
> the sequence of operations is as follows:
> 1. In tmp directory tmp1, create a tmp file _tmp_1
> 2. At the end of the operator, move
> /tmp1/_tmp_1 to /tmp1/1
> 3. Move directory /tmp1 to /tmp2
> 4. For all files in /tmp2, remove all files starting with _tmp and
> duplicate files.
> Due to speculative execution, a lot of temporary files are created
> in /tmp1 (or /tmp2). This leads to a lot of name node calls,
> especially for large queries.
> The protocol above can be modified slightly:
> 1. In tmp directory tmp1, create a tmp file _tmp_1
> 2. At the end of the operator, move
> /tmp1/_tmp_1 to /tmp2/1
> 3. Move directory /tmp2 to /tmp3
> 4. For all files in /tmp3, remove all duplicate files.
> This should reduce the number of tmp files.
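
The modified protocol quoted above can be sketched roughly as follows. This is a local-filesystem simulation, not Hive code: the function names, the "<task>_<attempt>" file naming, and the rule of keeping the first attempt per task are assumptions made for illustration only.

```python
import os
import tempfile


def finish_task(tmp1, tmp2, task_id, attempt, data):
    """Steps 1-2: write a temp file in tmp1, then move it straight into tmp2."""
    tmp_file = os.path.join(tmp1, f"_tmp_{task_id}_{attempt}")
    with open(tmp_file, "w") as f:
        f.write(data)
    final = os.path.join(tmp2, f"{task_id}_{attempt}")
    os.rename(tmp_file, final)
    return final


def commit_job(tmp2, tmp3):
    """Steps 3-4: move tmp2 to tmp3, then drop duplicate outputs per task.
    No scan for _tmp files is needed, since they never enter tmp2."""
    os.rename(tmp2, tmp3)
    seen = set()
    for name in sorted(os.listdir(tmp3)):
        task_id = name.split("_")[0]
        if task_id in seen:
            # A speculative attempt produced a second copy; remove it.
            os.remove(os.path.join(tmp3, name))
        else:
            seen.add(task_id)
    return sorted(os.listdir(tmp3))


def demo():
    root = tempfile.mkdtemp()
    tmp1, tmp2, tmp3 = (os.path.join(root, d) for d in ("tmp1", "tmp2", "tmp3"))
    os.makedirs(tmp1)
    os.makedirs(tmp2)  # pre-created before the job is submitted
    finish_task(tmp1, tmp2, "1", 0, "a\n")
    finish_task(tmp1, tmp2, "1", 1, "a\n")  # speculative duplicate of task 1
    finish_task(tmp1, tmp2, "2", 0, "b\n")
    # A killed attempt leaves its temp file behind in tmp1 only.
    open(os.path.join(tmp1, "_tmp_3_0"), "w").close()
    return commit_job(tmp2, tmp3), os.listdir(tmp1)
```

The point of the sketch is that the leftover temp file from the killed attempt stays in tmp1 and never has to be listed or deleted in the final directory; only the duplicate check remains.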
