hive-dev mailing list archives

From "Siying Dong (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-2201) reduce name node calls in hive by creating temporary directories
Date Fri, 24 Jun 2011 18:11:47 GMT

    [ https://issues.apache.org/jira/browse/HIVE-2201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13054595#comment-13054595 ]

Siying Dong commented on HIVE-2201:
-----------------------------------

Yongqiang:
1. As I commented previously: "According to Hairong Kuang, Hadoop's behavior for creating a
new file is that it will automatically create its parent directory if it doesn't exist. In
that case, I removed the directory check and create part when writing to a new file."
2. I tested the code. I ran the whole regression test suite and tested several cases manually
in the cluster. I also tried killing some tasks manually.
3. I'll see whether there is another dependency so that I can remove the old one. Having
two overloaded calls is the convention we have in the file. All other similar calls have one
function taking a Path and one taking a String.
4. The tree traversal logic is copied from localizeMRTmpFilesImpl(). The first loop goes
through every operator tree. The second loop does a breadth-first search of each operator tree
to check for any FileSinkOperator.
5. OK. I'll make the change. My understanding is that only FileSinkOperator and the BlockMerge
file sink have the problem, and the second one is going to get some large changes from HIVE-2035.
Also, the BlockMerge file sink suffers from the problem less: it runs faster, so it has less
chance of leaving incomplete results.
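
To illustrate the two loops described in item 4, here is a minimal sketch. It is not the
code from the patch: the Op class, the findFileSinks method, and matching operators by name
are hypothetical stand-ins for Hive's real operator classes. The outer loop visits each
operator tree, and the inner loop does a breadth-first search within that tree. A visited
set is assumed to be useful in case operators are shared between branches.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Queue;
import java.util.Set;

public class FileSinkFinder {
    // Hypothetical stand-in for a Hive operator: a name plus child operators.
    public static final class Op {
        public final String name;
        public final List<Op> children = new ArrayList<>();

        public Op(String name) { this.name = name; }

        public Op add(Op child) { children.add(child); return this; }
    }

    // Returns every operator named "FileSinkOperator" reachable from the given roots.
    public static List<Op> findFileSinks(List<Op> roots) {
        List<Op> found = new ArrayList<>();
        // Identity-based visited set so a shared operator is examined only once.
        Set<Op> visited = Collections.newSetFromMap(new IdentityHashMap<>());
        // First loop: go through every operator tree.
        for (Op root : roots) {
            Queue<Op> queue = new ArrayDeque<>();
            queue.add(root);
            // Second loop: breadth-first search of this tree.
            while (!queue.isEmpty()) {
                Op op = queue.remove();
                if (!visited.add(op)) {
                    continue; // already examined via another path
                }
                if (op.name.equals("FileSinkOperator")) {
                    found.add(op);
                }
                queue.addAll(op.children);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        Op tree1 = new Op("TableScanOperator")
                .add(new Op("SelectOperator").add(new Op("FileSinkOperator")));
        Op tree2 = new Op("TableScanOperator").add(new Op("FilterOperator"));
        System.out.println(findFileSinks(List.of(tree1, tree2)).size());
    }
}
```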

> reduce name node calls in hive by creating temporary directories
> ----------------------------------------------------------------
>
>                 Key: HIVE-2201
>                 URL: https://issues.apache.org/jira/browse/HIVE-2201
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Namit Jain
>            Assignee: Siying Dong
>         Attachments: HIVE-2201.1.patch, HIVE-2201.2.patch, HIVE-2201.3.patch
>
>
> Currently, in Hive, when a file gets written by a FileSinkOperator,
> the sequence of operations is as follows:
> 1. In tmp directory tmp1, create a tmp file _tmp_1
> 2. At the end of the operator, move
> /tmp1/_tmp_1 to /tmp1/1
> 3. Move directory /tmp1 to /tmp2
> 4. For all files in /tmp2, remove all files starting with _tmp and
> duplicate files.
> Due to speculative execution, a lot of temporary files are created
> in /tmp1 (or /tmp2). This leads to a lot of name node calls,
> especially for large queries.
> The protocol above can be modified slightly:
> 1. In tmp directory tmp1, create a tmp file _tmp_1
> 2. At the end of the operator, move
> /tmp1/_tmp_1 to /tmp2/1
> 3. Move directory /tmp2 to /tmp3
> 4. For all files in /tmp3, remove all duplicate files.
> This should reduce the number of tmp files.
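
The modified protocol quoted above can be sketched on a local filesystem. This is only an
illustration under stated assumptions: it uses java.nio.file as a stand-in for HDFS rather
than the Hadoop FileSystem API, the class and method names are hypothetical, and the
"_tmp_2" file is an invented example of a temporary file left by a killed speculative
attempt. Step 4 (duplicate removal) is left out.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class TmpDirProtocolSketch {
    // Runs steps 1-3 of the modified protocol under 'base' and returns
    // the sorted names of the files that end up in tmp3.
    public static List<String> run(Path base) throws IOException {
        Path tmp1 = Files.createDirectories(base.resolve("tmp1"));
        Path tmp2 = Files.createDirectories(base.resolve("tmp2"));
        Path tmp3 = base.resolve("tmp3");

        // Step 1: the task writes its output to a temporary file in tmp1.
        Path tmpFile = Files.write(tmp1.resolve("_tmp_1"), "row data\n".getBytes());
        // Hypothetical leftover from a killed speculative attempt; it stays in tmp1.
        Files.write(tmp1.resolve("_tmp_2"), "partial\n".getBytes());

        // Step 2: at the end of the operator, move the finished file into tmp2
        // under its final name, so tmp2 only ever holds completed files.
        Files.move(tmpFile, tmp2.resolve("1"));

        // Step 3: move directory tmp2 to tmp3 (a rename on the same filesystem).
        Files.move(tmp2, tmp3);

        // tmp3 needs no "_tmp" cleanup because temporary files never reached tmp2.
        try (Stream<Path> files = Files.list(tmp3)) {
            return files.map(p -> p.getFileName().toString())
                        .sorted()
                        .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        Path base = Files.createTempDirectory("hive2201-sketch");
        System.out.println(run(base));
    }
}
```

In this sketch the leftover temporary file never leaves tmp1, which is why the final
directory only needs the duplicate check.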

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
