Mailing-List: contact issues-help@hive.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@hive.apache.org
Date: Mon, 22 May 2017 02:49:04 +0000 (UTC)
From: "yangfang (JIRA)" <jira@apache.org>
To: issues@hive.apache.org
Message-ID: <JIRA.13071925.1494839038000.259589.1495421344904@Atlassian.JIRA>
In-Reply-To: <JIRA.13071925.1494839038000@Atlassian.JIRA>
References: <JIRA.13071925.1494839038000@Atlassian.JIRA> <JIRA.13071925.1494839038375@jira-lw-us.apache.org>
Subject: [jira] [Commented] (HIVE-16666) Set hive.exec.stagingdir a relative
 directory or a sub directory of distination data directory will cause Hive
 to delete the intermediate query results
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
archived-at: Mon, 22 May 2017 02:49:10 -0000


    [ https://issues.apache.org/jira/browse/HIVE-16666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16019097#comment-16019097 ] 

yangfang commented on HIVE-16666:
---------------------------------

[~aihuaxu],[~pvary], thanks for your advice.
 In my opinion, the staging directory is just a temporary directory, users may not be concerned with where the directory is, they only care about the final result. For users, any staging directory name may be allowed, throw an exception may be a  little rough.
 Even if we add a validation against the configuration, for example suppose /tmp/hive/.hive-staging is a valide directory because it's a empty directory that no one has used, but now, someone may create table like this:
 create table test(a int, b string) location '/tmp'
Now the staging directory is a sub directory of  table data directory, this will still to delete the intermediate query results in execution.
 Looking forward to your comments.

> Set hive.exec.stagingdir a relative directory or a sub directory of distination data directory will cause Hive to delete the intermediate query results
> -------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-16666
>                 URL: https://issues.apache.org/jira/browse/HIVE-16666
>             Project: Hive
>          Issue Type: Bug
>          Components: Query Processor
>    Affects Versions: 3.0.0
>            Reporter: yangfang
>            Assignee: yangfang
>            Priority: Critical
>         Attachments: HIVE-16666.1.patch
>
>
> Set hive.exec.stagingdir=./*,  for example set hive.exec.stagingdir=./opq8.
> Then excute a query like this:
> insert overwrite table test2 select * from test3; 
> You will get the error like this:
> hive> set hive.exec.stagingdir=./opq8;
> hive> insert overwrite table test2 select * from test3;
> Query ID = mr_20170515134831_28ee392d-0d5a-4e47-b80c-dfcd31691b02
> Total jobs = 3
> Launching Job 1 out of 3
> Number of reduce tasks is set to 0 since there's no reduce operator
> Starting Job = job_1494818119523_0008, Tracking URL = http://zdh77:8088/proxy/application_1494818119523_0008/
> Kill Command = /opt/ZDH/parcels/lib/hadoop/bin/hadoop job  -kill job_1494818119523_0008
> Hadoop job information for Stage-1: number of mappers: 0; number of reducers: 0
> 2017-05-15 13:48:51,487 Stage-1 map = 0%,  reduce = 0%
> Ended Job = job_1494818119523_0008
> Stage-3 is selected by condition resolver.
> Stage-2 is filtered out by condition resolver.
> Stage-4 is filtered out by condition resolver.
> Moving data to directory hdfs://nameservice/hive/test2/opqt8_hive_2017-05-15_13-48-31_558_6151032330134038151-1/-ext-10000
> Loading data to table default.test2
> Moved: 'hdfs://nameservice/hive/test2/opqt8_hive_2017-05-15_13-48-31_558_6151032330134038151-1' to trash at: hdfs://nameservice/user/mr/.Trash/Current
> Failed with exception Unable to move source hdfs://nameservice/hive/test2/opqt8_hive_2017-05-15_13-48-31_558_6151032330134038151-1/-ext-10000 to destination hdfs://nameservice/hive/test2
> FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.MoveTask. Unable to move source hdfs://nameservice/hive/test2/opqt8_hive_2017-05-15_13-48-31_558_6151032330134038151-1/-ext-10000 to destination hdfs://nameservice/hive/test2
> MapReduce Jobs Launched: 
> Stage-Stage-1:  HDFS Read: 0 HDFS Write: 0 SUCCESS
> Total MapReduce CPU Time Spent: 0 msec
> hive>
> hive.exec.stagingdir=./opq8 is a relative path for destination write directory  /hive/test2.  Hive will create a temporary directory /hive/test2/opq8_hive* for intermediate query results.  Later in the move staging, Hive will delete or trash the sub directory under the /hive/test2 who's name does not begin with "_" or "."  in order to move data to this directory. You can see its processing logic in org.apache.hadoop.hive.ql.metadata.trashFilesUnderDir.
> My modification method is: if  stagingdir is a sub directory of the destination write directory. I add a "."   in front of stagingdir. now temporary directory will be /hive/test2/.opq8_hive* , because the sub directory .opq8_hive* starts with ".",  Hive will not delete it.
> hive> set hive.exec.stagingdir=./opq8;
> hive>  insert overwrite table test2 select * from test3;
> Query ID = mr_20170515143940_ae48a65e-42be-4f50-b974-b713ca902867
> Total jobs = 3
> Launching Job 1 out of 3
> Number of reduce tasks is set to 0 since there's no reduce operator
> Starting Job = job_1494818119523_0012, Tracking URL = http://zdh77:8088/proxy/application_1494818119523_0012/
> Kill Command = /opt/ZDH/parcels/lib/hadoop/bin/hadoop job  -kill job_1494818119523_0012
> Hadoop job information for Stage-1: number of mappers: 0; number of reducers: 0
> 2017-05-15 14:40:04,547 Stage-1 map = 0%,  reduce = 0%
> Ended Job = job_1494818119523_0012
> Stage-3 is selected by condition resolver.
> Stage-2 is filtered out by condition resolver.
> Stage-4 is filtered out by condition resolver.
> Moving data to directory hdfs://nameservice/hive/test2/.opqt8_hive_2017-05-15_14-39-40_751_1221840798987515724-1/-ext-10000
> Loading data to table default.test2
> MapReduce Jobs Launched: 
> Stage-Stage-1:  HDFS Read: 0 HDFS Write: 0 SUCCESS
> Total MapReduce CPU Time Spent: 0 msec
> OK
> Time taken: 26.751 seconds
> hive> 


--
This message was sent by Atlassian JIRA
(v6.3.15#6346)