Return-Path: X-Original-To: archive-asf-public-internal@cust-asf2.ponee.io Delivered-To: archive-asf-public-internal@cust-asf2.ponee.io Received: from cust-asf.ponee.io (cust-asf.ponee.io [163.172.22.183]) by cust-asf2.ponee.io (Postfix) with ESMTP id 7E01E200C7C for ; Mon, 22 May 2017 04:49:10 +0200 (CEST) Received: by cust-asf.ponee.io (Postfix) id 718A3160BCF; Mon, 22 May 2017 02:49:10 +0000 (UTC) Delivered-To: archive-asf-public@cust-asf.ponee.io Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by cust-asf.ponee.io (Postfix) with SMTP id 8C674160BC5 for ; Mon, 22 May 2017 04:49:09 +0200 (CEST) Received: (qmail 66367 invoked by uid 500); 22 May 2017 02:49:08 -0000 Mailing-List: contact issues-help@hive.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: dev@hive.apache.org Delivered-To: mailing list issues@hive.apache.org Received: (qmail 66358 invoked by uid 99); 22 May 2017 02:49:08 -0000 Received: from pnap-us-west-generic-nat.apache.org (HELO spamd2-us-west.apache.org) (209.188.14.142) by apache.org (qpsmtpd/0.29) with ESMTP; Mon, 22 May 2017 02:49:08 +0000 Received: from localhost (localhost [127.0.0.1]) by spamd2-us-west.apache.org (ASF Mail Server at spamd2-us-west.apache.org) with ESMTP id 4704F1A7B1A for ; Mon, 22 May 2017 02:49:08 +0000 (UTC) X-Virus-Scanned: Debian amavisd-new at spamd2-us-west.apache.org X-Spam-Flag: NO X-Spam-Score: -99.202 X-Spam-Level: X-Spam-Status: No, score=-99.202 tagged_above=-999 required=6.31 tests=[KAM_ASCII_DIVIDERS=0.8, RP_MATCHES_RCVD=-0.001, SPF_PASS=-0.001, USER_IN_WHITELIST=-100] autolearn=disabled Received: from mx1-lw-us.apache.org ([10.40.0.8]) by localhost (spamd2-us-west.apache.org [10.40.0.9]) (amavisd-new, port 10024) with ESMTP id 2hnpy0rlTXQI for ; Mon, 22 May 2017 02:49:06 +0000 (UTC) Received: from mailrelay1-us-west.apache.org (mailrelay1-us-west.apache.org [209.188.14.139]) by mx1-lw-us.apache.org (ASF Mail Server at mx1-lw-us.apache.org) with ESMTP id 7454E5FC85 for ; Mon, 22 May 2017 02:49:06 +0000 (UTC) Received: from jira-lw-us.apache.org (unknown [207.244.88.139]) by mailrelay1-us-west.apache.org (ASF Mail Server at mailrelay1-us-west.apache.org) with ESMTP id C6A1DE0D50 for ; Mon, 22 May 2017 02:49:05 +0000 (UTC) Received: from jira-lw-us.apache.org (localhost [127.0.0.1]) by jira-lw-us.apache.org (ASF Mail Server at jira-lw-us.apache.org) with ESMTP id DD26A21B57 for ; Mon, 22 May 2017 02:49:04 +0000 (UTC) Date: Mon, 22 May 2017 02:49:04 +0000 (UTC) From: "yangfang (JIRA)" To: issues@hive.apache.org Message-ID: In-Reply-To: References: Subject: [jira] [Commented] (HIVE-16666) Set hive.exec.stagingdir a relative directory or a sub directory of distination data directory will cause Hive to delete the intermediate query results MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394 archived-at: Mon, 22 May 2017 02:49:10 -0000 [ https://issues.apache.org/jira/browse/HIVE-16666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16019097#comment-16019097 ] yangfang commented on HIVE-16666: --------------------------------- [~aihuaxu],[~pvary], thanks for your advice. In my opinion, the staging directory is just a temporary directory, users may not be concerned with where the directory is, they only care about the final result. For users, any staging directory name may be allowed, throw an exception may be a little rough. Even if we add a validation against the configuration, for example suppose /tmp/hive/.hive-staging is a valide directory because it's a empty directory that no one has used, but now, someone may create table like this: create table test(a int, b string) location '/tmp' Now the staging directory is a sub directory of table data directory, this will still to delete the intermediate query results in execution. Looking forward to your comments. > Set hive.exec.stagingdir a relative directory or a sub directory of distination data directory will cause Hive to delete the intermediate query results > ------------------------------------------------------------------------------------------------------------------------------------------------------- > > Key: HIVE-16666 > URL: https://issues.apache.org/jira/browse/HIVE-16666 > Project: Hive > Issue Type: Bug > Components: Query Processor > Affects Versions: 3.0.0 > Reporter: yangfang > Assignee: yangfang > Priority: Critical > Attachments: HIVE-16666.1.patch > > > Set hive.exec.stagingdir=./*, for example set hive.exec.stagingdir=./opq8. > Then excute a query like this: > insert overwrite table test2 select * from test3; > You will get the error like this: > hive> set hive.exec.stagingdir=./opq8; > hive> insert overwrite table test2 select * from test3; > Query ID = mr_20170515134831_28ee392d-0d5a-4e47-b80c-dfcd31691b02 > Total jobs = 3 > Launching Job 1 out of 3 > Number of reduce tasks is set to 0 since there's no reduce operator > Starting Job = job_1494818119523_0008, Tracking URL = http://zdh77:8088/proxy/application_1494818119523_0008/ > Kill Command = /opt/ZDH/parcels/lib/hadoop/bin/hadoop job -kill job_1494818119523_0008 > Hadoop job information for Stage-1: number of mappers: 0; number of reducers: 0 > 2017-05-15 13:48:51,487 Stage-1 map = 0%, reduce = 0% > Ended Job = job_1494818119523_0008 > Stage-3 is selected by condition resolver. > Stage-2 is filtered out by condition resolver. > Stage-4 is filtered out by condition resolver. > Moving data to directory hdfs://nameservice/hive/test2/opqt8_hive_2017-05-15_13-48-31_558_6151032330134038151-1/-ext-10000 > Loading data to table default.test2 > Moved: 'hdfs://nameservice/hive/test2/opqt8_hive_2017-05-15_13-48-31_558_6151032330134038151-1' to trash at: hdfs://nameservice/user/mr/.Trash/Current > Failed with exception Unable to move source hdfs://nameservice/hive/test2/opqt8_hive_2017-05-15_13-48-31_558_6151032330134038151-1/-ext-10000 to destination hdfs://nameservice/hive/test2 > FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.MoveTask. Unable to move source hdfs://nameservice/hive/test2/opqt8_hive_2017-05-15_13-48-31_558_6151032330134038151-1/-ext-10000 to destination hdfs://nameservice/hive/test2 > MapReduce Jobs Launched: > Stage-Stage-1: HDFS Read: 0 HDFS Write: 0 SUCCESS > Total MapReduce CPU Time Spent: 0 msec > hive> > hive.exec.stagingdir=./opq8 is a relative path for destination write directory /hive/test2. Hive will create a temporary directory /hive/test2/opq8_hive* for intermediate query results. Later in the move staging, Hive will delete or trash the sub directory under the /hive/test2 who's name does not begin with "_" or "." in order to move data to this directory. You can see its processing logic in org.apache.hadoop.hive.ql.metadata.trashFilesUnderDir. > My modification method is: if stagingdir is a sub directory of the destination write directory. I add a "." in front of stagingdir. now temporary directory will be /hive/test2/.opq8_hive* , because the sub directory .opq8_hive* starts with ".", Hive will not delete it. > hive> set hive.exec.stagingdir=./opq8; > hive> insert overwrite table test2 select * from test3; > Query ID = mr_20170515143940_ae48a65e-42be-4f50-b974-b713ca902867 > Total jobs = 3 > Launching Job 1 out of 3 > Number of reduce tasks is set to 0 since there's no reduce operator > Starting Job = job_1494818119523_0012, Tracking URL = http://zdh77:8088/proxy/application_1494818119523_0012/ > Kill Command = /opt/ZDH/parcels/lib/hadoop/bin/hadoop job -kill job_1494818119523_0012 > Hadoop job information for Stage-1: number of mappers: 0; number of reducers: 0 > 2017-05-15 14:40:04,547 Stage-1 map = 0%, reduce = 0% > Ended Job = job_1494818119523_0012 > Stage-3 is selected by condition resolver. > Stage-2 is filtered out by condition resolver. > Stage-4 is filtered out by condition resolver. > Moving data to directory hdfs://nameservice/hive/test2/.opqt8_hive_2017-05-15_14-39-40_751_1221840798987515724-1/-ext-10000 > Loading data to table default.test2 > MapReduce Jobs Launched: > Stage-Stage-1: HDFS Read: 0 HDFS Write: 0 SUCCESS > Total MapReduce CPU Time Spent: 0 msec > OK > Time taken: 26.751 seconds > hive> -- This message was sent by Atlassian JIRA (v6.3.15#6346)