Date: Mon, 28 Apr 2014 21:41:15 +0000 (UTC)
From: "Selina Zhang (JIRA)"
To: hive-dev@hadoop.apache.org
Subject: [jira] [Commented] (HIVE-6980) Drop table by using direct sql

[ https://issues.apache.org/jira/browse/HIVE-6980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13983605#comment-13983605 ]

Selina Zhang commented on HIVE-6980:
------------------------------------

[~ashutoshc] Thanks! I have checked HIVE-6809, but that patch would not help the drop-table use case. The main problem HIVE-6809 solves is optimizing the deletion of partition data: instead of deleting each partition path one by one, it deletes the paths in bulk.
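The bulk-deletion idea can be sketched roughly as follows (a simplified illustration, not HIVE-6809's actual code; the helper names and path layout are assumptions): any partition path that lies under the table path is already covered by one recursive delete of the table directory, so only partitions stored elsewhere need their own delete call.

```python
# Simplified sketch of bulk path deletion (hypothetical; not the HIVE-6809 patch).
# Idea: skip any partition path already covered by an ancestor that will be
# deleted anyway, so many per-partition deletes collapse into one delete.

def is_subdirectory(parent: str, child: str) -> bool:
    """True if 'child' lies under 'parent' (prefix check on normalized paths)."""
    parent = parent.rstrip("/") + "/"
    return child.startswith(parent)

def paths_to_delete(table_path: str, partition_paths: list[str]) -> list[str]:
    """Return the minimal set of paths whose recursive deletion removes everything."""
    # Partitions under the table directory are covered by deleting the table path.
    extras = [p for p in partition_paths if not is_subdirectory(table_path, p)]
    return [table_path] + extras

parts = ["/warehouse/t/ds=1", "/warehouse/t/ds=2", "/external/t/ds=3"]
print(paths_to_delete("/warehouse/t", parts))
# prints ['/warehouse/t', '/external/t/ds=3']
```

Only the partition living outside the table directory gets its own delete; the two internal ones ride along with the table path.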
However, in the drop-table case this optimization is already in place (the method isSubdirectory() checks whether the partition path is a subdirectory of the table path). What makes dropping a table slow is how we drop the partitions:

1. On the client side, DDLTask asks the server for each Partition object in order to check whether the partition is protected. So if we have 100K partitions, 100K Partition objects are constructed and shipped from the server to the client (in batches, so no OOM here).
2. After the check, the client sends the drop-table request to the server, and HiveMetaStore begins the fun again: it retrieves each Partition object once more and checks its data path, then sends the partition names to the ObjectStore and asks for deletion (in batch mode).
3. The ObjectStore gets the list of partition names. With the fix of HIVE-6252, DataNucleus does the real work: it cleans the related metadata tables for each partition. I turned on the query log for MySQL; dropping a table with just 6 partitions issued 72 DELETEs, 32 SELECT COUNTs, and 17 SELECT DISTINCTs.


> Drop table by using direct sql
> ------------------------------
>
>                 Key: HIVE-6980
>                 URL: https://issues.apache.org/jira/browse/HIVE-6980
>             Project: Hive
>          Issue Type: Improvement
>          Components: Metastore
>    Affects Versions: 0.12.0
>            Reporter: Selina Zhang
>            Assignee: Selina Zhang
>
> Dropping a table which has lots of partitions is slow. Even after applying the patch of HIVE-6265, the drop table still takes hours (100K+ partitions).
> The fix comes in two parts:
> 1. Use direct SQL to query the partitions' protect mode.
>    The current implementation needs to transfer each Partition object to the client and check its protect mode. I'd like to move this logic into the metastore. The check will be done by direct SQL (if direct SQL is disabled, fall back to the same logic in the ObjectStore).
> 2. Use direct SQL to drop partitions for the table.
>    There may be two solutions here:
> 1.
> add "DELETE CASCADE" to the schema. In this way we only need to delete entries from the partitions table using direct SQL. May need to change datanucleus.deletionPolicy = DataNucleus.
> 2. Clean up the dependent tables by issuing DELETE statements ourselves. This also needs datanucleus.query.sql.allowAll to be turned on.
> Both of the above solutions should fix the problem. The DELETE CASCADE approach requires schema changes and upgrade scripts; the second solution adds maintenance cost if new tables are added in future releases.
> Please advise.

--
This message was sent by Atlassian JIRA
(v6.2#6252)
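To make solution 1 above concrete, here is a toy demonstration of ON DELETE CASCADE behavior (a sketch using SQLite; the table names are simplified stand-ins loosely modeled on the metastore's PARTITIONS/PARTITION_PARAMS, not the actual DataNucleus-managed schema):

```python
# Toy demonstration of "ON DELETE CASCADE" (solution 1), using SQLite.
# Table names are simplified stand-ins; the real Hive metastore schema is
# managed by DataNucleus and differs in detail.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only with this on

conn.execute("CREATE TABLE PARTITIONS (PART_ID INTEGER PRIMARY KEY, PART_NAME TEXT)")
conn.execute("""CREATE TABLE PARTITION_PARAMS (
    PART_ID INTEGER REFERENCES PARTITIONS(PART_ID) ON DELETE CASCADE,
    PARAM_KEY TEXT, PARAM_VALUE TEXT)""")

conn.execute("INSERT INTO PARTITIONS VALUES (1, 'ds=2014-04-28')")
conn.execute("INSERT INTO PARTITION_PARAMS VALUES (1, 'numFiles', '10')")

# One DELETE on the parent table; the dependent rows go away automatically,
# instead of a separate DELETE being issued against every dependent table.
conn.execute("DELETE FROM PARTITIONS WHERE PART_ID = 1")
remaining = conn.execute("SELECT COUNT(*) FROM PARTITION_PARAMS").fetchone()[0]
print(remaining)  # prints 0
```

With cascading in the schema, the bulk drop reduces to one direct-SQL DELETE against the partitions table, which is why the upgrade-script cost may be worth paying.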