hive-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Peter Vary (Jira)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-22702) ALTER TABLE REMOVE PARTITION is inefficient
Date Wed, 08 Jan 2020 08:51:00 GMT

    [ https://issues.apache.org/jira/browse/HIVE-22702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17010482#comment-17010482
] 

Peter Vary commented on HIVE-22702:
-----------------------------------

[~michaelchirico]: You did not mention the version of hive you are using, but I suspect that
it does not contain HIVE-6980, HIVE-19783 as we did some optimization in drop table in these
jiras. Since the fixes are not in any current releases, you might be able to give it a try
by building your own hive.

Thanks,
Peter

> ALTER TABLE REMOVE PARTITION is inefficient
> -------------------------------------------
>
>                 Key: HIVE-22702
>                 URL: https://issues.apache.org/jira/browse/HIVE-22702
>             Project: Hive
>          Issue Type: Improvement
>          Components: Database/Schema
>            Reporter: Michael Chirico
>            Priority: Major
>
> I recently realized the poor partitioning of a table of mine was becoming a major bottleneck
and endeavored to reset the partitioning.
> At this point, the table had about 56K partitions (year|month|day|city) combinations;
moving to the more efficient year|month partitions means there's about 24.
> In the process, I was having trouble fixing the registration of the table because of
the size of its partition DB; I happened upon this SO Q&A which addresses the same issue:
> https://stackoverflow.com/questions/50715939/drop-table-in-hive-via-spark-hangs/50814566#comment105440563_50814566
> I set about batching through ALTER TABLE x DROP PARTITION (...), PARTITION (...) 200
at a time; it would run for about 2 hours to accomplish this, which strikes me as being quite
inefficient.
> (apologies that I haven't done a fully proper analysis of the scaling efficiency in this
ticket)
> If I were designing it from scratch, I would:
> * Keep the database of existing partitions sorted
> * Sort the incoming partitions to remove
> * Iterate via "shrinking binary search" (each partition is searched with binary search,
and we can eliminate from the existing DB anything "less than" the current index when moving
to the next iteration)
> Is there something preventing this from being achieved?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Mime
View raw message