hive-issues mailing list archives

From "Christian Dietze (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-10891) Limited fetch on partitioned table can eat up all heap
Date Tue, 02 Jun 2015 15:38:29 GMT

    [ https://issues.apache.org/jira/browse/HIVE-10891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14569286#comment-14569286
] 

Christian Dietze commented on HIVE-10891:
-----------------------------------------

It seems that the [SimpleFetchOptimizer|https://github.com/apache/hive/blob/branch-1.1/ql/src/java/org/apache/hadoop/hive/ql/optimizer/SimpleFetchOptimizer.java]
acts a little too aggressively here. From my understanding of the code, there is a check whether
the filter only affects columns that are partition keys. If so, the threshold check
is bypassed (see [line 147 of SimpleFetchOptimizer|https://github.com/apache/hive/blob/branch-1.1/ql/src/java/org/apache/hadoop/hive/ql/optimizer/SimpleFetchOptimizer.java#L147]).
In the query from the issue description we filter on a different column; nevertheless, the filter is bypassed due
to [these lines|https://github.com/apache/hive/blob/branch-1.1/ql/src/java/org/apache/hadoop/hive/ql/optimizer/SimpleFetchOptimizer.java#L200]:
{code:java}
if (PartitionPruner.onlyContainsPartnCols(table, pruner)) {
    bypassFilter = !pctx.getPrunedPartitions(alias, ts).hasUnknownPartitions();
}
{code}

A workaround seems to be to put the optimizer on a leash by setting:

{code:xml}
<property>
    <name>hive.fetch.task.conversion</name>
    <value>minimal</value>
</property>
{code}
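
The same property can presumably also be set per session rather than globally. A minimal sketch, assuming the standard {{set}} command of the Hive CLI or Beeline and reusing the placeholder table and column names from the issue description:

{code:sql}
-- Restrict fetch-task conversion for the current session only
-- (assumption: run from the Hive CLI or Beeline).
set hive.fetch.task.conversion=minimal;

-- Query from the issue description. With "minimal", a filter on a
-- non-partition column should no longer qualify for the in-memory fetch.
select *
from partitioned_table
where not_the_partition_key_column = "xyz"
limit 100;
{code}

I have not verified this against 1.1.0, so checking the plan with {{explain}} before relying on it seems advisable.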

> Limited fetch on partitioned table can eat up all heap
> ------------------------------------------------------
>
>                 Key: HIVE-10891
>                 URL: https://issues.apache.org/jira/browse/HIVE-10891
>             Project: Hive
>          Issue Type: Bug
>          Components: Physical Optimizer
>    Affects Versions: 1.1.0
>            Reporter: Christoph Lipka
>
> When doing a query like 
> {code}
> select *
> from partitioned_table
> where not_the_partition_key_column = "xyz"
> limit 100
> {code}
> it is executed in memory. For all tables except the smallest this behavior quickly consumes
> the complete heap and crashes the server.
> If the limit clause is omitted, a mr-job is started and the query is executed without
> memory issues. One can also work around this problem by extending the query to also select
> the partition_key like 
> {code}
> select *
> from partitioned_table a
> where a.not_the_partition_key_column = "xyz"
> and a.partition_key_column = (select b.partition_key_column from partitioned_table b)
> limit 100
> {code}
> In this case hive also creates a mr-job.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
