spark-issues mailing list archives

From "Alex Ivanov (JIRA)" <j...@apache.org>
Subject [jira] [Created] (SPARK-25925) Spark 2.3.1 retrieves all partitions from Hive Metastore by default
Date Fri, 02 Nov 2018 05:09:00 GMT
Alex Ivanov created SPARK-25925:
-----------------------------------

             Summary: Spark 2.3.1 retrieves all partitions from Hive Metastore by default
                 Key: SPARK-25925
                 URL: https://issues.apache.org/jira/browse/SPARK-25925
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.3.1
            Reporter: Alex Ivanov


Spark 2.3.1 comes with the following _spark-defaults.conf_ parameters by default:
{code:java}
spark.sql.hive.convertMetastoreParquet true
spark.sql.hive.metastorePartitionPruning true
spark.sql.hive.caseSensitiveInferenceMode INFER_AND_SAVE{code}
While the first two properties are fine, the last one has an unfortunate side effect. It is
set to INFER_AND_SAVE for a reason, namely https://issues.apache.org/jira/browse/SPARK-19611, but
that setting also causes the problem described below.

The problem is at this point:
[https://github.com/apache/spark/blob/a2f502cf53b6b00af7cb80b6f38e64cf46367595/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala#L232]

Schema inference causes all of the table's partitions to be retrieved from the Hive Metastore.
As a result, even running *explain* on a simple query against a table with thousands of
partitions appears to hang, and the cause is very difficult to debug.

Moreover, many people will address the issue by changing:
{code:java}
spark.sql.hive.convertMetastoreParquet false{code}
see that it works, and call it a day, thereby forgoing the benefits of using Parquet support
in Spark directly. In our experience, this causes significant slow-downs on at least some
queries.

This Jira is mostly to document the issue, even if it cannot be addressed, so that people
who inevitably run into this behavior can see the resolution, which is changing
spark.sql.hive.caseSensitiveInferenceMode to *NEVER_INFER*, provided there are no issues with
Parquet-Hive schema compatibility, i.e. all column names in the schema are lower-case.
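
As a minimal sketch of the resolution described above, the configuration could look like the following. It assumes the settings live in _spark-defaults.conf_, as in the snippet at the top of this issue, and that the table schema has no mixed-case column names. The first two properties are left unchanged so that Spark's native Parquet support stays enabled.

```properties
# Keep Spark's native Parquet reader and metastore-side partition pruning.
spark.sql.hive.convertMetastoreParquet true
spark.sql.hive.metastorePartitionPruning true
# Skip schema inference so that planning does not list every partition.
# Assumption: all column names are lower-case, so no case-sensitive inference is needed.
spark.sql.hive.caseSensitiveInferenceMode NEVER_INFER
```

The same value can presumably be supplied per job instead, for example with --conf spark.sql.hive.caseSensitiveInferenceMode=NEVER_INFER on spark-submit, which avoids changing the cluster-wide defaults.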




