spark-issues mailing list archives

From "Saif Addin (JIRA)" <>
Subject [jira] [Commented] (SPARK-21198) SparkSession catalog is terribly slow
Date Mon, 26 Jun 2017 02:36:01 GMT


Saif Addin commented on SPARK-21198:

I'll take a look again on Monday, but I am positive that the catalog retrieval is causing the
delay. I'll also share a piece of code showing how I do it, just in case. For now I am using
spark.sqlContext.tableNames again.
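The timing being discussed can be sketched with a small helper; this is a minimal illustration, not the reporter's actual code. The `timed` helper is plain Scala; the commented-out catalog call assumes a live `SparkSession` named `spark` and a hypothetical database name `db`.

```scala
// Minimal timing sketch. The helper itself needs no Spark; the catalog call
// shown in the comment below is the kind of operation being measured.
object CatalogTiming {
  // Run `body` once and return its result together with the elapsed milliseconds.
  def timed[T](body: => T): (T, Long) = {
    val start = System.nanoTime()
    val result = body
    (result, (System.nanoTime() - start) / 1000000L)
  }
}

// With a live SparkSession (hypothetical database name `db`), one would wrap
// the slow call like this:
//   val (temps, ms) = CatalogTiming.timed {
//     spark.catalog.listTables(db).filter(_.isTemporary).collect()
//   }
//   println(s"listTables took $ms ms")
```

The same wrapper around `spark.sqlContext.tableNames(db)` gives a direct comparison of the two code paths.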

> SparkSession catalog is terribly slow
> -------------------------------------
>                 Key: SPARK-21198
>                 URL:
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.1.0
>            Reporter: Saif Addin
> We have a considerably large Hive metastore and a Spark program that checks Hive
> data availability.
> In Spark 1.x, we were using sqlContext.tableNames, sqlContext.sql() and sqlContext.isCached()
> to go through Hive metastore information.
> Once migrated to Spark 2.x we switched over to SparkSession.catalog instead, but it turns
> out that both listDatabases() and listTables() take between 5 and 20 minutes, depending on
> the database, to return results, using operations such as the following one:
> spark.catalog.listTables(db).filter(_.isTemporary).map(
> and this made the program unbearably slow to return a list of tables.
> I know we still have spark.sqlContext.tableNames as a workaround, but I am assuming this
> is going to be deprecated anytime soon?

This message was sent by Atlassian JIRA

