spark-issues mailing list archives

From "Saif Addin (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (SPARK-21198) SparkSession catalog is terribly slow
Date Sat, 24 Jun 2017 17:27:01 GMT

    [ https://issues.apache.org/jira/browse/SPARK-21198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16062070#comment-16062070
] 

Saif Addin edited comment on SPARK-21198 at 6/24/17 5:26 PM:
-------------------------------------------------------------

Thanks [~viirya]
My program lists the available tables on a webpage when users click a dropdown of databases.
Tables also show their schema and metadata. At the beginning of the program, I go through
all tables to extract the necessary information, so I necessarily have to enumerate every
table at least once. When I migrated over to the catalog API, I thought my program got stuck,
but it was just taking too long (20 to 30 minutes).

Each time people click the dropdown, I re-request the table list to keep it up to date.
Since the database list and each table's schema take too long to request dynamically, I cache
them as people use them. But I would love for this process (schema, isCached, isTemporary)
to take less time.

In case you are taking other suggestions: since temp views always appear in the listTables
output, I have to apply some manual logic to filter them out of the requested table list.

Also, isCached comes from SparkSession, not from the same place where the catalog information
is requested.

Our number of tables is not insane (about 20 databases with at most 200 tables per database,
and some databases with only a handful of tables).

Best
Saif
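The manual separation described above can be sketched as follows, assuming a live SparkSession named {{spark}} and a database named {{default}} (both placeholders). {{listTables}} returns temp views alongside persistent tables, so the caller has to partition them by the {{isTemporary}} flag:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalog.Table

// Minimal sketch, assuming a running Spark 2.x session is available.
val spark = SparkSession.builder().appName("catalog-demo").getOrCreate()

// listTables(db) returns a Dataset[Table] containing both persistent
// tables and temp views registered in the session.
val allTables: Seq[Table] = spark.catalog.listTables("default").collect().toSeq

// Temp views carry isTemporary = true; split them out manually.
val (tempViews, persistentTables) = allTables.partition(_.isTemporary)
```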



> SparkSession catalog is terribly slow
> -------------------------------------
>
>                 Key: SPARK-21198
>                 URL: https://issues.apache.org/jira/browse/SPARK-21198
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.1.0
>            Reporter: Saif Addin
>
> We have a considerably large Hive metastore and a Spark program that goes through Hive
data availability.
> In Spark 1.x, we were using sqlContext.tableNames, sqlContext.sql() and sqlContext.isCached()
to go through the Hive metastore information.
> Once migrated to Spark 2.x we switched over to SparkSession.catalog instead, but it turns
out that both listDatabases() and listTables() take between 5 and 20 minutes, depending on
the database, to return results, using operations such as the following:
> spark.catalog.listTables(db).filter(_.isTemporary).map(_.name).collect
> This made the program unbearably slow to return a list of tables.
> I know we still have spark.sqlContext.tableNames as a workaround, but I am assuming this
is going to be deprecated sometime soon?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org

