spark-issues mailing list archives

From "Apache Spark (JIRA)" <j...@apache.org>
Subject [jira] [Assigned] (SPARK-8690) Add a setting to disable SparkSQL parquet schema merge by using datasource API
Date Sun, 28 Jun 2015 10:47:05 GMT

     [ https://issues.apache.org/jira/browse/SPARK-8690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-8690:
-----------------------------------

    Assignee: Apache Spark

> Add a setting to disable SparkSQL parquet schema merge by using datasource API 
> -------------------------------------------------------------------------------
>
>                 Key: SPARK-8690
>                 URL: https://issues.apache.org/jira/browse/SPARK-8690
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 1.4.0
>         Environment: all
>            Reporter: thegiive
>            Assignee: Apache Spark
>            Priority: Minor
>
> We need a general config setting to disable the Parquet schema merge feature.
> Our Spark SQL application has the following requirements:
> # In Spark 1.1 and 1.2, reading our Parquet data with Spark SQL takes around 1~5 sec, and we don't want the read time to increase much. We have around 2000 Parquet files, all with the same schema, so we don't need the schema merge feature.
> # We need data source API features such as partition discovery, so we cannot use Spark 1.2 or a previous version.
> # We have a lot of Spark SQL products that use *sqlContext.parquetFile(filename)* to read Parquet files. We don't want to change the application code, so a single setting that disables this feature is what we want.
> In 1.4, we have several options, but none of them fully matches our use case:
> # Set spark.sql.parquet.useDataSourceApi to false. This meets requirements 1 and 3, but it uses the old Parquet API and fails requirement 2.
> # Use sqlContext.load("parquet", Map("path" -> "...", "mergeSchema" -> "false")). This meets requirements 1 and 2, but we would need to change a lot of our Parquet loading code.
> # Use the default Parquet data source in Spark 1.4. Schema merging is much improved over 1.3, but the load time still increases from 1~5 sec to 100 sec, which fails requirement 1.
> # Try the config from PR 5231. But it cannot disable schema merge.
> I think it is better to have a config that disables the data source API's schema merge feature. A PR will be provided later.
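
For reference, the first two options listed in the description might look roughly like this in application code. This is a minimal sketch assuming the Spark 1.4 SQLContext API on the classpath; the object name ParquetLoadSketch, its helper names, and the path argument are hypothetical and not part of the issue.

```scala
import org.apache.spark.sql.{DataFrame, SQLContext}

// Hypothetical helper object, for illustration only.
object ParquetLoadSketch {

  // Options for the data source API load with schema merging turned off.
  def parquetOptions(path: String): Map[String, String] =
    Map("path" -> path, "mergeSchema" -> "false")

  // Option 1: switch off the data source API globally and keep parquetFile.
  // Application code stays the same, but partition discovery is lost.
  def loadWithOldParquetApi(sqlContext: SQLContext, path: String): DataFrame = {
    sqlContext.setConf("spark.sql.parquet.useDataSourceApi", "false")
    sqlContext.parquetFile(path)
  }

  // Option 2: keep the data source API and disable schema merging per call.
  // Every call site that used parquetFile has to be rewritten like this.
  def loadWithoutSchemaMerge(sqlContext: SQLContext, path: String): DataFrame =
    sqlContext.load("parquet", parquetOptions(path))
}
```

Neither variant satisfies all three requirements at once, which is the gap the requested setting is meant to close.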



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org

