spark-reviews mailing list archives

From liancheng <...@git.apache.org>
Subject [GitHub] spark pull request: [SPARK-5182] [SPARK-5528] [SQL] WIP: Parquet d...
Date Mon, 02 Feb 2015 12:18:21 GMT
GitHub user liancheng opened a pull request:

    https://github.com/apache/spark/pull/4308

    [SPARK-5182] [SPARK-5528] [SQL] WIP: Parquet data source improvements

    This PR adds three major improvements to Parquet data source:
    
    1.  Partition discovery
    
        When reading Parquet files that reside in Hive-style partition directories, `ParquetRelation2`
automatically discovers partitioning information and infers partition column types.
    
        This is also partial work toward [SPARK-5182] [1], which aims to provide first-class
partitioning support for the data source API.  Related code in this PR can be easily extracted
to the data source API level in future versions.
    
    1.  Schema merging
    
        When enabled, the Parquet data source collects schema information from all Parquet part-files
and tries to merge them.  Exceptions are thrown when incompatible schemas are detected.  This
feature is controlled by the data source option `parquet.mergeSchema` and is enabled by default.
    
    1.  Metastore Parquet table conversion moved to analysis phase
    
        This greatly simplifies the conversion logic.  `ParquetConversion` strategy can be
removed once the old Parquet implementation is removed in the future.
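
    The partition discovery described in item 1 can be illustrated with a small standalone sketch. This is not the code in `ParquetRelation2`; the function names `infer_value` and `parse_partition_dir` are hypothetical, and the inference rule (try integer, then floating point, else keep the string) is an assumption made only for illustration.

```python
def infer_value(raw):
    """Guess a typed value for a partition directory value.

    Assumed rule for this sketch: try int, then float, otherwise keep the string.
    Note that Python's float() also accepts strings such as 'nan' or 'inf'.
    """
    for cast in (int, float):
        try:
            return cast(raw)
        except ValueError:
            pass
    return raw


def parse_partition_dir(path):
    """Collect column=value pairs from a Hive-style directory path."""
    partitions = {}
    for segment in path.split("/"):
        if "=" in segment:
            # Split on the first '=' only, so values may themselves contain '='.
            name, _, raw = segment.partition("=")
            partitions[name] = infer_value(raw)
    return partitions
```

    With these definitions, `parse_partition_dir("/data/table/year=2015/country=US")` returns `{"year": 2015, "country": "US"}`.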
    
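    The schema merging described in item 2 can be sketched in the same spirit. Schemas are modelled here as flat name-to-type dictionaries, which is a simplification: real Parquet schemas are nested, and the actual compatibility rules used by this PR are not spelled out in this message. The function name `merge_schemas` is hypothetical.

```python
def merge_schemas(schemas):
    """Merge flat {column: type} schemas, failing on conflicting types."""
    merged = {}
    for schema in schemas:
        for name, dtype in schema.items():
            if name in merged and merged[name] != dtype:
                # Mirrors the "exceptions are thrown" behaviour described above.
                raise ValueError(
                    "incompatible types for column %r: %s vs %s"
                    % (name, merged[name], dtype)
                )
            merged[name] = dtype
    return merged
```

    For example, merging `{"id": "int", "name": "string"}` with `{"id": "int", "score": "double"}` yields a schema with all three columns, while merging `{"id": "int"}` with `{"id": "string"}` raises `ValueError`.
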
    This version of Parquet data source aims to entirely replace the old Parquet implementation.
 However, the old version hasn't been removed yet.  Users can fall back to the old version
by turning off SQL configuration `spark.sql.parquet.useDataSourceApi`.
    
    Other JIRA tickets fixed as side effects in this PR:
    
    - [SPARK-5509] [3]: `EqualTo` now uses a proper `Ordering` to compare binary types.
    
    - [SPARK-3575] [4]: Metastore schema is now preserved and passed to `ParquetRelation2`
via data source option `parquet.metastoreSchema`.
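
    The idea behind the [SPARK-5509] fix, deciding equality of binary values through an ordering rather than through object identity, can be sketched as follows. On the JVM, `==` on arrays compares references, which is why a content-based comparison is needed there. The helper names `compare_binary` and `binary_equal` are hypothetical, and byte values are compared as plain integers; whether Spark treats them as signed or unsigned is not stated in this message.

```python
def compare_binary(left, right):
    """Lexicographic comparison of two byte sequences: returns -1, 0, or 1."""
    for a, b in zip(left, right):
        if a != b:
            return -1 if a < b else 1
    # All shared positions match, so the shorter sequence sorts first.
    return (len(left) > len(right)) - (len(left) < len(right))


def binary_equal(left, right):
    """Two binary values are equal exactly when the ordering says so."""
    return compare_binary(left, right) == 0
```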
    
    TODO:
    
    - [ ] More test cases for partition discovery
    - [ ] Fix write path after data source write support (#4294) is merged
    
          It turned out to be non-trivial to fall back to the old Parquet implementation on the
write path when the Parquet data source is enabled.  Since we're planning to include data source
write support in 1.3.0, I have simply ignored two test cases involving Parquet insertion for now.
    
    - [ ] Fix outdated comments and documentation
    
    PS: More than half of the changed lines in this PR are trivial changes to test cases. To
test Parquet with and without the new data source, almost all Parquet test cases were moved
into wrapper driver functions. This introduces hundreds of lines of changes.
    
    [1]: https://issues.apache.org/jira/browse/SPARK-5182
    [2]: https://issues.apache.org/jira/browse/SPARK-5528
    [3]: https://issues.apache.org/jira/browse/SPARK-5509
    [4]: https://issues.apache.org/jira/browse/SPARK-3575


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/liancheng/spark parquet-partition-discovery

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/4308.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #4308
    
----
commit c0f220f76b15eafecfa14cc7021a2472384c8d14
Author: Cheng Lian <lian@databricks.com>
Date:   2015-01-30T01:41:18Z

    Draft version of Parquet partition discovery and schema merging

commit 5c405a8f3f8c511454e268379ec2348bdcb8902e
Author: Cheng Lian <lian@databricks.com>
Date:   2015-02-01T00:23:27Z

    Fixes all existing Parquet test suites except for ParquetMetastoreSuite

commit 5a5e18ed2e213904525375643ef7a2e1e34a590e
Author: Cheng Lian <lian@databricks.com>
Date:   2015-02-02T04:34:09Z

    Fixes Metastore Parquet table conversion

commit af3683ea68d3efe7c0368cb8d23fdd661fbfeffc
Author: Cheng Lian <lian@databricks.com>
Date:   2015-02-02T11:30:01Z

    Uses switch to control whether use Parquet data source or not

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

